AWS re:Invent 2025 - Own Your AI – Blazing Fast OSS AI on AWS (STP104)
Source: Dev.to
Overview
Fireworks AI presents its open‑source inference and customization platform for building production agents across industries. The speaker outlines challenges in agent development—model selection, latency, quality, and cost—and demonstrates how Fireworks addresses them with FireOptimizer technology (84 000 deployment parameters, custom CUDA kernels, fine‑tuning). Key features include day‑one access to models such as DeepSeek and Llama, speculative decoding achieving 70 %+ acceptance rates for sub‑100 ms latency, and reinforcement fine‑tuning showing 20 % quality improvements. The platform runs on AWS infrastructure and supports deployment options from SaaS to air‑gapped environments, serving clients like Notion (100 M+ users, 4× lower latency) and DoorDash (3× faster VLM processing, 10 % cost reduction).
This article is auto‑generated while preserving the original presentation content; minor typos or inaccuracies may be present.
The Challenge of Building Production‑Ready AI Agents
Building agents that work reliably in production involves many hidden difficulties:
- Model choice – closed‑source vs. open‑source; small (5‑8 B) vs. large (trillions of parameters) models.
- Latency – search‑oriented agents often need sub‑second responses; sustaining ≈ 300 ms LLM latency for millions of users is hard.
- Quality – accuracy must meet or exceed closed‑source baselines.
- Cost – uncontrolled usage can cause expenses to balloon quickly.
- Infrastructure complexity – deciding between EKS, ECS, multi‑node GPU clusters, etc.
- Data privacy, compliance, security, and availability – especially for regulated industries.
- ML expertise – many organizations lack in‑house talent to manage the stack.
These pain points lead to error‑prone deployments and high operational overhead. Fireworks AI positions itself as a one‑stop platform that abstracts away these complexities, enabling developers to create “magical AI experiences” without wrestling with the underlying stack.
Fireworks Platform: Open‑Source Inference, Workload Optimization, and Fine‑Tuning
Open‑Source Model Access
Fireworks provides day‑one access to leading open‑source models—including DeepSeek, Llama, Mistral, Qwen, vision‑language, voice, and ASR models—through an OpenAI‑compatible API, so the standard OpenAI SDK works as a drop‑in client. Switching from OpenAI to Fireworks typically requires only two code changes: the model name and the API key.
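As a minimal sketch of that switch (assuming the OpenAI Python SDK; the base URL and model ID below are illustrative, so check the Fireworks docs for the exact values available to your account):

```python
from openai import OpenAI

# Point the standard OpenAI client at an OpenAI-compatible Fireworks endpoint.
# The base URL and model ID are illustrative placeholders.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # was e.g. "gpt-4o"
    messages=[{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
)
print(response.choices[0].message.content)
```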
Workload Optimization (FireOptimizer)
For high‑throughput use cases (e.g., search), FireOptimizer tailors deployment configurations to meet latency targets (≈ 300 ms) while scaling to millions of requests. It supports:
- Speculative decoding for faster token generation (see the sketch after this list).
- Custom CUDA kernels that reduce inference overhead.
- Parameter‑level tuning across 84 000 deployment knobs.
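To make the speculative‑decoding idea concrete, here is a toy Python sketch of the draft‑and‑verify loop; `draft_model`, `target_model`, and the tiny vocabulary are stand‑ins for illustration, not Fireworks APIs:

```python
import random

VOCAB = list(range(8))  # tiny toy vocabulary

def _toy_dist(ctx, salt):
    """Deterministic pseudo-random next-token distribution for a context."""
    rng = random.Random(hash((tuple(ctx), salt)))
    w = [rng.random() + 0.05 for _ in VOCAB]  # +0.05 keeps probabilities nonzero
    s = sum(w)
    return [x / s for x in w]

def draft_model(ctx):
    return _toy_dist(ctx, "draft")   # cheap, small model

def target_model(ctx):
    return _toy_dist(ctx, "target")  # expensive, high-quality model

def speculative_step(ctx, k=4):
    """Draft up to k tokens cheaply, verifying each against the target model.

    A drafted token is accepted with probability min(1, p_target / p_draft);
    on the first rejection we resample from the residual distribution, which
    keeps the output distribution identical to sampling the target directly.
    """
    out = []
    for _ in range(k):
        p_d = draft_model(ctx + out)
        p_t = target_model(ctx + out)
        tok = random.choices(VOCAB, weights=p_d)[0]
        if random.random() <= min(1.0, p_t[tok] / p_d[tok]):
            out.append(tok)  # accepted: the target agrees closely enough
        else:
            # Rejected: resample from max(0, p_t - p_d), then stop drafting.
            resid = [max(0.0, t - d) for t, d in zip(p_t, p_d)]
            weights = resid if sum(resid) > 0 else p_t
            out.append(random.choices(VOCAB, weights=weights)[0])
            break
    return out

print(speculative_step([1, 2, 3]))
```

The higher the draft model's acceptance rate (the talk cites 70 %+ in production), the more tokens each expensive target pass yields, which is what pushes latency toward the sub‑100 ms range mentioned above.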
Fine‑Tuning for Domain Specificity
Generic closed‑source models are “one inch deep and miles wide.” Fireworks enables customers to fine‑tune models on proprietary data, eliminating failure modes that arise in niche applications. Reinforcement fine‑tuning has demonstrated up to 20 % quality gains over baseline models.
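As a rough illustration of the reinforcement fine‑tuning idea (a conceptual sketch, not Fireworks' actual training stack), the loop below applies a REINFORCE‑style policy‑gradient update to a toy categorical policy, shifting probability mass toward answers a reward function prefers:

```python
import math
import random

# Toy "policy": logits over four candidate answers to one prompt.
logits = [0.0, 0.0, 0.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def reward(answer):
    # Stand-in for a task-specific grader (exact match, rubric, etc.)
    # run over proprietary data; here answer 2 is simply "correct".
    return 1.0 if answer == 2 else 0.0

LR = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = random.choices(range(4), weights=probs)[0]  # sample an answer
    # Advantage = reward minus the expected reward under the current policy.
    adv = reward(a) - sum(p * reward(i) for i, p in enumerate(probs))
    # REINFORCE update: grad of log p(a) w.r.t. logit_i is 1{i==a} - p_i.
    for i in range(4):
        logits[i] += LR * adv * ((1.0 if i == a else 0.0) - probs[i])

print([round(p, 3) for p in softmax(logits)])  # mass shifts toward answer 2
```

In a real setting the reward would come from a domain‑specific grader over proprietary data, and the policy would be the model's token distribution rather than a four‑way choice.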
Deployment Flexibility
The platform runs on AWS, offering:
- SaaS deployments for rapid onboarding.
- Air‑gapped environments for highly regulated workloads.
- Integration with EKS/ECS clusters and GPU‑rich instances.
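From the client side, that flexibility can be as simple as a configuration switch; a minimal sketch, assuming hypothetical endpoint URLs for the hosted and self‑hosted cases:

```python
import os
from openai import OpenAI

# Both URLs are placeholders: "saas" mirrors the hosted example earlier,
# while "self_hosted" stands in for an in-VPC, EKS, or air-gapped install.
ENDPOINTS = {
    "saas": "https://api.fireworks.ai/inference/v1",
    "self_hosted": "http://fireworks.internal.example:8000/v1",
}

# Same client code for every deployment target; only configuration changes.
client = OpenAI(
    base_url=ENDPOINTS[os.environ.get("FIREWORKS_TARGET", "saas")],
    api_key=os.environ.get("FIREWORKS_API_KEY", "unused-for-airgapped"),
)
```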