Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell

Published: 3 days ago (February 12, 2026 at 11:00 AM EST)

7 min read

Source: NVIDIA AI Blog

Why Token Cost Matters

AI‑powered interactions—from diagnostic insights in healthcare to character dialogue in games and autonomous customer‑service agents—are built on the same unit of intelligence: a token.

Scaling AI interactions forces businesses to ask: Can we afford more tokens?
The answer lies in better tokenomics—the practice of reducing the cost of each token.

Recent MIT research shows that advances in infrastructure and algorithms are cutting inference costs for frontier‑level performance by up to 10× annually.

Infrastructure Efficiency = Better Tokenomics

Think of a high‑speed printing press:

If the press can produce 10× more pages with only a modest increase in ink, energy, and machine cost, the cost per page drops dramatically.

Similarly, investments in AI infrastructure boost token output far more than they raise total cost, leading to a meaningful reduction in cost per token.

Token output outpaces infrastructure cost, causing the cost of each token to drop.

Who’s Leading the Charge?

Leading inference providers are already leveraging the NVIDIA Blackwell platform to slash token costs:

Baseten
DeepInfra
Fireworks AI
Together AI

These providers:

Host advanced open‑source models that have reached frontier‑level intelligence.
Combine open‑source intelligence with the extreme hardware‑software codesign of NVIDIA Blackwell.
Deploy their own optimized inference stacks.

The result? Up to 10× lower cost per token compared with the previous NVIDIA Hopper platform, unlocking dramatic savings for businesses across every industry.

Takeaway

Improving tokenomics isn’t just a technical tweak—it’s a strategic lever that lets companies scale AI interactions affordably. By adopting cutting‑edge infrastructure like NVIDIA Blackwell, organizations can deliver richer, more frequent AI experiences while keeping costs in check.

Healthcare – Baseten and Sully.ai

Cutting AI Inference Costs by 10×

In healthcare, tedious, time‑consuming tasks such as medical coding, documentation, and insurance‑form management eat into the time physicians can spend with patients.

Sully.ai tackles this problem by creating “AI employees” that automate routine tasks like medical coding and note‑taking. As the platform grew, its proprietary, closed‑source models introduced three major bottlenecks:

Bottleneck	Impact
Unpredictable latency	Slowed real‑time clinical workflows
Rising inference costs	Costs grew faster than revenue
Limited model control	Inability to fine‑tune quality or updates

Sully.ai builds AI employees that handle routine tasks for physicians

Solution

Sully.ai switched to Baseten’s Model API, which deploys open‑source models (e.g., gpt‑oss‑120b) on NVIDIA Blackwell GPUs. The stack includes:

NVFP4 low‑precision data format for efficient inference
NVIDIA TensorRT‑LLM library for optimized execution
NVIDIA Dynamo inference framework for streamlined deployment

Baseten selected Blackwell GPUs after observing up to 2.5× better throughput per dollar compared with the previous Hopper‑based setup.

Results

Metric	Improvement
Inference cost	↓ 90 % (≈ 10× reduction)
Response time	↑ 65 % faster for critical workflows (e.g., medical‑note generation)
Physician time saved	> 30 million minutes returned

Links

Gaming — DeepInfra and Latitude Reduce Cost per Token by 4×

Latitude is building the future of AI‑native gaming with its AI Dungeon adventure‑story game and the upcoming AI‑powered role‑playing platform Voyage, where players can create or explore worlds with the freedom to choose any action and craft their own story.

The Challenge

Each player action triggers an inference request to a large language model (LLM).
Costs grow with engagement, yet response times must stay fast enough to keep the experience seamless.

The Solution

Latitude runs large open‑source models on DeepInfra’s inference platform, which is powered by NVIDIA Blackwell GPUs and TensorRT‑LLM.

For a large‑scale mixture‑of‑experts (MoE) model, DeepInfra reduced the cost per million tokens:

Platform	Cost / 1 M tokens
NVIDIA Hopper (baseline)	$0.20
Blackwell (FP16)	$0.10
Blackwell (NVFP4, low‑precision)	$0.05

Result: a 4× reduction in cost per token while preserving the accuracy expected by customers.

Benefits for Latitude

Fast, reliable responses even during traffic spikes.
Ability to deploy more capable models without compromising player experience.
Cost‑effective scaling as player engagement grows.

Latitude AI Dungeon – real‑time narrative and imagery generation
Latitude’s text‑based adventure “AI Dungeon” generates narrative text and imagery in real time as players explore dynamic stories.

Agentic Chat — Fireworks AI and Sentient Foundation Lower AI Costs by up to 50 %

Sentient Labs brings AI developers together to build powerful, open‑source reasoning AI systems. Their mission is to accelerate AI progress on harder reasoning problems through research in:

Secure autonomy
Agentic architecture
Continual learning

Sentient Chat

Sentient Chat is the first application from Sentient Labs. It:

Orchestrates complex multi‑agent workflows
Integrates more than a dozen specialized AI agents contributed by the community

Because a single user query can trigger a cascade of autonomous interactions, the service has massive compute demands and can generate costly infrastructure overhead.

How the Cost Savings Were Achieved

Sentient Labs moved to Fireworks AI’s inference platform running on NVIDIA Blackwell GPUs. The Blackwell‑optimized inference stack delivered 25‑50 % better cost efficiency compared with the previous Hopper‑based deployment.

Key outcomes

Higher throughput per GPU → more concurrent users for the same cost
Scalable platform supported a viral launch of 1.8 M wait‑listed users in 24 hours
Processed 5.6 M queries in a single week while maintaining low latency

“With Fireworks’ Blackwell‑optimized inference stack, Sentient achieved 25‑50 % better cost efficiency compared with its previous Hopper‑based deployment.” – Sentient Labs

Visual Overview

Sentient Chat orchestrates complex multi‑agent workflows and integrates more than a dozen specialized AI agents from the community.

Learn More

Read the full story on Fireworks AI.

Customer Service — Together AI and Decagon Drive Down Cost by 6×

Customer‑service calls that use voice AI often end in frustration: even a slight delay can cause users to talk over the agent, hang up, or lose trust.

Decagon builds AI agents for enterprise customer support, and voice is its most demanding channel. The company needed infrastructure that could deliver sub‑second responses under unpredictable traffic loads while keeping token‑economics viable for 24/7 voice deployments.

Solution

Together AI runs production inference for Decagon’s multimodel voice stack on NVIDIA Blackwell GPUs. The two companies collaborated on several key optimizations:

Optimization	Description
Speculative decoding	Smaller “draft” models generate fast responses; a larger model verifies accuracy in the background.
Conversation caching	Frequently repeated dialogue elements are cached to accelerate response generation.
Automatic scaling	Dynamic scaling handles traffic surges without degrading performance.