Introducing GPT-5.3-Codex-Spark
Source: OpenAI Blog
Research Preview: GPT‑5.3‑Codex‑Spark
A smaller, real‑time coding model built in partnership with Cerebras.
📢 What’s new?
- Codex‑Spark is the first model designed for instant‑feedback coding.
- Optimized for ultra‑low‑latency hardware, it can generate more than 1,000 tokens per second while staying highly capable on real‑world programming tasks.
- Available now as a research preview for ChatGPT Pro users.
🤝 Partnership with Cerebras
- This launch marks the first milestone of the OpenAI × Cerebras partnership announced in January.
- We are working with Cerebras to:
- Scale datacenter capacity.
- Harden the end‑to‑end user experience.
- Deploy larger frontier models in the future.
🛠️ Model capabilities
| Feature | Details |
|---|---|
| Context window | 128 k tokens |
| Output type | Text‑only |
| Primary use‑case | Real‑time code edits, logic reshaping, UI refinements with immediate results |
| Long‑running tasks | Still supported – Codex‑Spark complements existing models that can run autonomously for hours/days/weeks. |
🚀 How to access
- Who can use it? ChatGPT Pro users (research preview).
- Rate limits: Codex‑Spark has its own limits; usage does not count toward your standard ChatGPT quotas.
- Potential throttling: When demand spikes, you may encounter limited access or temporary queuing as we balance reliability across all users.
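When you do hit temporary throttling, the standard client-side response is exponential backoff with jitter. A minimal sketch, assuming your client surfaces a rate-limit error of some kind (the `RateLimitError` class and the parameter values here are illustrative, not part of any Codex‑Spark SDK):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever throttling error your client raises."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a throttled call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            # Double the wait each attempt; jitter spreads out retries
            # so clients don't all hammer the server in lockstep.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
    raise RateLimitError("out of retries")
```

Jitter matters under demand spikes: without it, every throttled client retries at the same instant and recreates the spike.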
📋 What we’re looking for
- Developer feedback on real‑time coding workflows.
- Insights on how the model performs for both instant edits and long‑running projects.
- Suggestions for future improvements and feature expansions.
Speed and Intelligence
Codex‑Spark is optimized for interactive work where latency matters as much as intelligence. You can collaborate with the model in real time—interrupting or redirecting it as it works—and rapidly iterate with near‑instant responses.
Because it’s tuned for speed, Codex‑Spark keeps its default working style lightweight:
- Minimal, targeted edits – only the changes you need.
- No automatic test runs – tests are executed only when you request them.
Coding
Codex‑Spark is a highly capable, small model optimized for fast inference. On SWE‑Bench Pro and Terminal‑Bench 2.0—two benchmarks that evaluate agentic software‑engineering capability—GPT‑5.3‑Codex‑Spark demonstrates strong performance while completing the tasks in a fraction of the time compared to GPT‑5.3‑Codex.
Latency Improvements for All Models
While training Codex‑Spark, we discovered that model speed alone isn’t enough for real‑time collaboration. Reducing latency across the entire request‑response pipeline became essential. The following end‑to‑end enhancements have been added to our harness and will benefit all models:
What We Changed
- Streaming pipeline – Optimized how responses flow from client ↔ server.
- Inference stack – Rewrote critical components for faster execution.
- Session initialization – Made the first visible token appear sooner, keeping Codex responsive during iteration.
- Persistent WebSocket connection – Introduced a dedicated, long‑lived channel for communication (enabled by default for Codex‑Spark and soon for every model).
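The gain from a persistent channel comes from paying connection setup once instead of once per message. A rough illustration with raw TCP sockets (not the actual Codex transport, just the underlying idea): both paths below return identical responses, but the persistent one reuses a single long‑lived connection across all requests.

```python
import socket
import threading

def start_echo_server():
    """Start a local TCP echo server on an ephemeral port; return the port."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    port = srv.getsockname()[1]

    def handle(conn):
        with conn:
            while data := conn.recv(1024):
                conn.sendall(data)

    def serve():
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return port

def per_request(port, msgs):
    # Fresh connection per message: one handshake for every request.
    out = []
    for m in msgs:
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.sendall(m)
            out.append(s.recv(1024))
    return out

def persistent(port, msgs):
    # One long-lived connection (WebSocket-style): a single handshake,
    # then every message reuses the open channel.
    out = []
    with socket.create_connection(("127.0.0.1", port)) as s:
        for m in msgs:
            s.sendall(m)
            out.append(s.recv(1024))
    return out
```

On top of the TCP handshake, a real per-request path typically also repeats TLS negotiation and authentication, which is why eliminating it moves round‑trip overhead so much.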
Quantitative Gains
| Metric | Improvement |
|---|---|
| Client/Server round‑trip overhead | ‑80 % |
| Per‑token processing overhead | ‑30 % |
| Time‑to‑first‑token (TTFT) | ‑50 % |
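To see how the three reductions compound, consider a single streamed response. With illustrative baseline figures (the numbers below are assumptions for the sketch, not measurements from this post), the table's percentages apply roughly like this:

```python
# Illustrative baseline figures (assumptions, not OpenAI's measurements)
ROUND_TRIP_MS = 100.0   # client/server round-trip overhead
TTFT_MS = 400.0         # time-to-first-token
PER_TOKEN_MS = 10.0     # per-token processing overhead

def total_latency_ms(round_trip, ttft, per_token, n_tokens):
    # One round trip to start the request, wait for the first token,
    # then stream the output at the per-token rate.
    return round_trip + ttft + n_tokens * per_token

before = total_latency_ms(ROUND_TRIP_MS, TTFT_MS, PER_TOKEN_MS, 500)
# Apply the table's reductions: -80% round trip, -50% TTFT, -30% per token.
after = total_latency_ms(ROUND_TRIP_MS * 0.2, TTFT_MS * 0.5, PER_TOKEN_MS * 0.7, 500)
print(before, after)  # 5500.0 3720.0
```

Note that for long responses the per‑token term dominates, while for short interactive exchanges the round‑trip and TTFT cuts are what you actually feel.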
What This Means for You
- Faster feedback – The first token shows up much sooner, improving the interactive feel.
- Smoother iterations – Reduced per‑token latency makes continuous editing feel seamless.
- Unified experience – The WebSocket path will become the default for all models, ensuring consistent performance across the platform.
Powered by Cerebras
Codex‑Spark runs on Cerebras’ Wafer Scale Engine 3 — a purpose‑built AI accelerator for high‑speed inference that gives Codex a latency‑first serving tier. We partnered with Cerebras to add this low‑latency path to the same production serving stack as the rest of our fleet, so it works seamlessly across Codex and sets us up to support future models.
“What excites us most about GPT‑5.3‑Codex‑Spark is partnering with OpenAI and the developer community to discover what fast inference makes possible — new interaction patterns, new use cases, and a fundamentally different model experience. This preview is just the beginning.”
— Sean Lie, CTO and Co‑Founder of Cerebras
- GPUs remain foundational across our training and inference pipelines and deliver the most cost‑effective tokens for broad usage.
- Cerebras complements that foundation by excelling at workflows that demand extremely low latency, tightening the end‑to‑end loop so Codex feels more responsive as you iterate.
- GPUs and Cerebras hardware can be combined within a single workload when that yields the best performance.
Availability & Details
Codex‑Spark is rolling out today as a research preview for ChatGPT Pro users in the latest versions of:
- the Codex app
- the CLI
- the VS Code extension
Because it runs on specialized low‑latency hardware, usage is governed by a separate rate limit that may adjust based on demand during the preview.
API Access
- Currently available to a small set of design partners.
- Goal: understand how developers want to integrate Codex‑Spark into their products.
- Wider access will be expanded over the coming weeks as we tune the integration under real workloads.
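If you are integrating a streaming endpoint, time‑to‑first‑token is worth measuring on your side too. A transport‑agnostic sketch that works with any iterator of text chunks (how you obtain the iterator is up to your client library; nothing here is specific to the Codex‑Spark API):

```python
import time

def consume_stream(token_iter):
    """Drain a token stream, recording time-to-first-token (TTFT).

    Returns the assembled text and the TTFT in seconds
    (None if the stream was empty).
    """
    start = time.monotonic()
    ttft = None
    chunks = []
    for tok in token_iter:
        if ttft is None:
            # First chunk arrived: this is what "feels fast" to a user.
            ttft = time.monotonic() - start
        chunks.append(tok)
    return "".join(chunks), ttft
```

Tracking TTFT separately from total completion time makes it easy to tell whether slowness comes from session setup or from generation throughput.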
Model Capabilities
- Text‑only with a 128 k token context window.
- First model in a family of ultra‑fast models.
- Future enhancements (based on developer feedback) may include:
- Larger models
- Longer context lengths
- Multimodal input
Safety & Evaluation
- Includes the same safety training as our mainline models, covering cyber‑relevant scenarios.
- Evaluated through our standard deployment process, which includes baseline assessments for cybersecurity and other capabilities.
- Determined not to meet the Preparedness Framework threshold for high capability in cybersecurity or biology.
What’s Next
Codex‑Spark is the first step toward a Codex with two complementary modes:
- Longer‑horizon reasoning and execution
- Real‑time collaboration for rapid iteration
Over time, these modes will blend. Codex can keep you in a tight interactive loop while delegating longer‑running work to sub‑agents in the background, or it can fan out tasks to many models in parallel when you need breadth and speed. This means you won’t have to choose a single mode up front.
As models become more capable, interaction speed becomes a clear bottleneck. Ultra‑fast inference tightens that loop, making Codex feel more natural to use and expanding what’s possible for anyone turning an idea into working software.