How 2025 took AI from party tricks to production tools
Source: Dev.to

This blog post was authored by Piotr Migdal.
Overview
What began in early 2025 as bold experiments had become industry standard by year's end. Two paradigms drove this shift:
- Reasoning models – spending tokens to think before answering.
- Agentic tool use – executing code to interact with the world.
This subjective review of LLMs for software engineering covers three stages:
- the experimental breakthroughs of the first half of 2025,
- the production struggles where agents were often too chaotic to be useful, and
- the current state of practical, everyday tools.
First half of 2025
January
- DeepSeek released the first open‑source reasoning model, DeepSeek‑R1, sharing both weights and know‑how. It broke the assumption that frontier AI would remain an oligopoly of proprietary models. Previously, the only reasoning model was OpenAI's o1, previewed in September 2024.
February
- Andrej Karpathy coined the term “vibe coding” for programming where we primarily use plain language rather than code.
- OpenAI released GPT‑4.5 – a real marvel. It was closed‑source and expensive ($2 per single run in Cursor), but nothing matches its ability to brainstorm (more frank, less reserved, creative, adjustable), and it was unparalleled at advanced translations. I miss it.
- OpenAI released Deep Research, which spends time running multiple searches and summarizing the results. Initially costly and slow, it still saved time compared with manual web search.
- Anthropic released Claude Code, a command‑line tool for agentic coding, as a research preview.
March
- ARC‑AGI‑2 launched as a benchmark designed to be easy for humans yet nearly impossible for AI. Top models achieved ~1 % performance.
- OpenAI released its 4o Image Generation model, flooding the web with Studio Ghibli pastiches.
April
- OpenAI released o4‑mini, a smart yet reasonably fast reasoning model. In a brief conversation it explained Einstein’s General Theory of Relativity to me – a topic I had struggled to understand despite many approaches.
May
- Google released Veo 3, allowing us to create videos that are sometimes hard to distinguish from real recordings.
June
- Gemini 2.5 Pro brought Google back to the AI game.
- With Gemini 2.5 Flash we finally had a model good at summarization and data extraction, yet fast and cheap.
July
- DeepMind achieved gold‑level performance at the International Mathematical Olympiad.
From worldwide achievement to everyday production
And that was just the first half of 2025.
Progress arrived with significant caveats. We saw impressive demos and breakthroughs that often failed in production:
- Too slow or costly – Early reasoning models (o1) and web‑search agents (Deep Research) were powerful but impractical for daily loops.
- Over‑caffeinated AI agents – Tools like early Claude Code (with Sonnet 3.7) were as likely to wreak havoc on your codebase as to fix it.
- The uncanny valley – Image generators (initial 4o Image Generation and Nano Banana) created stunning visuals but were unreliable for complicated instructions or text rendering.
The potential was undeniable, but extracting it required heavy lifting: extensive prompt engineering beforehand and rigorous auditing afterwards. It felt like managing an intern who needs constant supervision rather than collaborating with a capable colleague.
For pragmatists who ignore benchmarks and hype, the calculation is simple: does the tool improve net efficiency? A model that performs a task—a technical feat in itself—is useless if it demands more time in manual cleanup than it saves.
Now
Many research achievements from the first half of 2025 have become daily tools.
Reasoning is mainstream
The first reasoning model was OpenAI's o1, previewed in September 2024 and fully released in December 2024. Thanks to DeepSeek‑R1, other labs could move forward, making reasoning both smarter and faster. Today all leading models support it, especially the flagship ones.
Deep Research
What was once a costly, slow Deep Research run is now an everyday search capability offered by every major AI provider (ChatGPT, Google Gemini, and others). The peak reasoning performance of early 2025 is now much faster and cheaper, making "thinking before answering" a default part of most workflows.
Search‑augmented AI
The paradigm has shifted: search is now a tool that the model can invoke iteratively and combine with other actions. Modern models hallucinate far less because they can web‑search and fact‑check themselves.
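The control flow behind "search as a tool" is a simple loop: the model either requests a search or emits an answer grounded in the evidence gathered so far. A minimal sketch, with the model and search tool stubbed out (`fake_model`, `fake_search`, and the canned corpus are all illustrative, not any real API):

```python
def fake_search(query: str) -> str:
    """Stand-in for a web-search tool; returns a canned snippet."""
    corpus = {
        "deepseek r1 release": "DeepSeek-R1 was released in January 2025.",
    }
    return corpus.get(query.lower(), "No results found.")

def fake_model(question: str, evidence: list[str]) -> dict:
    """Stand-in for an LLM. With no evidence it asks to search;
    once evidence arrives, it answers based on it."""
    if not evidence:
        return {"action": "search", "query": "DeepSeek R1 release"}
    return {"action": "answer", "text": f"Answer based on: {evidence[-1]}"}

def answer_with_search(question: str, max_steps: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        step = fake_model(question, evidence)
        if step["action"] == "search":
            # Iterative tool use: search, read, and possibly search again.
            evidence.append(fake_search(step["query"]))
        else:
            return step["text"]
    return "Gave up after too many steps."

print(answer_with_search("When was DeepSeek-R1 released?"))
```

Real providers implement the same loop server-side; the key shift is that searching happens inside the model's reasoning process rather than as a one-shot retrieval step before it.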
Open‑source models are back in the game
- Dec 2024 – DeepSeek released the first open‑source model that could compete with proprietary offerings.
- Since then, many more have appeared: DeepSeek's newer releases, Kimi‑K2 Thinking, MiniMax‑M1, GLM‑4.7, Mistral 3, and OpenAI's OSS models.
AGI benchmarks
- ARC‑AGI‑2
- Humanity's Last Exam (HLE)
Results by the end of 2025:
| Benchmark | Model | Score |
|---|---|---|
| HLE (Scale leaderboard) | Gemini 3 Pro | 37 % |
| ARC‑AGI‑2 (leaderboard) | Gemini 3 Pro | >30 % |
| ARC‑AGI‑2 | Claude Opus 4.5 | ~40 % |
| ARC‑AGI‑2 | GPT‑5.2 | >50 % |
These tests were designed to be hard enough to last for years, yet progress on them came faster than expected.
Agentic coding
- Claude Code – now the de facto AGI for coding: it can write, run, and debug code, call external APIs, and integrate with any workflow.
- It was first noticed on Hacker News.
- For the development story, see "How Claude Code is built" by Gergely Orosz.
Model evolution
| Model | Characteristics |
|---|---|
| Claude Sonnet 3.7 | Awkward, prone to breaking code |
| Claude Sonnet 4 | More stable, faster |
| Claude Opus 4 | Stronger but slower & expensive |
| Claude Sonnet 4.5 | Same power as Opus 4, much faster |
| Claude Opus 4.5 | Same speed as Sonnet 4.5, smarter |
What you need: a strong model, a long context window, and tool‑calling capability. With Opus 4.5 you get high performance at a rapid pace.
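At its core, an agentic coding tool runs a write‑run‑debug loop: propose code, execute it, feed any traceback back to the model, and retry. This is a toy sketch of that loop, not how Claude Code is actually implemented; `stub_model` and the canned `ATTEMPTS` stand in for a real LLM call:

```python
import subprocess
import sys
import tempfile

# Canned "model" behavior: the first draft has a NameError,
# the second draft is fixed after seeing the traceback.
ATTEMPTS = [
    "print(unknown_name)",
    "print('tests pass')",
]

def stub_model(history: list[str]) -> str:
    """Stand-in for the LLM: returns the next code attempt,
    given the tracebacks of previous failures."""
    return ATTEMPTS[len(history)]

def run_code(code: str) -> tuple[bool, str]:
    """Execute the candidate in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    ok = result.returncode == 0
    return ok, result.stdout if ok else result.stderr

def agent_loop(max_iters: int = 3) -> str:
    history: list[str] = []
    for _ in range(max_iters):
        code = stub_model(history)
        ok, output = run_code(code)
        if ok:
            return output.strip()
        history.append(output)  # feed the traceback back to the model
    return "failed"

print(agent_loop())  # the stub succeeds on its second attempt
```

The long context window matters here because every failed attempt's code and traceback accumulates in the model's input; the tool‑calling capability is what lets the model drive `run_code` itself rather than waiting for a human to paste errors back in.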
Competing tools
- Codex CLI – OpenAI
- Gemini CLI – Google
- Cursor CLI – Cursor
See a broader evaluation in Migrating CompileBench to Harbor: standardizing AI agent evals.
Image generation
Nano Banana Pro:
- Moves beyond concept‑art images to generate infographics and charts.
- Grounds its output in web search, which makes results far more factually accurate.
You can embed it in an agentic workflow via Antigravity or Claude Skills.
Advanced uses
AI is no longer just for math homework or competition‑style research; it’s becoming a productivity partner.
- Scott Aaronson, quantum‑computing researcher
- Terence Tao, Fields Medalist
Both use AI to push the frontiers of their fields. Mistakes still happen, but in expert hands the technology becomes even more capable.
Conclusion
2025 was the most intense year yet for AI development. Many once‑only‑demo technologies have become standard tools for everyday work.
I’ve only scratched the surface of model releases, demos, and papers. For deeper insight, check out:
- 2025 LLM Year in Review by Andrej Karpathy
- 2025: The year in LLMs by Simon Willison
- AI News, a daily newsletter
Even as someone whose job revolves around AI, keeping up with the rapid pace is a full‑time challenge.