How 2025 took AI from party tricks to production tools
Source: Dev.to

This blog post was authored by Piotr Migdal.
Overview
What began in early 2025 as bold experiments had become industry standard by year's end. Two paradigms drove this shift:
- Reasoning models – spending tokens to think before answering.
- Agentic tool use – executing code to interact with the world.
This subjective review of LLMs for software engineering covers three stages:
- the experimental breakthroughs of the first half of 2025,
- the production struggles where agents were often too chaotic to be useful, and
- the current state of practical, everyday tools.
First half of 2025
January
- DeepSeek released the first open‑source reasoning model, DeepSeek‑R1, sharing both weights and know‑how. It broke the assumption that frontier AI would remain an oligopoly of proprietary models. Previously, the only reasoning model was OpenAI's o1, previewed in September 2024.
February
- Andrej Karpathy coined the term “vibe coding” for programming where we primarily use plain language rather than code.
- OpenAI released GPT‑4.5 – a real marvel. It was closed‑source and expensive ($2 per single run in Cursor), but nothing matches its ability to brainstorm (more frank, less reserved, creative, adjustable), and it was unparalleled at advanced translations. I miss it.
- OpenAI released Deep Research, which spends time running multiple searches and summarizing the results. Initially costly and slow, it still saved time compared with manual web search.
- Anthropic released Claude Code, a command‑line tool for agentic coding, as a research preview.
March
- ARC‑AGI‑2 launched as a benchmark designed to be easy for humans yet nearly impossible for AI. Top models achieved ~1 % performance.
- OpenAI released its 4o Image Generation model, flooding the web with Studio Ghibli pastiches.
April
- OpenAI released o4‑mini, a smart yet reasonably fast reasoning model. In a brief conversation it explained Einstein’s General Theory of Relativity to me – a topic I had struggled to understand despite many approaches.
May
- Google released Veo 3, allowing us to create videos that are sometimes hard to distinguish from real recordings.
June
- Gemini 2.5 Pro brought Google back to the AI game.
- With Gemini 2.5 Flash we finally had a model good at summarization and data extraction, yet fast and cheap.
July
- DeepMind achieved gold‑level performance at the International Mathematical Olympiad.
From worldwide achievement to everyday production
And that was just the first half of 2025.
Progress arrived with significant caveats. We saw impressive demos and breakthroughs that often failed in production:
- Too slow or costly – Early reasoning models (o1) and web‑search agents (Deep Research) were powerful but impractical for daily loops.
- Over‑caffeinated AI agents – Tools like early Claude Code (with Sonnet 3.7) were as likely to wreak havoc on your codebase as to fix it.
- The uncanny valley – Image generators (initial 4o Image Generation and Nano Banana) created stunning visuals but were unreliable for complicated instructions or text rendering.
The potential was undeniable, but extracting it required heavy lifting: extensive prompt engineering beforehand and rigorous auditing afterwards. It felt like managing an intern who needs constant supervision rather than collaborating with a capable colleague.
For pragmatists who ignore benchmarks and hype, the calculation is simple: does the tool improve net efficiency? A model that performs a task—a technical feat in itself—is useless if it demands more time in manual cleanup than it saves.
Now
Many research achievements from the first half of 2025 have become daily tools.
Reasoning is mainstream
The first reasoning model was OpenAI's o1, previewed in September 2024 and fully released in December 2024. Thanks to DeepSeek‑R1, other labs could move forward, making reasoning both smarter and faster. Today all leading models support it, especially the flagship ones.
Deep Research
What was once a costly, slow Deep Research run is now an everyday search capability offered by every major AI provider (ChatGPT, Google Gemini, and others). The peak reasoning performance of early 2025 is now much faster and cheaper, making "thinking before answering" a default part of most workflows.
Search‑augmented AI
The paradigm has shifted: search is now a tool that the model can invoke iteratively and combine with other actions. Modern models hallucinate far less because they can web‑search and fact‑check themselves.
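The control flow behind "search as a tool" is a simple loop: the model either requests a search or emits an answer grounded in the evidence gathered so far. A minimal sketch, with the model and search tool stubbed out (`fake_model`, `fake_search`, and the canned corpus are all illustrative, not any real API):

```python
def fake_search(query: str) -> str:
    """Stand-in for a web-search tool; returns a canned snippet."""
    corpus = {
        "deepseek r1 release": "DeepSeek-R1 was released in January 2025.",
    }
    return corpus.get(query.lower(), "No results found.")

def fake_model(question: str, evidence: list[str]) -> dict:
    """Stand-in for an LLM. With no evidence it asks to search;
    once evidence arrives, it answers based on it."""
    if not evidence:
        return {"action": "search", "query": "DeepSeek R1 release"}
    return {"action": "answer", "text": f"Answer based on: {evidence[-1]}"}

def answer_with_search(question: str, max_steps: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        step = fake_model(question, evidence)
        if step["action"] == "search":
            # Iterative tool use: search, read, and possibly search again.
            evidence.append(fake_search(step["query"]))
        else:
            return step["text"]
    return "Gave up after too many steps."

print(answer_with_search("When was DeepSeek-R1 released?"))
```

Real providers implement the same loop server-side; the key shift is that searching happens inside the model's reasoning process rather than as a one-shot retrieval step before it.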
Open‑source models are back in the game
- Dec 2024 – DeepSeek released the first open‑source model that could compete with proprietary offerings.
- Since then, many more have appeared: DeepSeek's newer releases, Kimi‑K2 Thinking, MiniMax‑M1, GLM‑4.7, Mistral 3, and OpenAI's OSS models.
AGI benchmarks
- ARC‑AGI‑2
- Humanity's Last Exam (HLE)
Results by the end of 2025:
| Benchmark | Model | Score |
|---|---|---|
| HLE (Scale leaderboard) | Gemini 3 Pro | 37 % |
| ARC‑AGI‑2 (leaderboard) | Gemini 3 Pro | >30 % |
| ARC‑AGI‑2 | Claude Opus 4.5 | ~40 % |
| ARC‑AGI‑2 | GPT‑5.2 | >50 % |
These tests were designed to be hard enough to last for years, yet progress on them came faster than expected.
Agentic coding
- Claude Code – now the de facto AGI for coding: it can write, run, and debug code, call external APIs, and integrate with any workflow.
- It was first noticed on Hacker News.
- For the development story, see "How Claude Code is built" by Gergely Orosz.
Model evolution
| Model | Characteristics |
|---|---|
| Claude Sonnet 3.7 | Awkward, prone to breaking code |
| Claude Sonnet 4 | More stable, faster |
| Claude Opus 4 | Stronger but slower & expensive |
| Claude Sonnet 4.5 | Same power as Opus 4, much faster |
| Claude Opus 4.5 | Same speed as Sonnet 4.5, smarter |
What you need: a strong model, a long context window, and tool‑calling capability. With Opus 4.5 you get high performance at a rapid pace.
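At its core, an agentic coding tool runs a write‑run‑debug loop: propose code, execute it, feed any traceback back to the model, and retry. This is a toy sketch of that loop, not how Claude Code is actually implemented; `stub_model` and the canned `ATTEMPTS` stand in for a real LLM call:

```python
import subprocess
import sys
import tempfile

# Canned "model" behavior: the first draft has a NameError,
# the second draft is fixed after seeing the traceback.
ATTEMPTS = [
    "print(unknown_name)",
    "print('tests pass')",
]

def stub_model(history: list[str]) -> str:
    """Stand-in for the LLM: returns the next code attempt,
    given the tracebacks of previous failures."""
    return ATTEMPTS[len(history)]

def run_code(code: str) -> tuple[bool, str]:
    """Execute the candidate in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    ok = result.returncode == 0
    return ok, result.stdout if ok else result.stderr

def agent_loop(max_iters: int = 3) -> str:
    history: list[str] = []
    for _ in range(max_iters):
        code = stub_model(history)
        ok, output = run_code(code)
        if ok:
            return output.strip()
        history.append(output)  # feed the traceback back to the model
    return "failed"

print(agent_loop())  # the stub succeeds on its second attempt
```

The long context window matters here because every failed attempt's code and traceback accumulates in the model's input; the tool‑calling capability is what lets the model drive `run_code` itself rather than waiting for a human to paste errors back in.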
Competing tools
- Codex CLI – OpenAI
- Gemini CLI – Google
- Cursor CLI – Cursor
See a broader evaluation in Migrating CompileBench to Harbor: standardizing AI agent evals.
Image generation
Nano Banana Pro:
- Moves beyond concept‑art images to generate infographics and charts.
- Grounds its output in web search, which makes results far more factually accurate.
You can embed it in an agentic workflow via Antigravity or Claude Skills.
Advanced uses
AI is no longer just for math homework or competition‑style research; it’s becoming a productivity partner.
- Scott Aaronson, quantum‑computing researcher
- Terence Tao, Fields Medalist
Both use AI to push the frontiers of their fields. Mistakes still happen, but in expert hands the technology becomes even more capable.
Conclusion
2025 was the most intense year yet for AI development. Many once‑only‑demo technologies have become standard tools for everyday work.
I’ve only scratched the surface of model releases, demos, and papers. For deeper insight, check out:
- 2025 LLM Year in Review by Andrej Karpathy
- 2025: The year in LLMs by Simon Willison
- AI News, a daily newsletter
Even as someone whose job revolves around AI, keeping up with the rapid pace is a full‑time challenge.