The state of AI code reviews: An 18-month retrospective

Published: December 31, 2025 at 11:09 AM EST

Source: Dev.to

Introduction

It has been almost a year and a half since our company started using AI intensively in our everyday work. We’ve tested various tools—such as Windsurf, GitHub Copilot, Cursor, Claude, and LibreChat—and I can now confidently comment on how AI models affect software engineering as a discipline.

In this article I’ll share my honest, experience-based opinion on using AI for code reviews.

Note: I’m not an AI specialist. I don’t have the deep technical knowledge to discuss model architectures the way many people on the Internet tend to do these days. It feels like everyone thinks they’re an expert on LLMs, which obviously isn’t realistic.

Why Code Review Still Matters

Code review is a key stage in every software project, and neglecting it often leads to unpredictable and critical problems later on. Since the very beginning of the software era, reviewing written code—regardless of programming language—has been seen as a required step before the code is used in production.

A Historical Example: Apollo Spacecraft Software

Project: Apollo Guidance Computer flight software (Luminary for the Lunar Module, Colossus for the Command Module)
Size: > 145 000 lines of code
Process: Every file included comments (see the repository on GitHub) and the entire codebase went through multiple review and approval iterations.

Back then, every line was written, reviewed, and optimized by hand. Even in the early 1960s, processes such as defining requirements, design, coding, testing, and maintenance were strictly followed.

Margaret Hamilton, who led the lab developing the Apollo flight software, once said:
“What became apparent with Apollo—though it is not how it worked—is that it is better to define your system up front to minimize errors, rather than producing a bunch of code that then has to be corrected with patches on patches. It’s a message that seems to have gone unheeded—in this respect, software today is still built the way it was 50 years ago.”

Because errors were found and fixed early, the system was robust enough to handle unexpected computer overloads (the 1202 and 1201 program alarms) minutes before the Apollo 11 lunar landing. No software errors were reported during any of the crewed Apollo missions, a remarkable testament to human precision.

The Code Review Pyramid

A well-known concept in software engineering is the Code Review Pyramid, which ranks the aspects of a review by importance. From most to least important:

Functionality & Design
Implementation
Testing
Documentation
Code Style
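
One way to put the pyramid to work is to sort review findings by layer, so that design problems surface before formatting nits. The sketch below assumes the list above runs from most to least important. The `Layer`, `Finding`, and `prioritize` names and the sample findings are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    """Pyramid layers; a lower value means a more important layer (assumed order)."""
    FUNCTIONALITY_AND_DESIGN = 1
    IMPLEMENTATION = 2
    TESTING = 3
    DOCUMENTATION = 4
    CODE_STYLE = 5


@dataclass(frozen=True)
class Finding:
    layer: Layer
    message: str


def prioritize(findings):
    """Return findings ordered from the most to the least important layer."""
    # sorted() is stable, so findings within one layer keep their original order.
    return sorted(findings, key=lambda finding: finding.layer)


# Hypothetical review findings, deliberately listed out of order.
findings = [
    Finding(Layer.CODE_STYLE, "Line exceeds 100 characters"),
    Finding(Layer.FUNCTIONALITY_AND_DESIGN, "Retry logic can charge a customer twice"),
    Finding(Layer.TESTING, "No test for the empty-cart case"),
]

ordered = prioritize(findings)
for finding in ordered:
    print(f"[{finding.layer.name}] {finding.message}")
```

The functionality finding is printed first and the style finding last, which mirrors how a reviewer would spend attention under this hierarchy.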

I have always applied this hierarchy when reviewing others’ code. The first review, however, is always the one I do myself, on my own code.

Before you push your changes or ask an AI to check them, revisit the pyramid yourself.
Don’t just check for syntax.

Practical Tips for Human Reviewers

Read it as if it’s your first time. Ask yourself:
- What is confusing?
- What is implicit?

Take a break between writing and reviewing.
- Many developers submit changes at the end of the day, when fatigue leads to overlooked errors.
- Waiting until morning and reviewing with fresh eyes dramatically improves the result.

What AI Can Automate

Several parts of the review process can be automated:

Code style enforcement
Syntax error detection
Test‑coverage verification
Code‑optimization suggestions
General error detection

Strengths of AI in Code Review

Advantage	Description
Immediate feedback	Traditional reviews can take hours or days; AI replies instantly.
No cognitive fatigue	Reviewing hundreds of lines of someone else’s code can be mentally draining. AI never tires.
Productivity scaling	In large teams, human reviewers spend significant time on reviews, impacting productivity. AI’s speed does not degrade with team size.
Consistent detection	AI spots the same problems every single time, especially for conventions or minor flaws.
Scalable throughput	AI can review dozens of pull requests per day without performance loss.

A Cisco study found that reviewing more than 400 lines of code at once reduces a reviewer’s ability to find bugs, with most defects discovered in the first 200 lines. This insight shaped industry practices—but AI doesn’t suffer from this limitation. It can handle large reviews without degradation.
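
For the human half of the review, the size guideline can be checked mechanically before a change is sent out. The following is a minimal sketch, not a full diff parser: it counts added and removed lines in a unified diff and warns above the 400-line threshold mentioned above. The function names and the sample diff are made up for this example.

```python
MAX_REVIEW_LINES = 400  # threshold taken from the study cited above


def count_changed_lines(unified_diff):
    """Count added and removed lines in a unified diff, skipping file headers.

    Simplification: a removed line whose content itself starts with "--"
    would be mistaken for a "---" file header and not counted.
    """
    changed = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")):
            changed += 1
    return changed


def review_size_warning(unified_diff, limit=MAX_REVIEW_LINES):
    """Return a warning string if the diff is larger than `limit`, else None."""
    changed = count_changed_lines(unified_diff)
    if changed > limit:
        return f"{changed} changed lines exceeds {limit}; consider splitting this change."
    return None


sample_diff = "\n".join([
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,3 +1,3 @@",
    " import os",
    "-print('old')",
    "+print('new')",
    " # end",
])

print(count_changed_lines(sample_diff))   # 2
print(review_size_warning(sample_diff))   # None
```

A check like this could run in a pre-push hook or a CI step, so that oversized changes are split before a human reviewer ever sees them.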

The Missing Piece: Understanding

While AI performs very well on the tasks mentioned above, and I rely on it heavily, it still lacks something fundamental: understanding.

Current AI systems are, in many ways, quite limited.
We’re fooled into thinking they’re intelligent because they handle language so well, yet they don’t understand the physical world.
They lack persistent memory, true reasoning, and long‑term planning—crucial aspects of genuine intelligence.

A Quick Primer on Machine‑Learning Paradigms

1. Supervised Learning

Classical approach.
The model is trained on a dataset of examples that contain both input and the correct output (labels).
Example: Show an image of a table and label it “table”. The model predicts; if it’s wrong, its internal parameters are adjusted. Repeating this millions of times forms strong input‑output associations.
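
The predict-and-adjust loop described above can be sketched with a perceptron, one of the simplest supervised models. This is an illustration only: instead of images it uses two-bit inputs labeled with the logical AND of the bits, and the function names, learning rate, and epoch count are choices made for this example.

```python
def predict(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(examples, epochs=20, learning_rate=1.0):
    """Adjust the parameters whenever a prediction disagrees with its label."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            error = label - predict(weights, bias, inputs)  # 0 when correct
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Labeled dataset: each input pair comes with the correct output (logical AND).
labeled_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(labeled_examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in labeled_examples]
print(predictions)  # [0, 0, 0, 1]
```

Real image classifiers use far larger models and gradient-based updates, but the principle is the same: compare the prediction with the label and nudge the parameters when they differ.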

2. Reinforcement Learning

More closely resembles certain aspects of human learning.
Instead of being told the exact right answer, the AI acts, observes the consequences, and receives a reward or penalty.
It adjusts future actions to maximize long‑term reward. Think of how you learned to ride a bike: you try, fall, adjust, and eventually succeed.
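
The act, observe, and adjust cycle can be illustrated with tabular Q-learning, one common reinforcement learning algorithm. The sketch below uses a made-up five-cell corridor where the only reward comes from reaching the last cell. The environment, the hyperparameters, and all names are assumptions chosen for this example.

```python
import random

N_STATES = 5                  # corridor cells 0..4; reaching cell 4 ends an episode
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic toy environment: reward 1.0 only on reaching the last cell."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q_row, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q_row.values())
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so every episode terminates
            # Explore occasionally, otherwise exploit what has been learned so far.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Move the estimate toward reward plus discounted future value.
            future = 0.0 if done else max(q[next_state].values())
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy should choose "right" in every cell, even though the agent was never told that moving right is correct. It only ever saw a reward at the end of the corridor.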

Closing Thought

AI is an incredibly powerful assistant for code review—fast, consistent, and scalable. Yet, it remains a tool that doesn’t truly understand the code it evaluates. Human reviewers, armed with the Code Review Pyramid and good habits (like taking breaks), still play an essential role in delivering robust, maintainable software.

Reinforcement Learning – Limits of the Paradigm

Through trial, error, and correction, reinforcement learning (RL) can produce impressive results. However, this paradigm has limitations. It is sample-inefficient, often needing a very large number of trials, and it works best in clearly defined environments (like playing chess, Go, or poker) where success metrics are known and unambiguous. In complex, real-world settings without clear feedback, reinforcement learning becomes impractical.

Self‑Supervised Learning

This is the foundation of the most recent revolution in AI, including large language models such as those behind ChatGPT. Here, the system learns from unlabeled data by creating its own predictive tasks, for example trying to predict a missing word or the next word in a sentence. By training on vast quantities of text, the model builds internal representations of patterns and relationships between words and concepts.

This approach allows AI to gain an impressive ability to generate coherent and context‑aware language.
It still does not give the model genuine understanding or reasoning ability—it is pattern recognition, not comprehension.
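
The "creating its own predictive tasks" idea can be shown at toy scale with a bigram model: the raw text supplies both the input (a word) and the target (the word that follows it), so no human labeling is needed. This is a deliberately tiny sketch with a made-up corpus and hypothetical function names. Large language models use neural networks rather than counts, but the training signal is built the same way.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the raw text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Each adjacent pair is a self-generated training example: (input, target).
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Unlabeled corpus, made up for this example.
corpus = "the model reads the code then the model flags the bug"
bigram_model = train_bigram(corpus)

print(predict_next(bigram_model, "the"))     # model
print(predict_next(bigram_model, "unseen"))  # None
```

The model predicts "model" after "the" only because that pair occurs most often in the corpus, which is the pattern-recognition-without-comprehension point in miniature.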

Even with all this sophistication, AI remains fundamentally limited by its training data and objectives. It does not understand things as humans do; it merely models statistical relationships.

The Role of Software

Software, by definition, is a tool that must operate according to human needs and actions, sometimes even playing a role in life-critical systems. The gap between modeling statistical patterns and understanding those needs is where human consciousness and intuition remain irreplaceable.

Final Thoughts

I’ll end with a visual that perfectly captures how the development landscape has shifted in the past two years.

Writing code itself has never truly been the problem.
The real challenge has always been delivering error‑free and dependable software.

The lesson is clear: AI is a powerful ally—especially for automating repetitive parts of the review process—but the human element remains the final safeguard of wisdom, empathy, and understanding that no algorithm can yet fully emulate.