Stop Measuring Noise: The Productivity Metrics That Really Matter in Software Engineering
Source: Dev.to
Introduction
Productivity has become a dirty word in engineering.
Mention it in a Slack channel and the immediate assumption is that management is looking for a reason to fire the bottom 10 % or that McKinsey is back with another controversial report. (For what it’s worth, their 2023 report actually warns against using overly simple measurements—such as lines of code produced or number of code commits—rather than recommending them. The backlash came from other aspects of their approach.)
The skepticism is earned. For decades, productivity metrics in software engineering have been weaponised to micromanage individual contributors rather than optimise systems.
But ignoring metrics entirely is just as dangerous. When you run an engineering organisation on vibes and anecdotal evidence, you end up playing a game of Telephone: the reality of what’s happening on the ground gets distorted as it passes through layers of management, and you lose the ground truth.
The question isn’t if we should measure productivity.
The question is what we should measure.
Why the old dashboards are noisy
- In the era of AI‑coding tools, DORA metrics are no longer enough.
- Most standard dashboards are filled with noise.
DORA metrics – a good smoke alarm, not a diagnosis
| Metric | What it tells you | What it doesn’t tell you |
|---|---|---|
| Deployment Frequency | Speed of releases | Why speed is changing |
| Change Lead Time | Time from commit to production | Bottlenecks in the workflow |
| Change Fail Percentage | Stability of releases | Root causes of failures |
| Failed Deployment Recovery Time | How quickly you recover | Systemic issues causing failures |
DORA is excellent at signalling that something is wrong (e.g., a spike in Change Fail Percentage) but it doesn’t explain why.
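To make the smoke-alarm framing concrete, here is a minimal sketch of how the four DORA numbers fall out of basic deployment records. The `Deployment` fields below are illustrative assumptions, not the schema of any particular tool.

```python
# Minimal sketch: deriving the four DORA metrics from hypothetical deployment
# records. The Deployment fields are illustrative, not any tool's schema.
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

@dataclass
class Deployment:
    commit_time: datetime                      # when the change was committed
    deploy_time: datetime                      # when it reached production
    failed: bool                               # did this deployment cause a failure?
    recovered_time: Optional[datetime] = None  # when service was restored, if it failed

def dora_metrics(deploys: list[Deployment], window_days: int = 30) -> dict:
    """Summarise deployment frequency, lead time, failure rate, and recovery time."""
    if not deploys:
        return {}
    lead_hours = [(d.deploy_time - d.commit_time).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    recovery_hours = [
        (d.recovered_time - d.deploy_time).total_seconds() / 3600
        for d in failures if d.recovered_time
    ]
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_fail_percentage": 100 * len(failures) / len(deploys),
        "median_recovery_hours": median(recovery_hours) if recovery_hours else None,
    }
```

Useful as a smoke alarm, exactly as the table says: the numbers tell you something changed, but nothing in these records explains why.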
AI tools have broken “velocity” as a standalone metric
- A developer can now generate a 500‑line PR in 30 seconds with an LLM.
- The coding‑phase cycle time looks incredible, but if that code is a hallucinated mess that clogs the review process for three days, personal velocity has come at the expense of team throughput.
Research highlight – Tilburg University analysing GitHub activity found:
- Less‑experienced developers gain productivity from AI tools.
- Core developers now review 6.5 % more code and see a 19 % drop in their own original‑code productivity.
- The time shift is towards reviewing AI‑generated submissions.
A systems‑thinking approach: three measurement layers
| Layer | What to measure | Example metrics |
|---|---|---|
| Inputs | What we are investing | Headcount, tooling costs, cloud spend |
| Internals | How the work actually happens | PR workflow, rework, focus time, context‑switching |
| Outputs | What we deliver | Reliability, feature adoption, customer value |
Four specific metrics that matter in 2025
1. Rework Rate (most underrated metric)
Definition: Percentage of code that is rewritten or reverted shortly after being merged.
Why it matters: In an AI‑augmented world it’s easy to ship bad code fast. Data from platforms analysing hundreds of engineering teams reveal a counterintuitive pattern across AI‑adoption levels, with rework peaking in the hybrid middle:
| AI usage level | Rework rate |
|---|---|
| Low (manual) | Standard |
| Hybrid (25‑50 % AI) | Highest |
| High (boilerplate) | Low (AI excels at unit tests & scaffolding) |
Red flag: Cycle time improves and rework rate creeps up → you’re building technical debt faster.
Industry data – GitClear’s analysis of 211 M lines of code:
- Code churn projected to double in 2024 vs. 2021 baseline.
- 7.9 % of newly added code revised within two weeks (vs. 5.5 % in 2020).
- Copy‑pasted code rose from 8.3 % → 12.3 %.
Visibility: Tools like Span’s AI Code Detector now identify AI‑authored code with 95 % accuracy (Python, TypeScript, JavaScript), giving you ground‑truth on adoption patterns and quality impact.
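As a rough starting point before adopting a dedicated platform, rework rate can be approximated from git history. The sketch below assumes you can export, per merged change, the lines added and the lines rewritten or reverted within the measurement window; the `MergedChange` fields are illustrative, not a real tool’s schema.

```python
# Minimal sketch of rework rate: the share of newly merged lines rewritten or
# reverted within a short window after merge. Field names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MergedChange:
    merged_at: datetime
    lines_added: int
    lines_reworked: int  # lines from this change rewritten/reverted within the window

def rework_rate(changes: list[MergedChange]) -> float:
    """Fraction of merged lines that needed rework within the measurement window."""
    added = sum(c.lines_added for c in changes)
    reworked = sum(c.lines_reworked for c in changes)
    return reworked / added if added else 0.0
```

Tracked per team per month, a rising rework rate alongside a falling cycle time is exactly the red flag described above.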
2. Shadow‑Work Ratio
Definition: Proportion of time engineers spend on “invisible” work that isn’t captured in tickets or roadmaps.
Typical breakdown (VP of Engineering view)
- 40 % New Features
- 20 % Tech Debt
- 40 % KTLO (Keep The Lights On)
Engineers’ reality
“I’m on the Platform team but spend 20 h/week fixing Checkout bugs because I’m the only one who knows the legacy codebase.”
IDC study – Developer time allocation:
- 16 % on actual application development.
- 84 % on meetings, context‑switching, and “shadow work”.
Three invisible‑work types (Anton Zaides, Engineering Manager)
- Invisible production support – alerts, ad‑hoc requests.
- Technical glue work – code reviews, planning, mentoring, documentation.
- Shadow backlog – off‑record PM requests, “right‑thing‑doing” without approval.
Case studies: one senior engineer spent more than 40 % of their time on invisible work; another internal team logged roughly 65 % shadow work that had no cost codes and was never billed to any project.
Red flag: High shadow‑work ratio → capacity is being silently stolen.
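One crude but useful proxy: compare git activity against the ticket tracker. The sketch below treats commits whose messages reference no ticket key as shadow work; the JIRA-style key pattern is an assumption to adapt to your tracker, and the result undercounts glue work like reviews and mentoring, so treat it as a lower bound.

```python
# Crude shadow-work proxy: the share of commits whose messages reference no ticket.
# The JIRA-style key pattern is an assumption; adapt it to your tracker.
import re

TICKET_PATTERN = re.compile(r"\b[A-Z]{2,10}-\d+\b")  # e.g. "PAY-123"

def shadow_work_ratio(commit_messages: list[str]) -> float:
    """Fraction of commits not linked to any tracked ticket."""
    if not commit_messages:
        return 0.0
    unticketed = sum(1 for msg in commit_messages if not TICKET_PATTERN.search(msg))
    return unticketed / len(commit_messages)

# Example: shadow_work_ratio(["PAY-123 add retry logic", "hotfix checkout NPE", "bump deps"])
# returns ~0.67: two of the three commits are off-ticket.
```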
3. Focus‑Time Utilisation
Definition: Percentage of uninterrupted time engineers have for deep work (coding, design, problem‑solving).
Why it matters: Context‑switching costs are huge. Studies show a single interruption can add 15‑30 minutes of lost productivity.
How to measure:
- Track calendar “focus blocks” vs. actual meeting time.
- Use IDE plugins that log active coding vs. idle time.
- Correlate with PR throughput and defect rates.
Target: Aim for ≥60 % of weekly hours as protected focus time.
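Here is a minimal sketch of the calendar-based approach from the list above: given one day’s meetings, sum the gaps long enough to count as protected focus and divide by the working day. The two-hour block threshold and the working-day boundaries are illustrative choices, not a standard.

```python
# Minimal sketch: focus-time utilisation from one day's calendar. Working-day
# boundaries and the 2-hour block threshold are illustrative assumptions.
from datetime import datetime, timedelta

def focus_time_utilisation(
    meetings: list[tuple[datetime, datetime]],   # (start, end) pairs
    day_start: datetime,
    day_end: datetime,
    min_block: timedelta = timedelta(hours=2),
) -> float:
    """Share of the working day available in uninterrupted blocks >= min_block."""
    focus = timedelta()
    cursor = day_start
    for start, end in sorted(meetings):
        gap = min(start, day_end) - cursor
        if gap >= min_block:
            focus += gap
        cursor = max(cursor, end)
    if day_end - cursor >= min_block:
        focus += day_end - cursor
    total = day_end - day_start
    return focus / total if total > timedelta() else 0.0
```

Averaged over a week, this gives the utilisation figure to compare against the 60 % target.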
4. AI‑Generated Code Quality Index (AGCQI)
Definition: Composite score that blends rework rate, post‑merge defect density, and AI‑code detection percentage.
Formula (example)
AGCQI = (1 – ReworkRate) × (1 – DefectDensity) × (1 – AI_CodePct)
Interpretation
- Closer to 1 → high quality, low rework, low risky AI code.
- Closer to 0 → frequent rework, many defects, heavy reliance on low‑quality AI output.
Action: Set quarterly thresholds (e.g., AGCQI ≥ 0.85) and investigate any dip.
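The example formula translates directly into code; the sketch below assumes all three inputs are expressed as fractions between 0 and 1, as the interpretation above implies.

```python
# Direct transcription of the example AGCQI formula; all inputs are fractions in [0, 1].
def agcqi(rework_rate: float, defect_density: float, ai_code_pct: float) -> float:
    return (1 - rework_rate) * (1 - defect_density) * (1 - ai_code_pct)

# Example: agcqi(0.08, 0.02, 0.35) ≈ 0.59, well below a 0.85 threshold, so dig in.
```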
Putting it all together – a quick‑start checklist
1. Instrument your stack:
   - Deploy AI‑code detectors (Span, GitHub Advanced Security).
   - Enable PR analytics (GitClear, Linear, Jira).
   - Capture focus‑time data (Clockify, RescueTime, IDE plugins).
2. Create a dashboard with the four metrics above, broken out by team and by time period (weekly, monthly).
3. Set baseline thresholds for each metric (e.g., a maximum acceptable Rework Rate) and investigate sustained deviations.

Red flag: relying solely on project‑management data hides where effort is really going.
New Engineering Intelligence Platforms
Platforms like Span automatically categorise engineering work by analysing git activity, creating an “automated P&L of engineering time.”
- They answer questions with data instead of guesswork.
- They detect AI‑authored code with high accuracy and correlate it with downstream metrics (rework rate, review cycles, bug density).
Key Metrics to Track
1. Rework Rate
Beyond code that gets reverted, rework now shows up in the shifting relationship between time spent writing code and time spent reviewing it.
- As AI writes code instantly, reviewers become the new bottleneck.
- Tilburg University research: each core contributor now reviews ≈10 additional PRs annually.
2. Investment Distribution
- Platforms like Span categorise work (maintenance, innovation, migrations, etc.) by mining commit, PR, and review activity.
- Example insight: the “Innovation” team spends 70 % of its time on maintenance.
3. Review Burden
- Faros AI analysis: code‑review time ↑ 91 % as PR volume outpaces reviewer capacity.
- PR size ↑ 154 % and bug rates ↑ 9 % in the same period.
| Situation | Indicator | Risk |
|---|---|---|
| Too fast | Massive AI‑generated PRs approved in minutes | Quality issues looming |
| Too slow | PRs queue for days as reviewer load climbs | Senior engineers stuck in “Review Hell”, leading to burnout and stalled innovation |
Red flag: a rubber‑stamp “LGTM” culture is as risky as a “Nitpick” culture; the goal is to balance review speed with thoroughness.
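A per-reviewer burden report is straightforward to derive from PR exports. The sketch below assumes you can pull reviewer, PR size, and open/first-review timestamps from your code host; the `ReviewRecord` fields are illustrative.

```python
# Minimal sketch of a per-reviewer burden report from exported PR data.
# The ReviewRecord fields are illustrative, not any code host's API.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class ReviewRecord:
    reviewer: str
    pr_lines_changed: int
    opened_at: datetime
    first_review_at: datetime

def review_burden(reviews: list[ReviewRecord]) -> dict[str, dict]:
    """PRs reviewed, average PR size, and average hours to first review, per reviewer."""
    by_reviewer: dict[str, list[ReviewRecord]] = defaultdict(list)
    for r in reviews:
        by_reviewer[r.reviewer].append(r)
    return {
        reviewer: {
            "prs_reviewed": len(rs),
            "avg_pr_size": mean(r.pr_lines_changed for r in rs),
            "avg_hours_to_first_review": mean(
                (r.first_review_at - r.opened_at).total_seconds() / 3600 for r in rs
            ),
        }
        for reviewer, rs in by_reviewer.items()
    }
```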
4. Fragmented Time
- Measures blocks of deep‑work time (≥ 2 h) vs. time fractured by meetings and interruptions.
Research (UC Irvine, Prof. Gloria Mark):
- Average 23 min 15 s to fully return to a task after an interruption (2020).
- Updated 2023 “Attention Span” study: return time ≈ 25 min; average on‑screen attention dropped from 2.5 min (2004) to 47 s (2021).
- If calendar data shows ≈40 % of engineering capacity lost to context‑switching (the 30‑minute dead zones between meetings), the cheapest productivity win is cancelling meetings. No tool can fix a broken calendar.
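A companion to the focus-time sketch earlier: instead of counting long blocks, this one sums the short dead zones between meetings as an estimate of capacity lost to fragmentation. The 30-minute cut-off mirrors the dead zones described above and is, again, an assumption to tune.

```python
# Companion to the focus-time sketch above: sum the short "dead zones" between
# consecutive meetings as an estimate of capacity lost to fragmentation.
# The 30-minute cut-off is an assumption to tune.
from datetime import datetime, timedelta

def dead_zone_hours(
    meetings: list[tuple[datetime, datetime]],     # (start, end) pairs for one day
    max_useless_gap: timedelta = timedelta(minutes=30),
) -> float:
    """Total hours trapped in gaps too short to use for real work."""
    lost = timedelta()
    ordered = sorted(meetings)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        if timedelta() < gap <= max_useless_gap:
            lost += gap
    return lost.total_seconds() / 3600
```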
Cultural Risks
- Metric misuse: Ranking engineers or incentivising gaming (e.g., splitting one PR into ten tiny ones) destroys trust.
- Golden rule: Metrics are for debugging systems, not people.
| Bad Question | Good Question |
|---|---|
| “Why is Alice slower than Bob?” | “Why is the Checkout Team stuck in code review twice as long as the Platform Team? Do they need better tooling or is tech debt unmanageable?” |
Leaders should aim for a dashboard that acts as a neutral third party, providing objective data to validate what engineers say in 1:1s (e.g., “I’m swamped with maintenance work”).
Implementing the Metrics
| Metric | How to Measure |
|---|---|
| Rework Rate | Measure the share of merged code rewritten or reverted within a short window (git history), alongside time spent authoring vs. reviewing (git logs, review timestamps). |
| Investment Distribution | Categorise commits/PRs by purpose (maintenance, feature, migration) using AI‑driven tagging. |
| Review Burden | Count PRs per reviewer, average review time, and PR size; compare against reviewer capacity. |
| Fragmented Time | Pull calendar data (meeting blocks) and calculate uninterrupted windows ≥ 2 h. |
| AI Code Quality | Detect AI‑authored code (Span’s code‑level detection) and correlate with bug density, rework, and review cycles. |
First step: Know how much AI‑generated code you’re actually shipping. Most teams rely on unreliable self‑reported surveys; switch to automated detection.
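Before you have AI-driven tagging in place, a keyword heuristic over commit messages gives a first-cut investment distribution. The categories and keywords below are illustrative assumptions, not Span’s taxonomy.

```python
# Rough keyword-based stand-in for AI-driven work categorisation. The categories
# and keywords are illustrative assumptions, not Span's taxonomy.
from collections import Counter

CATEGORY_KEYWORDS = {
    "maintenance": ("fix", "bug", "hotfix", "patch", "revert"),
    "migration": ("migrate", "upgrade", "deprecate"),
    "feature": ("feat", "add", "implement", "launch"),
}

def investment_distribution(commit_messages: list[str]) -> dict[str, float]:
    """Share of commits falling into each (first-matching) category."""
    counts: Counter = Counter()
    for msg in commit_messages:
        lower = msg.lower()
        category = next(
            (cat for cat, words in CATEGORY_KEYWORDS.items()
             if any(word in lower for word in words)),
            "other",
        )
        counts[category] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

Even a crude breakdown like this is enough to spot an “Innovation” team spending most of its time on maintenance.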
The Counter‑Intuitive Insight
- It’s not about reducing AI usage.
- Teams that coach engineers to use AI for complete, well‑scoped tasks (rather than blending human and AI authorship within the same change) see better outcomes, consistent with the rework data above: the hybrid zone is where rework peaks.
The Future of Engineering Intelligence
- Old proxies (lines of code, commit counts) are dead.
- Even modern standards like cycle time are insufficient on their own.
To navigate the next few years, understand the interplay between human creativity and AI leverage. Measure AI code quality, not just volume.
Take Action
- Adopt an engineering intelligence platform (e.g., Span) to get automated, accurate metrics.
- Audit your calendar – eliminate low‑value meetings that fragment deep work.
- Coach engineers on deliberate AI usage for whole tasks.
- Use metrics to debug the system, not to rank people.
If you’re ready to move beyond vanity metrics and gain real insight into your engineering organization, check out Span to see how engineering‑intelligence platforms are helping teams measure what actually matters.