Stop Measuring Noise: The Productivity Metrics That Really Matter in Software Engineering
Source: Dev.to
Introduction
Productivity has become a dirty word in engineering.
Mention it in a Slack channel and the immediate assumption is that management is looking for a reason to fire the bottom 10 % or that McKinsey is back with another controversial report. (For what it’s worth, their 2023 report actually warns against using overly simple measurements—such as lines of code produced or number of code commits—rather than recommending them. The backlash came from other aspects of their approach.)
The skepticism is earned. For decades, productivity metrics in software engineering have been weaponised to micromanage individual contributors rather than optimise systems.
But ignoring metrics entirely is just as dangerous. When you run an engineering organisation on vibes and anecdotal evidence, you end up playing a game of Telephone: the reality of what’s happening on the ground gets distorted as it passes through layers of management, and you lose the ground truth.
The question isn’t if we should measure productivity.
The question is what we should measure.
Why the old dashboards are noisy
- In the era of AI‑coding tools, DORA metrics are no longer enough.
- Most standard dashboards are filled with noise.
DORA metrics – a good smoke alarm, not a diagnosis
| Metric | What it tells you | What it doesn’t tell you |
|---|---|---|
| Deployment Frequency | Speed of releases | Why speed is changing |
| Change Lead Time | Time from commit to production | Bottlenecks in the workflow |
| Change Fail Percentage | Stability of releases | Root causes of failures |
| Failed Deployment Recovery Time | How quickly you recover | Systemic issues causing failures |
DORA is excellent at signalling that something is wrong (e.g., a spike in Change Fail Percentage) but it doesn’t explain why.
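To make the smoke-alarm framing concrete, here is a minimal sketch of how the four DORA numbers fall out of basic deployment records. The `Deployment` fields below are illustrative assumptions, not the schema of any particular tool.

```python
# Minimal sketch: deriving the four DORA metrics from hypothetical deployment
# records. The Deployment fields are illustrative, not any tool's schema.
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

@dataclass
class Deployment:
    commit_time: datetime                      # when the change was committed
    deploy_time: datetime                      # when it reached production
    failed: bool                               # did this deployment cause a failure?
    recovered_time: Optional[datetime] = None  # when service was restored, if it failed

def dora_metrics(deploys: list[Deployment], window_days: int = 30) -> dict:
    """Summarise deployment frequency, lead time, failure rate, and recovery time."""
    if not deploys:
        return {}
    lead_hours = [(d.deploy_time - d.commit_time).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    recovery_hours = [
        (d.recovered_time - d.deploy_time).total_seconds() / 3600
        for d in failures if d.recovered_time
    ]
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_fail_percentage": 100 * len(failures) / len(deploys),
        "median_recovery_hours": median(recovery_hours) if recovery_hours else None,
    }
```

Useful as a smoke alarm, exactly as the table says: the numbers tell you something changed, but nothing in these records explains why.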
AI tools have broken “velocity” as a standalone metric
- A developer can now generate a 500‑line PR in 30 seconds with an LLM.
- The coding‑phase cycle time looks incredible, but if that code is a hallucinated mess that clogs the review process for three days, personal velocity has come at the expense of team throughput.
Research highlight – Tilburg University analysing GitHub activity found:
- Less‑experienced developers gain productivity from AI tools.
- Core developers now review 6.5 % more code and see a 19 % drop in their own original‑code productivity.
- The time shift is towards reviewing AI‑generated submissions.
A systems‑thinking approach: three measurement layers
| Layer | What to measure | Example metrics |
|---|---|---|
| Inputs | What we are investing | Headcount, tooling costs, cloud spend |
| Internals | How the work actually happens | PR workflow, rework, focus time, context‑switching |
| Outputs | What we deliver | Reliability, feature adoption, customer value |
Four specific metrics that matter in 2025
1. Rework Rate (most underrated metric)
Definition: Percentage of code that is rewritten or reverted shortly after being merged.
Why it matters: In an AI‑augmented world it’s easy to ship bad code fast. Data from platforms analysing hundreds of engineering teams reveal a counterintuitive pattern across AI‑adoption levels, with rework peaking in the hybrid middle:
| AI usage level | Rework rate |
|---|---|
| Low (manual) | Standard |
| Hybrid (25‑50 % AI) | Highest |
| High (boilerplate) | Low (AI excels at unit tests & scaffolding) |
Red flag: Cycle time improves and rework rate creeps up → you’re building technical debt faster.
Industry data – GitClear’s analysis of 211 M lines of code:
- Code churn projected to double in 2024 vs. 2021 baseline.
- 7.9 % of newly added code revised within two weeks (vs. 5.5 % in 2020).
- Copy‑pasted code rose from 8.3 % → 12.3 %.
Visibility: Tools like Span’s AI Code Detector now identify AI‑authored code with 95 % accuracy (Python, TypeScript, JavaScript), giving you ground‑truth on adoption patterns and quality impact.
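As a rough starting point before adopting a dedicated platform, rework rate can be approximated from git history. The sketch below assumes you can export, per merged change, the lines added and the lines rewritten or reverted within the measurement window; the `MergedChange` fields are illustrative, not a real tool’s schema.

```python
# Minimal sketch of rework rate: the share of newly merged lines rewritten or
# reverted within a short window after merge. Field names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MergedChange:
    merged_at: datetime
    lines_added: int
    lines_reworked: int  # lines from this change rewritten/reverted within the window

def rework_rate(changes: list[MergedChange]) -> float:
    """Fraction of merged lines that needed rework within the measurement window."""
    added = sum(c.lines_added for c in changes)
    reworked = sum(c.lines_reworked for c in changes)
    return reworked / added if added else 0.0
```

Tracked per team per month, a rising rework rate alongside a falling cycle time is exactly the red flag described above.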
2. Shadow‑Work Ratio
Definition: Proportion of time engineers spend on “invisible” work that isn’t captured in tickets or roadmaps.
Typical breakdown (VP of Engineering view)
- 40 % New Features
- 20 % Tech Debt
- 40 % KTLO (Keep The Lights On)
Engineers’ reality
“I’m on the Platform team but spend 20 h/week fixing Checkout bugs because I’m the only one who knows the legacy codebase.”
IDC study – Developer time allocation:
- 16 % on actual application development.
- 84 % on meetings, context‑switching, and “shadow work”.
Three invisible‑work types (Anton Zaides, Engineering Manager)
- Invisible production support – alerts, ad‑hoc requests.
- Technical glue work – code reviews, planning, mentoring, documentation.
- Shadow backlog – off‑record PM requests, “right‑thing‑doing” without approval.
Case studies: one senior engineer spent more than 40 % of their time on invisible work; another internal team logged roughly 65 % shadow work that had no cost codes and was never billed to any project.
Red flag: High shadow‑work ratio → capacity is being silently stolen.
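One crude but useful proxy: compare git activity against the ticket tracker. The sketch below treats commits whose messages reference no ticket key as shadow work; the JIRA-style key pattern is an assumption to adapt to your tracker, and the result undercounts glue work like reviews and mentoring, so treat it as a lower bound.

```python
# Crude shadow-work proxy: the share of commits whose messages reference no ticket.
# The JIRA-style key pattern is an assumption; adapt it to your tracker.
import re

TICKET_PATTERN = re.compile(r"\b[A-Z]{2,10}-\d+\b")  # e.g. "PAY-123"

def shadow_work_ratio(commit_messages: list[str]) -> float:
    """Fraction of commits not linked to any tracked ticket."""
    if not commit_messages:
        return 0.0
    unticketed = sum(1 for msg in commit_messages if not TICKET_PATTERN.search(msg))
    return unticketed / len(commit_messages)

# Example: shadow_work_ratio(["PAY-123 add retry logic", "hotfix checkout NPE", "bump deps"])
# returns ~0.67: two of the three commits are off-ticket.
```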
3. Focus‑Time Utilisation
Definition: Percentage of uninterrupted time engineers have for deep work (coding, design, problem‑solving).
Why it matters: Context‑switching costs are huge. Studies show a single interruption can add 15‑30 minutes of lost productivity.
How to measure:
- Track calendar “focus blocks” vs. actual meeting time.
- Use IDE plugins that log active coding vs. idle time.
- Correlate with PR throughput and defect rates.
Target: Aim for ≥60 % of weekly hours as protected focus time.
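Here is a minimal sketch of the calendar-based approach from the list above: given one day’s meetings, sum the gaps long enough to count as protected focus and divide by the working day. The two-hour block threshold and the working-day boundaries are illustrative choices, not a standard.

```python
# Minimal sketch: focus-time utilisation from one day's calendar. Working-day
# boundaries and the 2-hour block threshold are illustrative assumptions.
from datetime import datetime, timedelta

def focus_time_utilisation(
    meetings: list[tuple[datetime, datetime]],   # (start, end) pairs
    day_start: datetime,
    day_end: datetime,
    min_block: timedelta = timedelta(hours=2),
) -> float:
    """Share of the working day available in uninterrupted blocks >= min_block."""
    focus = timedelta()
    cursor = day_start
    for start, end in sorted(meetings):
        gap = min(start, day_end) - cursor
        if gap >= min_block:
            focus += gap
        cursor = max(cursor, end)
    if day_end - cursor >= min_block:
        focus += day_end - cursor
    total = day_end - day_start
    return focus / total if total > timedelta() else 0.0
```

Averaged over a week, this gives the utilisation figure to compare against the 60 % target.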
4. AI‑Generated Code Quality Index (AGCQI)
Definition: Composite score that blends rework rate, post‑merge defect density, and AI‑code detection percentage.
Formula (example)
AGCQI = (1 – ReworkRate) × (1 – DefectDensity) × (1 – AI_CodePct)
Interpretation
- Closer to 1 → high quality, low rework, low risky AI code.
- Closer to 0 → frequent rework, many defects, heavy reliance on low‑quality AI output.
Action: Set quarterly thresholds (e.g., AGCQI ≥ 0.85) and investigate any dip.
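The example formula translates directly into code; the sketch below assumes all three inputs are expressed as fractions between 0 and 1, as the interpretation above implies.

```python
# Direct transcription of the example AGCQI formula; all inputs are fractions in [0, 1].
def agcqi(rework_rate: float, defect_density: float, ai_code_pct: float) -> float:
    return (1 - rework_rate) * (1 - defect_density) * (1 - ai_code_pct)

# Example: agcqi(0.08, 0.02, 0.35) ≈ 0.59, well below a 0.85 threshold, so dig in.
```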
Putting it all together – a quick‑start checklist
1. Instrument your stack:
   - Deploy AI‑code detectors (Span, GitHub Advanced Security).
   - Enable PR analytics (GitClear, Linear, Jira).
   - Capture focus‑time data (Clockify, RescueTime, IDE plugins).
2. Create a dashboard with the four metrics above, broken out by team and by time period (weekly, monthly).
3. Set baseline thresholds for each metric (e.g., a maximum acceptable Rework Rate) and investigate sustained deviations.

Red flag: relying solely on project‑management data hides where effort is really going.
New Engineering Intelligence Platforms
Platforms like Span automatically categorise engineering work by analysing git activity, creating an “automated P&L of engineering time.”
- They answer questions with data instead of guesswork.
- They detect AI‑authored code with high accuracy and correlate it with downstream metrics (rework rate, review cycles, bug density).
Key Metrics to Track
1. Rework Rate
Beyond code that gets reverted, rework now shows up in the shifting relationship between time spent writing code and time spent reviewing it.
- As AI writes code instantly, reviewers become the new bottleneck.
- Tilburg University research: each core contributor now reviews ≈10 additional PRs annually.
2. Investment Distribution
- Platforms like Span categorise work (maintenance, innovation, migrations, etc.) by mining commit, PR, and review activity.
- Example insight: the “Innovation” team spends 70 % of its time on maintenance.
3. Review Burden
- Faros AI analysis: code‑review time ↑ 91 % as PR volume outpaces reviewer capacity.
- PR size ↑ 154 % and bug rates ↑ 9 % in the same period.
| Situation | Indicator | Risk |
|---|---|---|
| Too fast | Massive AI‑generated PRs approved in minutes | Quality issues looming |
| Too slow | PRs queue for days as reviewer load climbs | Senior engineers stuck in “Review Hell”, leading to burnout and stalled innovation |
Red flag: a rubber‑stamp “LGTM” culture is as risky as a “Nitpick” culture; the goal is to balance review speed with thoroughness.
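A per-reviewer burden report is straightforward to derive from PR exports. The sketch below assumes you can pull reviewer, PR size, and open/first-review timestamps from your code host; the `ReviewRecord` fields are illustrative.

```python
# Minimal sketch of a per-reviewer burden report from exported PR data.
# The ReviewRecord fields are illustrative, not any code host's API.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class ReviewRecord:
    reviewer: str
    pr_lines_changed: int
    opened_at: datetime
    first_review_at: datetime

def review_burden(reviews: list[ReviewRecord]) -> dict[str, dict]:
    """PRs reviewed, average PR size, and average hours to first review, per reviewer."""
    by_reviewer: dict[str, list[ReviewRecord]] = defaultdict(list)
    for r in reviews:
        by_reviewer[r.reviewer].append(r)
    return {
        reviewer: {
            "prs_reviewed": len(rs),
            "avg_pr_size": mean(r.pr_lines_changed for r in rs),
            "avg_hours_to_first_review": mean(
                (r.first_review_at - r.opened_at).total_seconds() / 3600 for r in rs
            ),
        }
        for reviewer, rs in by_reviewer.items()
    }
```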
4. Fragmented Time
- Measures blocks of deep‑work time (≥ 2 h) vs. time fractured by meetings and interruptions.
Research (UC Irvine, Prof. Gloria Mark):
- Average 23 min 15 s to fully return to a task after an interruption (2020).
- Updated 2023 “Attention Span” study: return time ≈ 25 min; average on‑screen attention dropped from 2.5 min (2004) to 47 s (2021).
- If calendar data shows ≈40 % of engineering capacity lost to context‑switching (the 30‑minute dead zones between meetings), the cheapest productivity win is cancelling meetings. No tool can fix a broken calendar.
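A companion to the focus-time sketch earlier: instead of counting long blocks, this one sums the short dead zones between meetings as an estimate of capacity lost to fragmentation. The 30-minute cut-off mirrors the dead zones described above and is, again, an assumption to tune.

```python
# Companion to the focus-time sketch above: sum the short "dead zones" between
# consecutive meetings as an estimate of capacity lost to fragmentation.
# The 30-minute cut-off is an assumption to tune.
from datetime import datetime, timedelta

def dead_zone_hours(
    meetings: list[tuple[datetime, datetime]],     # (start, end) pairs for one day
    max_useless_gap: timedelta = timedelta(minutes=30),
) -> float:
    """Total hours trapped in gaps too short to use for real work."""
    lost = timedelta()
    ordered = sorted(meetings)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        if timedelta() < gap <= max_useless_gap:
            lost += gap
    return lost.total_seconds() / 3600
```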
Cultural Risks
- Metric misuse: Ranking engineers or incentivising gaming (e.g., splitting one PR into ten tiny ones) destroys trust.
- Golden rule: Metrics are for debugging systems, not people.
| Bad Question | Good Question |
|---|---|
| “Why is Alice slower than Bob?” | “Why is the Checkout Team stuck in code review twice as long as the Platform Team? Do they need better tooling or is tech debt unmanageable?” |
Leaders should aim for a dashboard that acts as a neutral third party, providing objective data to validate what engineers say in 1:1s (e.g., “I’m swamped with maintenance work”).
Implementing the Metrics
| Metric | How to Measure |
|---|---|
| Rework Rate | Measure the share of merged code rewritten or reverted within a short window (git history), alongside time spent authoring vs. reviewing (git logs, review timestamps). |
| Investment Distribution | Categorise commits/PRs by purpose (maintenance, feature, migration) using AI‑driven tagging. |
| Review Burden | Count PRs per reviewer, average review time, and PR size; compare against reviewer capacity. |
| Fragmented Time | Pull calendar data (meeting blocks) and calculate uninterrupted windows ≥ 2 h. |
| AI Code Quality | Detect AI‑authored code (Span’s code‑level detection) and correlate with bug density, rework, and review cycles. |
First step: Know how much AI‑generated code you’re actually shipping. Most teams rely on unreliable self‑reported surveys; switch to automated detection.
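Before you have AI-driven tagging in place, a keyword heuristic over commit messages gives a first-cut investment distribution. The categories and keywords below are illustrative assumptions, not Span’s taxonomy.

```python
# Rough keyword-based stand-in for AI-driven work categorisation. The categories
# and keywords are illustrative assumptions, not Span's taxonomy.
from collections import Counter

CATEGORY_KEYWORDS = {
    "maintenance": ("fix", "bug", "hotfix", "patch", "revert"),
    "migration": ("migrate", "upgrade", "deprecate"),
    "feature": ("feat", "add", "implement", "launch"),
}

def investment_distribution(commit_messages: list[str]) -> dict[str, float]:
    """Share of commits falling into each (first-matching) category."""
    counts: Counter = Counter()
    for msg in commit_messages:
        lower = msg.lower()
        category = next(
            (cat for cat, words in CATEGORY_KEYWORDS.items()
             if any(word in lower for word in words)),
            "other",
        )
        counts[category] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

Even a crude breakdown like this is enough to spot an “Innovation” team spending most of its time on maintenance.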
The Counter‑Intuitive Insight
- It’s not about reducing AI usage.
- Teams that coach engineers to use AI for complete, well‑scoped tasks (rather than blending human and AI authorship within the same change) see better outcomes, consistent with the rework data above: the hybrid zone is where rework peaks.
The Future of Engineering Intelligence
- Old proxies (lines of code, commit counts) are dead.
- Even modern standards like cycle time are insufficient on their own.
To navigate the next few years, understand the interplay between human creativity and AI leverage. Measure AI code quality, not just volume.
Take Action
- Adopt an engineering intelligence platform (e.g., Span) to get automated, accurate metrics.
- Audit your calendar – eliminate low‑value meetings that fragment deep work.
- Coach engineers on deliberate AI usage for whole tasks.
- Use metrics to debug the system, not to rank people.
If you’re ready to move beyond vanity metrics and gain real insight into your engineering organization, check out Span to see how engineering‑intelligence platforms are helping teams measure what actually matters.