Why Accuracy Lies — The Metrics That Actually Matter (Part 4)

Published: December 2, 2025 at 10:18 PM EST
3 min read
Source: Dev.to


Accuracy is the most widely used metric in machine learning.
It’s also the most misleading. In real‑world production ML systems, accuracy can make a bad model look good, hide failures, distort business decisions, and even create the illusion of success before causing catastrophic downstream impact.

Accuracy is a vanity metric. It tells you almost nothing about real ML performance.

The Accuracy Trap

Accuracy formula

Accuracy = Correct predictions / Total predictions

When accuracy breaks

  • Classes are imbalanced
  • Rare events matter more
  • Cost of mistakes is different
  • Distribution changes
  • Confidence matters

Most real ML use cases suffer from one or more of these issues.

Classic Example: Fraud Detection

  • Dataset: 10,000 normal transactions, 12 frauds
  • Model: predicts everything as “normal”
Accuracy = 99.88%

The model catches 0 frauds → useless. Accuracy hides the failure.
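
A minimal sketch of this trap in scikit-learn, reconstructing the counts above:

```python
# Toy reconstruction of the fraud example: 10,000 normal transactions,
# 12 frauds, and a "model" that labels everything as normal.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 10_000 + [1] * 12  # 0 = normal, 1 = fraud
y_pred = [0] * 10_012             # predict "normal" every time

print(accuracy_score(y_true, y_pred))  # ~0.9988 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0 -- catches zero frauds
```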

Why Accuracy Fails

| Problem | Why Accuracy Is Useless |
| --- | --- |
| Class imbalance | Majority class dominates |
| Rare events | Accuracy ignores the minority class |
| Cost-sensitive predictions | Wrong predictions have different penalties |
| Real-world data shift | Accuracy stays the same while failures increase |
| Business KPIs | Accuracy doesn't measure financial impact |

Accuracy ≠ business value.

Metrics That Actually Matter

1. Precision

Definition: Of all predicted positives, how many were correct?

Use when: False positives are costly (e.g., spam detection, fraud alerts).

Formula

Precision = TP / (TP + FP)
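
For intuition, here is the formula with made-up spam-filter counts (the numbers are purely illustrative):

```python
# Hypothetical spam-filter outcomes, purely illustrative.
tp = 90  # spam correctly flagged
fp = 30  # legitimate mail wrongly flagged -- the costly mistake here

precision = tp / (tp + fp)
print(precision)  # 0.75 -> one in four flagged mails is actually legitimate
```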

2. Recall

Definition: Of all actual positives, how many did the model identify?

Use when: False negatives are costly (e.g., cancer detection, intrusion detection).

Formula

Recall = TP / (TP + FN)
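
The same idea with hypothetical screening counts, where misses are the expensive error:

```python
# Hypothetical screening outcomes, purely illustrative.
tp = 40  # actual positives the model caught
fn = 10  # actual positives the model missed -- the costly mistake here

recall = tp / (tp + fn)
print(recall)  # 0.8 -> 20% of real positives slip through undetected
```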

3. F1 Score

Definition: Harmonic mean of precision & recall.

Use when: A balance between precision and recall is needed.

Formula

F1 = 2 * (Precision * Recall) / (Precision + Recall)
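
Because F1 is a harmonic mean, one weak score drags it down far more than a simple average would, as this small sketch shows:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.75, 0.80))  # ~0.774 -- balanced inputs, F1 sits close to both
print(f1(0.90, 0.30))  # 0.45 -- the weak recall dominates the score
```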

4. ROC‑AUC

Measures how well the model separates classes. Common in credit scoring and risk ranking. Higher AUC indicates better separation.
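
A minimal sketch with invented labels and scores; note that `roc_auc_score` takes predicted scores, not hard labels:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1]              # toy labels
y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9]  # model-assigned scores

# AUC = probability that a random positive outranks a random negative.
print(roc_auc_score(y_true, y_score))  # 0.875: one negative outranks a positive
```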

5. PR‑AUC

More informative than ROC‑AUC for highly imbalanced datasets. Used for fraud, rare defects, anomaly detection.
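
A sketch of why, using synthetic scores on a heavily imbalanced set; the exact values depend on the random seed, but the gap between the two metrics is the point:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
y_true  = np.concatenate([np.zeros(1_000), np.ones(10)])
y_score = np.concatenate([
    rng.normal(0.2, 0.1, 1_000),  # negatives: mostly low scores
    rng.normal(0.5, 0.1, 10),     # positives: higher, but overlapping
])

print(roc_auc_score(y_true, y_score))            # typically very high
print(average_precision_score(y_true, y_score))  # typically far lower
```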

6. Log Loss (Cross Entropy)

Evaluates the correctness of predicted probabilities. Important when confidence matters and probabilities drive decisions.
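
A small sketch: two models with identical hard-label accuracy can have very different log loss, because confident mistakes are punished heavily (all numbers are invented):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]

# Both probability sets give the same predictions after thresholding
# at 0.5 (50% accuracy), but with very different confidence.
hedged    = [0.70, 0.30, 0.40, 0.60]  # wrong twice, but not confidently
confident = [0.99, 0.01, 0.01, 0.99]  # wrong twice, with near certainty

print(log_loss(y_true, hedged))     # ~0.64
print(log_loss(y_true, confident))  # ~2.31 -- confident errors explode the loss
```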

7. Cost‑Based Metrics

Accuracy ignores cost; real ML does not.

Example

  • False negative cost = ₹5,000
  • False positive cost = ₹50

Formula

Total Cost = (FN * Cost_FN) + (FP * Cost_FP)

Enterprises use such cost‑based calculations to measure real model impact.
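
Plugging the example costs into the formula, with hypothetical confusion-matrix counts:

```python
# Example costs from above; the FN/FP counts are hypothetical.
cost_fn = 5_000  # cost of a missed fraud (false negative), in rupees
cost_fp = 50     # cost of a false alarm (false positive), in rupees

fn, fp = 8, 120  # hypothetical confusion-matrix counts

total_cost = fn * cost_fn + fp * cost_fp
print(total_cost)  # 46000 -- eight misses cost far more than 120 false alarms
```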

How to Pick the Right Metric — Practical Cheat Sheet

| Use Case | Best Metrics |
| --- | --- |
| Fraud detection | Recall, F1, PR-AUC |
| Medical diagnosis | Recall |
| Spam detection | Precision |
| Churn prediction | F1, Recall |
| Credit scoring | ROC-AUC, KS |
| Product ranking | MAP@k, NDCG |
| NLP classification | F1 |
| Forecasting | RMSE, MAPE |

The Real Lesson

Accuracy is for beginners. Real ML engineers choose metrics that reflect business value.

Accuracy can be high while:

  • Profit drops
  • Risk increases
  • Users churn
  • Fraud bypasses detection
  • Trust collapses

Metrics must match:

  • The domain
  • The cost of mistakes
  • The real‑world distribution

Key Takeaways

| Insight | Meaning |
| --- | --- |
| Accuracy is misleading | Never use it alone |
| Choose the metric per use case | There is no universal metric |
| Precision/recall matter more | Especially under class imbalance |
| ROC-AUC & PR-AUC give deeper insight | Useful for ranking and rare events |
| Always tie metrics to business | ML is about impact, not just math |

Coming Next — Part 5

Overfitting & Underfitting — Beyond Textbook Definitions
Real symptoms, real debugging, real engineering fixes.
