Understanding Errors in Machine Learning: Accuracy, Precision, Recall & F1 Score
Machine Learning models are often judged by numbers, but many beginners (and even practitioners) misunderstand what those numbers actually mean. A model showing 95% accuracy might still be useless in real‑world scenarios.
In this post we’ll break down:
- Types of errors in Machine Learning
- Confusion matrix
- Accuracy
- Precision
- Recall
- F1 Score
All explained intuitively, with examples you can confidently use in interviews or projects.
1️⃣ Types of Errors in Machine Learning
In a classification problem, predictions fall into four categories:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
🔴 False Positive (Type I Error)
Model predicts Positive, but the actual result is Negative.
Example: An email is marked as Spam but it is actually Not Spam.
🔵 False Negative (Type II Error)
Model predicts Negative, but the actual result is Positive.
Example: A medical test says No Disease but the patient actually has it.
These errors directly impact evaluation metrics.
2️⃣ Confusion Matrix (The Foundation)
A confusion matrix summarizes prediction results:
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | TP | FN |
| Actual − | FP | TN |
All metrics are derived from this table.
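If you are working in Python, scikit-learn (an assumption here; the post itself is library‑agnostic) can compute this table directly. A minimal sketch with made‑up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels {0, 1}, sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3
```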
3️⃣ Accuracy
📌 Definition
Accuracy measures how often the model is correct.
📐 Formula
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \]
❗ Problem with Accuracy
Accuracy can be misleading on imbalanced datasets.
Example
- 99 normal patients
- 1 patient with disease
If the model predicts No Disease for everyone:
\[ \text{Accuracy} = \frac{99}{100} = 99\% \]
The model is dangerous despite the high accuracy. → Accuracy alone is not enough.
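Here is that exact scenario as a quick sketch (hypothetical labels, scikit-learn assumed):

```python
from sklearn.metrics import accuracy_score

# 99 healthy patients (0) and 1 sick patient (1)
y_true = [0] * 99 + [1]
# A "model" that predicts No Disease for everyone
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.99 — yet the one sick patient is missed
```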
4️⃣ Precision
📌 Definition
Of all predicted positives, how many are actually positive?
📐 Formula
\[ \text{Precision} = \frac{TP}{TP + FP} \]
🎯 When to focus on Precision?
When False Positives are costly.
Examples
- Spam detection
- Fraud detection
You don’t want to wrongly flag legitimate cases.
5️⃣ Recall (Sensitivity)
📌 Definition
Of all actual positives, how many did the model correctly identify?
📐 Formula
\[ \text{Recall} = \frac{TP}{TP + FN} \]
🎯 When to focus on Recall?
When False Negatives are dangerous.
Examples
- Cancer detection
- Accident detection
Missing a positive case can have severe consequences.
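Continuing the same hypothetical disease example, Precision and Recall expose what accuracy hides. Note that Precision is undefined when the model predicts no positives at all; scikit-learn's `zero_division=0` argument simply reports 0 in that case:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0] * 99 + [1]   # 1 actual positive
y_pred = [0] * 100        # model never predicts positive

print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 — no correct positive predictions
print(recall_score(y_true, y_pred))                      # 0.0 — the one sick patient was missed
```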
6️⃣ Precision ↔ Recall Trade‑off
Increasing Precision often decreases Recall, and vice‑versa.
| Scenario | Priority |
|---|---|
| Spam filter | Precision |
| Disease detection | Recall |
| Fraud detection | Recall |
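One way to see the trade‑off in practice is to sweep the decision threshold of a probabilistic classifier: raising the threshold tends to increase Precision and decrease Recall. A minimal sketch with made‑up scores (scikit-learn assumed):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities
y_true  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.45, 0.65, 0.9, 0.2, 0.55]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# Precision rises (0.67 -> 1.00 -> 1.00) while Recall falls (1.00 -> 0.75 -> 0.25)
```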
This trade‑off leads us to the F1 Score.
7️⃣ F1 Score
📌 Definition
The harmonic mean of Precision and Recall.
📐 Formula
\[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
✅ Why F1 Score?
- Balances Precision & Recall
- Works well for imbalanced datasets
- Penalises extreme values (if either Precision or Recall is low, F1 drops sharply)
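As a quick sanity check, the harmonic mean computed by hand matches scikit-learn's `f1_score` (illustrative labels only):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)   # TP=2, FP=1 -> 2/3
r = recall_score(y_true, y_pred)      # TP=2, FN=2 -> 1/2
f1_manual = 2 * p * r / (p + r)

print(round(f1_manual, 4), round(f1_score(y_true, y_pred), 4))  # both ~0.5714
```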
8️⃣ Summary Table
| Metric | Best Used When | Focus |
|---|---|---|
| Accuracy | Balanced data | Overall correctness |
| Precision | False Positives costly | Prediction quality |
| Recall | False Negatives costly | Detection completeness |
| F1 Score | Imbalanced data | Balanced performance |
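In practice you rarely compute these one by one: scikit-learn's `classification_report` (again an assumption; any ML toolkit has an equivalent) prints Precision, Recall, F1 and support per class in a single call:

```python
from sklearn.metrics import classification_report

# Hypothetical labels for illustration
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```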
9️⃣ Real‑World Case Studies
Understanding metrics becomes clearer when we map them to real‑world problems. Below are some common, interview‑relevant case studies.
🏥 Case Study 1: Disease Detection (Cancer / COVID)
- Scenario: Model predicts whether a patient has a disease.
- Critical error: False Negative – predicting Healthy when the patient is actually sick.
- Why Recall matters more: Missing a sick patient can delay treatment and cost lives. Some false alarms (FPs) are acceptable.
Primary metric: Recall
💳 Case Study 2: Credit‑Card Fraud Detection
- Scenario: Model identifies fraudulent transactions.
- Critical error: False Negative – fraud marked as legitimate.
- Trade‑off: Too many FPs annoy customers; too many FNs cause financial loss.
Best metric: F1 Score (balances FP and FN costs)
📧 Case Study 3: Spam Email Detection
- Scenario: Classify emails as Spam or Not Spam.
- Critical error: False Positive – important email marked as spam.
- Why Precision matters: Users may miss critical emails (job offers, OTPs, invoices).
Primary metric: Precision
🚗 Case Study 4: Autonomous Driving (Pedestrian Detection)
- Scenario: Detect pedestrians using camera and sensor data.
- Critical error: False Negative – pedestrian not detected.
- Why Recall is crucial: Missing even one pedestrian can be fatal.
Primary metric: Recall
🏭 Case Study 5: Manufacturing Defect Detection
- Scenario: Detect defective products on an assembly line.
- Critical error depends on context:
  - High FP → waste & increased cost
  - High FN → faulty product reaches the customer
- Balanced approach: Use both Precision and Recall.
Best metric: F1 Score
🔚 Final Thoughts
Never blindly trust accuracy. Always ask:
- Which error (FP or FN) is more dangerous?
- Is my dataset imbalanced?
- What is the real‑world cost of a false positive vs. a false negative?
Understanding these metrics lets you choose the right evaluation strategy for any problem, and it makes you a better ML engineer, not just a model builder.
If this helped you, feel free to share or comment your favorite ML pitfall!