The Math You Actually Need as a Data Engineer
Source: Dev.to
Introduction
Let’s address the elephant in the room.
“Do I need to be good at math to become a data engineer?”
I’ve heard this question hundreds of times from students and career‑changers. The answer might surprise you:
You need less math than you think — but more than zero.
Data engineering is not data science. You won’t be deriving gradient‑descent formulas or proving statistical theorems. But you will encounter mathematical concepts daily, often without realizing it.
In this article I’ll cover exactly what you need to know — practical mathematics that makes you a better engineer, without the academic overhead.
Let me be direct. Here’s what matters:
| Concept | How Often You’ll Use It |
|---|---|
| Set Theory | Daily |
| Boolean Logic | Daily |
| Basic Statistics | Weekly |
| Aggregations & Arithmetic | Daily |
| Probability Basics | Occasionally |
| Linear Algebra | Rarely |
| Calculus | Almost never |
If you struggled with advanced mathematics in school, don’t let that stop you. Most data‑engineering math is intuitive once you see how it applies.
Set Theory
Every SQL query you write is set theory in action. If you understand sets, you understand SQL. It’s that fundamental.
A set is a collection of distinct elements.
Set A = {1, 2, 3, 4, 5}
Set B = {4, 5, 6, 7, 8}
These map directly to SQL:
| Set Operation | SQL Equivalent | Result |
|---|---|---|
| Union | UNION | All elements in A or B |
| Intersection | INNER JOIN | Elements in both A and B |
| Difference | EXCEPT / LEFT JOIN | Elements in A but not B |
| Cartesian Product | CROSS JOIN | All combinations of A and B |
Example Queries
-- Union: Combine two sets
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;
-- Intersection: Find common elements
SELECT a.customer_id
FROM online_orders a
INNER JOIN store_orders b ON a.customer_id = b.customer_id;
-- Difference: Find elements only in first set
SELECT customer_id FROM online_orders
EXCEPT
SELECT customer_id FROM store_orders;
Visualising joins as Venn diagrams helps immensely:
- INNER JOIN – only the overlap
- LEFT JOIN – all of left + overlap
- RIGHT JOIN – all of right + overlap
- FULL OUTER JOIN – everything
If set theory clicks, SQL becomes intuitive.
Boolean Logic
Every WHERE clause, IF statement, and CASE expression is Boolean logic.
| Operator | Meaning | Example |
|---|---|---|
AND | Both must be true | status = 'active' AND age > 18 |
OR | At least one must be true | country = 'US' OR country = 'CA' |
NOT | Inverts the condition | NOT status = 'deleted' |
Truth‑Table Summary
| Expression | Result |
|---|---|
TRUE AND NULL | NULL |
FALSE AND NULL | FALSE |
TRUE OR NULL | TRUE |
FALSE OR NULL | NULL |
NOT NULL | NULL |
NULL = NULL | NULL (not TRUE) |
Tip:
WHERE column = NULLnever works. UseWHERE column IS NULLinstead.
De Morgan’s Laws
NOT (A AND B) = (NOT A) OR (NOT B)NOT (A OR B) = (NOT A) AND (NOT B)
Practical Example
-- Both statements are equivalent
WHERE NOT (status = 'active' AND region = 'EU')
WHERE status != 'active' OR region != 'EU'
Understanding De Morgan’s Laws helps you debug filters that aren’t behaving as expected.
Descriptive Statistics
You don’t need a statistics degree, but you must understand a few descriptive measures.
| Measure | What It Tells You | SQL |
|---|---|---|
| Mean | Average value | AVG(column) |
| Median | Middle value | Varies by DB (e.g., PERCENTILE_CONT(0.5)) |
| Mode | Most frequent value | GROUP BY … ORDER BY COUNT(*) DESC LIMIT 1 |
| Range | Max – Min | MAX(col) - MIN(col) |
| Variance | Spread of data | VAR(column) |
| Standard Deviation | Spread in original units | STDDEV(column) |
When to Use Which
- Mean – general average, but sensitive to outliers.
- Median – better for skewed distributions (e.g., salaries, prices).
- Mode – best for categorical data.
Example
If salaries are {50k, 55k, 60k, 65k, 500k}:
| Statistic | Value | Interpretation |
|---|---|---|
| Mean | 146k | Misleading (outlier drags it up) |
| Median | 60k | More representative of typical salary |
Percentiles (PostgreSQL example)
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY response_time) AS median,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM api_logs;
Percentiles such as P25, P50 (median), P75, and P99 are common for performance metrics.
Aggregations
You’ll write aggregations constantly. Know them cold.
| Function | Purpose |
|---|---|
COUNT | Number of rows |
SUM | Total of values |
AVG | Mean value |
MIN | Smallest value |
MAX | Largest value |
COUNT(DISTINCT) | Unique values |
Group By
GROUP BY partitions your data into sets, then applies aggregations to each set.
SELECT
region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Think of it as: “For each region, calculate these metrics.”
Window Functions
Window functions let you aggregate while keeping individual rows.
SELECT
order_id,
customer_id,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;
PARTITION BY is set theory again — you’re defining subsets.
Probability Basics
You don’t need deep probability theory, but a few basics help.
| Concept | Definition |
|---|---|
| Probability | Likelihood of an event, from 0 to 1 |
| Independent Events | One event doesn’t affect another |
| Conditional Probability | Probability given something else happened |
Typical data‑engineering uses:
- Data‑quality checks: “What percentage of records have missing values?”
- Sampling: “Is this sample representative?”
- A/B testing: “Is this result statistically significant?”
Example: Probability of NULL values in a column
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_count,
ROUND(
SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal
/ COUNT(*), 4
) AS null_probability
FROM customers;
Numeric Types & Pitfalls
Understanding how numbers are stored prevents costly mistakes.
| Type | Use Case | Watch Out For |
|---|---|---|
INTEGER | Counts, IDs | Overflow at limits |
FLOAT | Scientific data | Precision errors |
That’s the practical math you need to be a competent data engineer. Master these concepts, and you’ll find SQL and data pipelines far less intimidating.
DECIMAL
Money, exact values
Slower, more storage
-- This might not equal 0.3
SELECT 0.1 + 0.2;
-- Result: 0.30000000000000004
For financial data, always use DECIMAL.
-- Integer division truncates
SELECT 5 / 2; -- Returns 2, not 2.5
-- Cast to get decimals
SELECT 5::decimal / 2; -- Returns 2.5
Patterns I Use Constantly
-- 1️⃣ Category percentages
SELECT
category,
COUNT(*) AS count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM products
GROUP BY category;
-- 2️⃣ Year‑over‑year growth
SELECT
year,
revenue,
LAG(revenue) OVER (ORDER BY year) AS prev_year,
ROUND(
100.0 * (revenue - LAG(revenue) OVER (ORDER BY year))
/ LAG(revenue) OVER (ORDER BY year), 2
) AS yoy_growth
FROM annual_revenue;
-- 3️⃣ Running total
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;
-- 4️⃣ 7‑day moving average
SELECT
date,
value,
AVG(value) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS seven_day_avg
FROM daily_metrics;
As a Data Engineer, You Can De‑Prioritize
| Topic | Why You Can Skip It |
|---|---|
| Calculus | Leave it to data scientists |
| Linear Algebra | Only needed for ML engineering |
| Advanced Statistics | Beyond descriptive stats, not essential |
| Proofs and Theorems | You’re building, not proving |
If you move toward machine‑learning engineering later, revisit these. For core data engineering, they’re not required.
The Best Way to Learn Math for Data Engineering – Through SQL
- Take a public dataset – Kaggle has plenty.
- Ask questions – “What’s the median order value by region?”
- Write queries – Apply the concepts above.
- Verify results – Do they make logical sense?
When math is tied to real problems, it becomes intuitive.
Core Concepts to Master
- Set theory – Understand it, and SQL makes sense.
- Boolean logic – Every filter depends on it.
- Descriptive statistics – Mean, median, percentiles.
- Aggregations – Your daily tools.
- Precision – Know your data types.
You don’t need to love math. You need to respect it enough to get the fundamentals right. Theory only takes you so far.
What’s Next?
In the next article we’ll put everything together:
- Building your first data pipeline — from concept to working code
- Data Engineering Uncovered: What It Is and Why It Matters
- Pipelines, ETL, and Warehouses: The DNA of Data Engineering
- Tools of the Trade: What Powers Modern Data Engineering
- The Math You Actually Need as a Data Engineer (you’re here)
- Building Your First Pipeline: From Concept to Execution
- Charting Your Path: Courses and Resources to Accelerate Your Journey
Found this helpful? Have questions about applying these concepts? Drop them in the comments.