The Math You Actually Need as a Data Engineer

Published: January 30, 2026 at 10:00 PM EST
6 min read
Source: Dev.to

Introduction

Let’s address the elephant in the room.

“Do I need to be good at math to become a data engineer?”

I’ve heard this question hundreds of times from students and career‑changers. The answer might surprise you:

You need less math than you think — but more than zero.

Data engineering is not data science. You won’t be deriving gradient‑descent formulas or proving statistical theorems. But you will encounter mathematical concepts daily, often without realizing it.

In this article I’ll cover exactly what you need to know — practical mathematics that makes you a better engineer, without the academic overhead.

Let me be direct. Here’s what matters:

| Concept | How Often You'll Use It |
|---|---|
| Set Theory | Daily |
| Boolean Logic | Daily |
| Basic Statistics | Weekly |
| Aggregations & Arithmetic | Daily |
| Probability Basics | Occasionally |
| Linear Algebra | Rarely |
| Calculus | Almost never |

If you struggled with advanced mathematics in school, don’t let that stop you. Most data‑engineering math is intuitive once you see how it applies.


Set Theory

Every SQL query you write is set theory in action. If you understand sets, you understand SQL. It’s that fundamental.

A set is a collection of distinct elements.

Set A = {1, 2, 3, 4, 5}
Set B = {4, 5, 6, 7, 8}
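
The same operations exist on Python's built-in set type, which is a quick way to build intuition before writing the SQL equivalents (sets copied from the example above):

```python
from itertools import product

A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

print(A | B)  # union: every element in A or B
print(A & B)  # intersection: elements in both
print(A - B)  # difference: in A but not B
print(len(list(product(A, B))))  # Cartesian product: 5 x 5 = 25 pairs
```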

These map directly to SQL:

| Set Operation | SQL Equivalent | Result |
|---|---|---|
| Union | UNION | All elements in A or B |
| Intersection | INTERSECT / INNER JOIN | Elements in both A and B |
| Difference | EXCEPT (or LEFT JOIN filtered to IS NULL) | Elements in A but not B |
| Cartesian Product | CROSS JOIN | All combinations of A and B |

Example Queries

-- Union: Combine two sets
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;

-- Intersection: Find common elements
SELECT a.customer_id
FROM online_orders a
INNER JOIN store_orders b ON a.customer_id = b.customer_id;

-- Difference: Find elements only in first set
SELECT customer_id FROM online_orders
EXCEPT
SELECT customer_id FROM store_orders;

Visualising joins as Venn diagrams helps immensely:

  • INNER JOIN – only the overlap
  • LEFT JOIN – all of left + overlap
  • RIGHT JOIN – all of right + overlap
  • FULL OUTER JOIN – everything
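
The difference operation is often written as a LEFT JOIN anti-join instead of EXCEPT. A quick SQLite check (table contents invented for illustration) confirms the two forms agree:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE online_orders (customer_id INTEGER);
CREATE TABLE store_orders  (customer_id INTEGER);
INSERT INTO online_orders VALUES (1), (2), (3);
INSERT INTO store_orders  VALUES (2), (3), (4);
""")

# Difference via EXCEPT
except_rows = con.execute("""
    SELECT customer_id FROM online_orders
    EXCEPT
    SELECT customer_id FROM store_orders
    ORDER BY customer_id
""").fetchall()

# Difference via LEFT JOIN anti-join
anti_join_rows = con.execute("""
    SELECT a.customer_id
    FROM online_orders a
    LEFT JOIN store_orders b ON a.customer_id = b.customer_id
    WHERE b.customer_id IS NULL
    ORDER BY a.customer_id
""").fetchall()

print(except_rows, anti_join_rows)  # only customer 1 is online-only
```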

If set theory clicks, SQL becomes intuitive.

Boolean Logic

Every WHERE clause, IF statement, and CASE expression is Boolean logic.

| Operator | Meaning | Example |
|---|---|---|
| AND | Both must be true | status = 'active' AND age > 18 |
| OR | At least one must be true | country = 'US' OR country = 'CA' |
| NOT | Inverts the condition | NOT status = 'deleted' |

Truth‑Table Summary

| Expression | Result |
|---|---|
| TRUE AND NULL | NULL |
| FALSE AND NULL | FALSE |
| TRUE OR NULL | TRUE |
| FALSE OR NULL | NULL |
| NOT NULL | NULL |
| NULL = NULL | NULL (not TRUE) |

Tip: WHERE column = NULL never matches any rows, because the comparison evaluates to NULL rather than TRUE. Use WHERE column IS NULL instead.
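
The truth table can be verified directly against SQLite, which follows the same three-valued logic (1 and 0 stand in for TRUE and FALSE):

```python
import sqlite3

con = sqlite3.connect(":memory:")

def ev(expr):
    # Evaluate a scalar SQL expression and return the result
    return con.execute(f"SELECT {expr}").fetchone()[0]

print(ev("1 AND NULL"))   # None -> unknown
print(ev("0 AND NULL"))   # 0    -> false short-circuits AND
print(ev("1 OR NULL"))    # 1    -> true short-circuits OR
print(ev("NULL = NULL"))  # None -> NULL is never equal to anything
```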

De Morgan’s Laws

  • NOT (A AND B) = (NOT A) OR (NOT B)
  • NOT (A OR B) = (NOT A) AND (NOT B)

Practical Example

-- Both statements are equivalent
WHERE NOT (status = 'active' AND region = 'EU')
WHERE status != 'active' OR region != 'EU'

Understanding De Morgan’s Laws helps you debug filters that aren’t behaving as expected.
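
Because each law involves only two Boolean inputs, both can be verified exhaustively in a few lines of Python:

```python
from itertools import product

# Check De Morgan's laws over all four TRUE/FALSE combinations
for a, b in product([True, False], repeat=2):
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))
print("De Morgan's laws hold for all Boolean inputs")
```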

Descriptive Statistics

You don’t need a statistics degree, but you must understand a few descriptive measures.

| Measure | What It Tells You | SQL |
|---|---|---|
| Mean | Average value | AVG(column) |
| Median | Middle value | Varies by DB (e.g., PERCENTILE_CONT(0.5)) |
| Mode | Most frequent value | GROUP BY … ORDER BY COUNT(*) DESC LIMIT 1 |
| Range | Max – Min | MAX(col) - MIN(col) |
| Variance | Spread of data | VAR(column) |
| Standard Deviation | Spread in original units | STDDEV(column) |

When to Use Which

  • Mean – general average, but sensitive to outliers.
  • Median – better for skewed distributions (e.g., salaries, prices).
  • Mode – best for categorical data.

Example

If salaries are {50k, 55k, 60k, 65k, 500k}:

| Statistic | Value | Interpretation |
|---|---|---|
| Mean | 146k | Misleading (outlier drags it up) |
| Median | 60k | More representative of typical salary |
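
The salary example is easy to reproduce with Python's standard-library statistics module:

```python
import statistics

# The salary example above, in thousands
salaries = [50, 55, 60, 65, 500]

print(statistics.mean(salaries))    # 146 -- dragged up by the 500k outlier
print(statistics.median(salaries))  # 60  -- the typical salary
```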

Percentiles (PostgreSQL example)

SELECT 
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY response_time) AS median,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM api_logs;

Percentiles such as P25, P50 (median), P75, and P99 are common for performance metrics.
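
Outside the database, Python's statistics.quantiles can compute the same cut points; its "inclusive" method interpolates linearly, roughly matching PERCENTILE_CONT. The response times below are invented for illustration:

```python
import statistics

# Hypothetical response times in ms; any numeric list works
times = list(range(1, 101))

q = statistics.quantiles(times, n=100, method="inclusive")
median, p99 = q[49], q[98]  # the 50th and 99th percentile cut points
print(median, p99)
```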

Aggregations

You’ll write aggregations constantly. Know them cold.

| Function | Purpose |
|---|---|
| COUNT | Number of rows |
| SUM | Total of values |
| AVG | Mean value |
| MIN | Smallest value |
| MAX | Largest value |
| COUNT(DISTINCT) | Unique values |

Group By

GROUP BY partitions your data into sets, then applies aggregations to each set.

SELECT 
    region,
    COUNT(*)               AS order_count,
    SUM(amount)            AS total_revenue,
    AVG(amount)            AS avg_order_value
FROM orders
GROUP BY region;

Think of it as: “For each region, calculate these metrics.”
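
Running the same query against a toy in-memory SQLite table (data invented for illustration) shows one result row per region:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (region TEXT, amount REAL);
INSERT INTO orders VALUES ('EU', 100), ('EU', 200), ('US', 50);
""")

rows = con.execute("""
    SELECT region,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_revenue,
           AVG(amount) AS avg_order_value
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # one aggregated row per region
```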

Window Functions

Window functions let you aggregate while keeping individual rows.

SELECT 
    order_id,
    customer_id,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
    AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;

PARTITION BY is set theory again — you’re defining subsets.
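
A small SQLite demo (toy data; window functions require SQLite 3.25 or newer, bundled with recent Pythons) shows the key property: every row survives, each carrying its partition's aggregate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 1, 10), (2, 1, 30), (3, 2, 50);
""")

rows = con.execute("""
    SELECT order_id,
           customer_id,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
    FROM orders
    ORDER BY order_id
""").fetchall()

for row in rows:
    print(row)  # each row keeps its own amount plus the customer's total
```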

Probability Basics

You don’t need deep probability theory, but a few basics help.

| Concept | Definition |
|---|---|
| Probability | Likelihood of an event, from 0 to 1 |
| Independent Events | One event doesn't affect another |
| Conditional Probability | Probability given something else happened |

Typical data‑engineering uses:

  • Data‑quality checks: “What percentage of records have missing values?”
  • Sampling: “Is this sample representative?”
  • A/B testing: “Is this result statistically significant?”

Example: Probability of NULL values in a column

SELECT 
    COUNT(*)                                            AS total_rows,
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)       AS null_count,
    ROUND(
        SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal 
        / COUNT(*), 4
    )                                                   AS null_probability
FROM customers;
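
SQLite has no ::decimal cast, but the same check works there with AVG over a 0/1 flag, which is the NULL probability directly (toy data, invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (email TEXT);
INSERT INTO customers VALUES ('a@x.com'), (NULL), ('b@x.com'), (NULL);
""")

total, nulls, p = con.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END),
           ROUND(AVG(CASE WHEN email IS NULL THEN 1.0 ELSE 0.0 END), 4)
    FROM customers
""").fetchone()

print(total, nulls, p)  # 2 of 4 emails missing -> probability 0.5
```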

Numeric Types & Pitfalls

Understanding how numbers are stored prevents costly mistakes.

| Type | Use Case | Watch Out For |
|---|---|---|
| INTEGER | Counts, IDs | Overflow at limits |
| FLOAT | Scientific data | Precision errors |
| DECIMAL | Money, exact values | Slower, more storage |

-- This might not equal 0.3
SELECT 0.1 + 0.2;
-- Result: 0.30000000000000004

For financial data, always use DECIMAL.

-- Integer division truncates
SELECT 5 / 2;          -- Returns 2, not 2.5

-- Cast to get decimals
SELECT 5::decimal / 2; -- Returns 2.5

That's the practical math you need to be a competent data engineer. Master these concepts, and you'll find SQL and data pipelines far less intimidating.
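
The same two pitfalls exist in application code; Python's decimal module plays the role of DECIMAL:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly
print(0.1 + 0.2 == 0.3)  # False

# Decimal stores exact base-10 values, so money math is safe
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# Integer (floor) division truncates here too
print(5 // 2)                    # 2
print(Decimal(5) / Decimal(2))   # 2.5
```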

Patterns I Use Constantly

-- 1️⃣ Category percentages
SELECT 
    category,
    COUNT(*) AS count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM products
GROUP BY category;

-- 2️⃣ Year‑over‑year growth
SELECT 
    year,
    revenue,
    LAG(revenue) OVER (ORDER BY year) AS prev_year,
    ROUND(
        100.0 * (revenue - LAG(revenue) OVER (ORDER BY year))
        / LAG(revenue) OVER (ORDER BY year), 2
    ) AS yoy_growth
FROM annual_revenue;

-- 3️⃣ Running total
SELECT 
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;

-- 4️⃣ 7‑day moving average
SELECT 
    date,
    value,
    AVG(value) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS seven_day_avg
FROM daily_metrics;
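
The running-total pattern can be sanity-checked against a toy SQLite table (dates and amounts invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transactions (date TEXT, amount REAL);
INSERT INTO transactions VALUES
    ('2026-01-01', 10), ('2026-01-02', 20), ('2026-01-03', 5);
""")

rows = con.execute("""
    SELECT date, amount,
           SUM(amount) OVER (ORDER BY date) AS running_total
    FROM transactions
    ORDER BY date
""").fetchall()

print(rows)  # running_total accumulates: 10, 30, 35
```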

As a Data Engineer, You Can De‑Prioritize

| Topic | Why You Can Skip It |
|---|---|
| Calculus | Leave it to data scientists |
| Linear Algebra | Only needed for ML engineering |
| Advanced Statistics | Beyond descriptive stats, not essential |
| Proofs and Theorems | You're building, not proving |

If you move toward machine‑learning engineering later, revisit these. For core data engineering, they’re not required.

The Best Way to Learn Math for Data Engineering – Through SQL

  1. Take a public dataset – Kaggle has plenty.
  2. Ask questions – “What’s the median order value by region?”
  3. Write queries – Apply the concepts above.
  4. Verify results – Do they make logical sense?

When math is tied to real problems, it becomes intuitive.

Core Concepts to Master

  • Set theory – Understand it, and SQL makes sense.
  • Boolean logic – Every filter depends on it.
  • Descriptive statistics – Mean, median, percentiles.
  • Aggregations – Your daily tools.
  • Precision – Know your data types.

You don’t need to love math. You need to respect it enough to get the fundamentals right, then apply them to real data; theory only takes you so far.

What’s Next?

In the next article we’ll put everything together and build your first data pipeline, from concept to working code.

The full series:

  • Data Engineering Uncovered: What It Is and Why It Matters
  • Pipelines, ETL, and Warehouses: The DNA of Data Engineering
  • Tools of the Trade: What Powers Modern Data Engineering
  • The Math You Actually Need as a Data Engineer (you’re here)
  • Building Your First Pipeline: From Concept to Execution
  • Charting Your Path: Courses and Resources to Accelerate Your Journey

Found this helpful? Have questions about applying these concepts? Drop them in the comments.
