The Math You Actually Need as a Data Engineer

Published: 3 hours ago (January 30, 2026 at 10:00 PM EST)

6 min read

Source: Dev.to

Source: Dev.to

Introduction

Let’s address the elephant in the room.

“Do I need to be good at math to become a data engineer?”

I’ve heard this question hundreds of times from students and career‑changers. The answer might surprise you:

You need less math than you think — but more than zero.

Data engineering is not data science. You won’t be deriving gradient‑descent formulas or proving statistical theorems. But you will encounter mathematical concepts daily, often without realizing it.

In this article I’ll cover exactly what you need to know — practical mathematics that makes you a better engineer, without the academic overhead.

Let me be direct. Here’s what matters:

Concept	How Often You’ll Use It
Set Theory	Daily
Boolean Logic	Daily
Basic Statistics	Weekly
Aggregations & Arithmetic	Daily
Probability Basics	Occasionally
Linear Algebra	Rarely
Calculus	Almost never

If you struggled with advanced mathematics in school, don’t let that stop you. Most data‑engineering math is intuitive once you see how it applies.

Set Theory

Every SQL query you write is set theory in action. If you understand sets, you understand SQL. It’s that fundamental.

A set is a collection of distinct elements.

Set A = {1, 2, 3, 4, 5}
Set B = {4, 5, 6, 7, 8}

These map directly to SQL:

Set Operation	SQL Equivalent	Result
Union	`UNION`	All elements in A or B
Intersection	`INNER JOIN`	Elements in both A and B
Difference	`EXCEPT` / `LEFT JOIN`	Elements in A but not B
Cartesian Product	`CROSS JOIN`	All combinations of A and B

Example Queries

-- Union: Combine two sets
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;

-- Intersection: Find common elements
SELECT a.customer_id
FROM online_orders a
INNER JOIN store_orders b ON a.customer_id = b.customer_id;

-- Difference: Find elements only in first set
SELECT customer_id FROM online_orders
EXCEPT
SELECT customer_id FROM store_orders;

Visualising joins as Venn diagrams helps immensely:

INNER JOIN – only the overlap
LEFT JOIN – all of left + overlap
RIGHT JOIN – all of right + overlap
FULL OUTER JOIN – everything

If set theory clicks, SQL becomes intuitive.

Boolean Logic

Every WHERE clause, IF statement, and CASE expression is Boolean logic.

Operator	Meaning	Example
`AND`	Both must be true	`status = 'active' AND age > 18`
`OR`	At least one must be true	`country = 'US' OR country = 'CA'`
`NOT`	Inverts the condition	`NOT status = 'deleted'`

Truth‑Table Summary

Expression	Result
`TRUE AND NULL`	`NULL`
`FALSE AND NULL`	`FALSE`
`TRUE OR NULL`	`TRUE`
`FALSE OR NULL`	`NULL`
`NOT NULL`	`NULL`
`NULL = NULL`	`NULL` (not `TRUE`)

Tip: WHERE column = NULL never works. Use WHERE column IS NULL instead.

De Morgan’s Laws

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)

Practical Example

-- Both statements are equivalent
WHERE NOT (status = 'active' AND region = 'EU')
WHERE status != 'active' OR region != 'EU'

Understanding De Morgan’s Laws helps you debug filters that aren’t behaving as expected.

Descriptive Statistics

You don’t need a statistics degree, but you must understand a few descriptive measures.

Measure	What It Tells You	SQL
Mean	Average value	`AVG(column)`
Median	Middle value	Varies by DB (e.g., `PERCENTILE_CONT(0.5)`)
Mode	Most frequent value	`GROUP BY … ORDER BY COUNT(*) DESC LIMIT 1`
Range	Max – Min	`MAX(col) - MIN(col)`
Variance	Spread of data	`VAR(column)`
Standard Deviation	Spread in original units	`STDDEV(column)`

When to Use Which

Mean – general average, but sensitive to outliers.
Median – better for skewed distributions (e.g., salaries, prices).
Mode – best for categorical data.

Example

If salaries are {50k, 55k, 60k, 65k, 500k}:

Statistic	Value	Interpretation
Mean	146k	Misleading (outlier drags it up)
Median	60k	More representative of typical salary

Percentiles (PostgreSQL example)

SELECT 
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY response_time) AS median,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM api_logs;

Percentiles such as P25, P50 (median), P75, and P99 are common for performance metrics.

Aggregations

You’ll write aggregations constantly. Know them cold.

Function	Purpose
`COUNT`	Number of rows
`SUM`	Total of values
`AVG`	Mean value
`MIN`	Smallest value
`MAX`	Largest value
`COUNT(DISTINCT)`	Unique values

Group By

GROUP BY partitions your data into sets, then applies aggregations to each set.

SELECT 
    region,
    COUNT(*)               AS order_count,
    SUM(amount)            AS total_revenue,
    AVG(amount)            AS avg_order_value
FROM orders
GROUP BY region;

Think of it as: “For each region, calculate these metrics.”

Window Functions

Window functions let you aggregate while keeping individual rows.

SELECT 
    order_id,
    customer_id,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
    AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;

PARTITION BY is set theory again — you’re defining subsets.

Probability Basics

You don’t need deep probability theory, but a few basics help.

Concept	Definition
Probability	Likelihood of an event, from 0 to 1
Independent Events	One event doesn’t affect another
Conditional Probability	Probability given something else happened

Typical data‑engineering uses:

Data‑quality checks: “What percentage of records have missing values?”
Sampling: “Is this sample representative?”
A/B testing: “Is this result statistically significant?”

Example: Probability of NULL values in a column

SELECT 
    COUNT(*)                                            AS total_rows,
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)       AS null_count,
    ROUND(
        SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal 
        / COUNT(*), 4
    )                                                   AS null_probability
FROM customers;

Numeric Types & Pitfalls

Understanding how numbers are stored prevents costly mistakes.

Type	Use Case	Watch Out For
`INTEGER`	Counts, IDs	Overflow at limits
`FLOAT`	Scientific data	Precision errors

That’s the practical math you need to be a competent data engineer. Master these concepts, and you’ll find SQL and data pipelines far less intimidating.

DECIMAL

Money, exact values
Slower, more storage

-- This might not equal 0.3
SELECT 0.1 + 0.2;
-- Result: 0.30000000000000004

For financial data, always use DECIMAL.

-- Integer division truncates
SELECT 5 / 2;          -- Returns 2, not 2.5

-- Cast to get decimals
SELECT 5::decimal / 2; -- Returns 2.5

Patterns I Use Constantly

-- 1️⃣ Category percentages
SELECT 
    category,
    COUNT(*) AS count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM products
GROUP BY category;

-- 2️⃣ Year‑over‑year growth
SELECT 
    year,
    revenue,
    LAG(revenue) OVER (ORDER BY year) AS prev_year,
    ROUND(
        100.0 * (revenue - LAG(revenue) OVER (ORDER BY year))
        / LAG(revenue) OVER (ORDER BY year), 2
    ) AS yoy_growth
FROM annual_revenue;

-- 3️⃣ Running total
SELECT 
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;

-- 4️⃣ 7‑day moving average
SELECT 
    date,
    value,
    AVG(value) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS seven_day_avg
FROM daily_metrics;

As a Data Engineer, You Can De‑Prioritize

Topic	Why You Can Skip It
Calculus	Leave it to data scientists
Linear Algebra	Only needed for ML engineering
Advanced Statistics	Beyond descriptive stats, not essential
Proofs and Theorems	You’re building, not proving

If you move toward machine‑learning engineering later, revisit these. For core data engineering, they’re not required.

The Best Way to Learn Math for Data Engineering – Through SQL

Take a public dataset – Kaggle has plenty.
Ask questions – “What’s the median order value by region?”
Write queries – Apply the concepts above.
Verify results – Do they make logical sense?

When math is tied to real problems, it becomes intuitive.

Core Concepts to Master

Set theory – Understand it, and SQL makes sense.
Boolean logic – Every filter depends on it.
Descriptive statistics – Mean, median, percentiles.
Aggregations – Your daily tools.
Precision – Know your data types.

You don’t need to love math. You need to respect it enough to get the fundamentals right. Theory only takes you so far.

What’s Next?

In the next article we’ll put everything together:

Building your first data pipeline — from concept to working code
Data Engineering Uncovered: What It Is and Why It Matters
Pipelines, ETL, and Warehouses: The DNA of Data Engineering
Tools of the Trade: What Powers Modern Data Engineering
The Math You Actually Need as a Data Engineer (you’re here)
Building Your First Pipeline: From Concept to Execution
Charting Your Path: Courses and Resources to Accelerate Your Journey

Found this helpful? Have questions about applying these concepts? Drop them in the comments.

The Math You Actually Need as a Data Engineer

Introduction

Set Theory

Example Queries

Boolean Logic

Truth‑Table Summary

De Morgan’s Laws

Descriptive Statistics

When to Use Which

Percentiles (PostgreSQL example)

Aggregations

Group By

Window Functions

Probability Basics

Numeric Types & Pitfalls

DECIMAL

Patterns I Use Constantly

As a Data Engineer, You Can De‑Prioritize

The Best Way to Learn Math for Data Engineering – Through SQL

Core Concepts to Master

What’s Next?

Related posts

Optimizing Test Account Management During High Traffic Events with SQL Security Techniques

Battling Spam Traps in Email Campaigns with JavaScript on a Zero-Budget

Marketing Mastery: From Beginner to Pro (Part 1)

Overcoming Geolocation Restrictions in Microservices with Cybersecurity Strategies

Introduction

Set Theory

Example Queries

Boolean Logic

Truth‑Table Summary

De Morgan’s Laws

Descriptive Statistics

When to Use Which

Percentiles (PostgreSQL example)

Aggregations

Group By

Window Functions

Probability Basics

Numeric Types & Pitfalls

DECIMAL

Patterns I Use Constantly

As a Data Engineer, You Can De‑Prioritize

The Best Way to Learn Math for Data Engineering – Through SQL

Core Concepts to Master

What’s Next?

Related posts

Optimizing Test Account Management During High Traffic Events with SQL Security Techniques

Battling Spam Traps in Email Campaigns with JavaScript on a Zero-Budget

Marketing Mastery: From Beginner to Pro (Part 1)

Overcoming Geolocation Restrictions in Microservices with Cybersecurity Strategies

De Morgan’s Laws