Module 3 Summary - Data Warehousing & BigQuery
Source: Dev.to
Understanding Data Warehouses & OLAP vs OLTP π’
Why Do We Need Data Warehouses? π€
Imagine you run an online store. Your website has a database that handles:
- Customer signβups
- Product orders
- Payment processing
- Inventory updates
This database needs to be FAST because customers are waiting. Every millisecond counts!
Now, your boss asks: βWhat were our topβselling products last year by region, and how did that compare to the year before?β
Running that query on your production database would:
- Slow down your website (bad for customers!)
- Take forever because the database isnβt designed for such complex analysis
- Potentially crash things if the query is too heavy
This is exactly why data warehouses exist! Theyβre a separate place to store your data, specifically designed for answering complex analytical questions without affecting your live applications.
OLTP vs OLAP β The Two Worlds of Databases
| Aspect | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| What itβs for | Running your app β orders, logins, updates | Answering business questions β reports, dashboards |
| Type of queries | Simple: βGet user #123βs infoβ | Complex: βShow sales trends by region for 5 yearsβ |
| Speed | Super fast for small operations | Can take minutes for huge analyses |
| Data freshness | Realβtime, always upβtoβdate | Usually updated daily/hourly (batch) |
| Data organization | Normalized (many tables, no duplicates) | Denormalized (fewer tables, some duplication OK) |
| Data size | Gigabytes (current data) | Terabytes/Petabytes (years of history) |
| Who uses it | Your application, customers | Data analysts, managers, executives |
| Examples | MySQL, PostgreSQL for your website/app | BigQuery, Snowflake, Amazon Redshift |
Real-World Example π
OLTP scenario (your app database)
-- A customer places an order β needs to be FAST
INSERT INTO orders (customer_id, product_id, quantity, price)
VALUES (123, 456, 2, 29.99);
OLAP scenario (data warehouse)
-- Your CEO wants to know Q4 performance β can take a minute, thatβs fine
SELECT
region,
product_category,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM sales_data
WHERE order_date BETWEEN '2023-10-01' AND '2023-12-31'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
π‘ Key insight: OLTP is like a cashier at a store β fast, handles one customer at a time. OLAP is like the accounting department β takes time to analyze all the receipts and produce reports.
What Exactly is a Data Warehouse? ποΈ
A data warehouse is a centralized repository where you collect data from all your different systems and store it in a way thatβs optimized for analysis.
Think of it like this:
Imagine a company with multiple departments:
- Sales team uses Salesforce
- Marketing uses HubSpot
- Website runs on PostgreSQL
- Inventory managed in SAP
Each system has its own database, but your CEO wants a report combining data from all of them. This is where a data warehouse comes in!
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Salesforce β β HubSpot β β PostgreSQL β
β (Sales) β β (Marketing) β β (Website) β
ββββββββ¬βββββββ ββββββββ¬βββββββ ββββββββ¬βββββββ
β β β
β ETL/ELT β β
β (Extract, β β
β Transform, β β
β Load) β β
βΌ βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA WAREHOUSE (BigQuery) β
β All your data, cleaned, organized, ready β
β for analysis! β
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββ
β Reports, Dashboards, β
β Machine Learning, etc. β
ββββββββββββββββββββββββββββ
Key characteristics of a data warehouse
- π Subjectβoriented β Organized by business topics (sales, customers, products)
- π Integrated β Data from multiple sources combined together
- π Timeβvariant β Keeps historical data (years worth!)
- π Nonβvolatile β Data doesnβt change once loaded (itβs a historical record)
Modern Cloud Data Warehouses βοΈ
Traditional data warehouses (like Oracle, Teradata) required:
- Buying expensive hardware
- Hiring DBAs to manage servers
- Months of setup time
- Huge upfront costs
Modern cloud data warehouses (BigQuery, Snowflake, Redshift) changed everything:
- β No servers to manage (serverless)
- β Pay only for what you use
- β Scales automatically
- β Set up in minutes
- β Access from anywhere