Part 2: BigQuery Deep Dive
Source: Dev.to
What is BigQuery?
BigQuery is Google's fully managed, serverless data warehouse in the cloud. It's a popular choice for storing and analyzing massive datasets because it is:
| Feature | Description |
|---|---|
| Serverless | No servers to manage: no software installation, disk-space worries, or maintenance. Google handles everything. |
| Fully managed | Google takes care of security, backups, scaling, and updates. |
| Petabyte-scale | Handles truly huge datasets (1 PB = 1,000 TB = 1,000,000 GB). |
| SQL-based | Write standard SQL queries; no new programming language to learn. |
Why BigQuery is Great for Beginners
- **No setup headaches**: create a project, load data, start querying.
- **Free tier**: 1 TB of queries and 10 GB of storage free each month.
- **Familiar SQL**: if you know basic SQL, you can use BigQuery.
- **Works with everything**: Google Sheets, Data Studio, Python, R, etc.
- **Built-in ML**: train machine-learning models using just SQL.
Understanding the architecture helps you write better queries and save money. Don't worry, we'll keep it simple!
BigQuery Architecture (HighβLevel)
Traditional databases store data and process queries on the same machine.
BigQuery separates the two:
```
┌─────────────────────────────────────────┐
│             YOUR SQL QUERY              │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│         DREMEL (Compute Engine)         │
│  • Query is broken into tiny pieces     │
│  • Thousands of workers run in parallel │
└────────────────────┬────────────────────┘
                     │
                     │  Jupiter Network (super-fast!)
                     │  1 TB per second
                     ▼
┌─────────────────────────────────────────┐
│            COLOSSUS (Storage)           │
│  • Columnar storage (organized          │
│    by columns)                          │
└─────────────────────────────────────────┘
```
Row-oriented vs. column-oriented storage
Traditional (row-oriented) table
| Row | Data |
|---|---|
| 1 | [John, 25, New York, $50,000] |
| 2 | [Jane, 30, Chicago, $60,000] |
| 3 | [Bob, 35, Miami, $55,000] |
To retrieve all salaries, the engine reads every row, even though only the salary column is needed.
BigQuery (column-oriented) table
| Column | Values |
|---|---|
| Names | [John, Jane, Bob] |
| Ages | [25, 30, 35] |
| Cities | [New York, Chicago, Miami] |
| Salaries | [$50,000, $60,000, $55,000] |
To retrieve all salaries, only the Salaries column is read, which is far faster and cheaper.
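A toy Python sketch (not BigQuery internals, just the idea) makes the difference concrete: the row layout forces us to touch every record to pull one field, while the column layout hands us the salary list directly. The data mirrors the tables above.

```python
# Row-oriented layout: one tuple per record, as in the "Traditional" table.
rows = [
    ("John", 25, "New York", 50_000),
    ("Jane", 30, "Chicago", 60_000),
    ("Bob", 35, "Miami", 55_000),
]
# To get salaries we must visit every row (and, on disk, read every field).
salaries_from_rows = [row[3] for row in rows]

# Column-oriented layout: each column stored together, as in the BigQuery table.
columns = {
    "name":   ["John", "Jane", "Bob"],
    "age":    [25, 30, 35],
    "city":   ["New York", "Chicago", "Miami"],
    "salary": [50_000, 60_000, 55_000],
}
# To get salaries we read exactly one list and ignore the rest.
salaries_from_columns = columns["salary"]

assert salaries_from_rows == salaries_from_columns == [50_000, 60_000, 55_000]
```

The result is the same either way; what changes is how much data has to be read to produce it.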
💡 Tip: `SELECT *` is expensive in BigQuery because it forces a scan of every column. Always specify only the columns you need.
Query Execution Flow
1. The Root Server receives your query.
2. The query is broken into smaller pieces.
3. Mixers distribute work to thousands of Leaf Nodes.
4. Each Leaf Node processes a small data chunk in parallel.
5. Results flow back up through the Mixers to the Root Server.
6. You receive the final result.
```
              ┌──────────┐
              │   ROOT   │  ← Your query arrives here
              └────┬─────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
    ┌───────┐  ┌───────┐  ┌───────┐
    │ MIXER │  │ MIXER │  │ MIXER │
    └───┬───┘  └───┬───┘  └───┬───┘
        │          │          │
     ┌──┼──┐    ┌──┼──┐    ┌──┼──┐
     ▼  ▼  ▼    ▼  ▼  ▼    ▼  ▼  ▼
    [L][L][L]  [L][L][L]  [L][L][L]
```
L = Leaf nodes (thousands of them!)
Why it matters: A query that would take hours on a laptop can finish in seconds because thousands of machines work on it simultaneously.
Working with Data in BigQuery
1. External Tables (data stays in Google Cloud Storage)
```sql
-- Create an external table that points to Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/taxi_data/*.parquet']
);
```
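Once created, an external table is queried like any other table (the project, dataset, and table names here are the illustrative ones from the snippet above):

```sql
-- Each query reads the Parquet files from GCS on the fly
SELECT COUNT(*) AS trip_count
FROM `my-project.my_dataset.taxi_external`;
```

Note that because the data lives in GCS, BigQuery cannot tell you up front how many bytes this will scan.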
When to use external tables
| ✅ | Reason |
|---|---|
| ✅ | Save on storage costs (GCS is cheaper than BigQuery storage) |
| ✅ | One-time or occasional analysis |
| ✅ | Source data updates frequently |
| ✅ | Quick exploration before committing to a load |
Downsides
| ❌ | Issue |
|---|---|
| ❌ | Slower queries (data is read from GCS each time) |
| ❌ | No cost estimation before running queries |
| ❌ | Cannot partition or cluster (limited optimization) |
2. Native (internal) Tables (data copied into BigQuery storage)
```sql
-- Load data from an external table into a native BigQuery table
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native` AS
SELECT * FROM `my-project.my_dataset.taxi_external`;
```
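If you already know which columns you need, you can trim the table at load time and get partitioning for free. A sketch (the column names are illustrative, and partitioning itself is covered properly in Part 3):

```sql
-- Load only the needed columns, partitioned by pickup date
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native_partitioned`
PARTITION BY DATE(pickup_time) AS
SELECT pickup_time, dropoff_time, fare_amount
FROM `my-project.my_dataset.taxi_external`;
```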
When to use native tables
| ✅ | Reason |
|---|---|
| ✅ | Frequently queried data |
| ✅ | Best query performance |
| ✅ | Ability to partition and cluster |
| ✅ | Accurate cost estimates before running queries |
Downsides
| ❌ | Issue |
|---|---|
| ❌ | Higher storage costs |
| ❌ | Data duplication (exists in both GCS and BigQuery) |
💡 Pro tip: Start with external tables for exploration, then load into native tables once you know which data you actually need.
Pricing Models
| Model | Description | Ideal For |
|---|---|---|
| On-Demand ($5 per TB of data scanned) | Pay only for the data your queries read. | Occasional users, unpredictable workloads. |
| Flat-Rate (e.g., ~$2,000/month for 100 "slots", i.e. compute units) | Pay for dedicated compute capacity; unlimited queries within slot capacity. | Heavy users, predictable workloads. |
Before you run a query, BigQuery shows an estimate of how much data will be scanned.
```
┌─────────────────────────────────────────┐
│  Query Editor                           │
│  ───────────────────────────────────    │
│  SELECT * FROM my_table                 │
│  WHERE date = '2024-01-01'              │
│                                         │
│  Estimated bytes processed: 12.3 GB     │
└─────────────────────────────────────────┘
```
TL;DR
- BigQuery = serverless, fully managed, petabyte-scale, SQL-based data warehouse.
- Architecture separates compute (Dremel) from columnar storage (Colossus).
- External tables = cheap, flexible, slower.
- Native tables = fast, feature-rich, more expensive.
- Choose on-demand pricing for occasional use, flat-rate for heavy, predictable workloads.
Happy querying!
Cost-Saving Tips
Before you run anything, glance at the validator message in the console, e.g. "This query will process 2.5 GB when run". Check this every time!
Cost calculation:
- 2.5 GB = 0.0025 TB
- 0.0025 TB × $5 = $0.0125 (about 1 cent)

But if you run that query 100 times a day, the costs add up!
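The same arithmetic as a tiny Python helper, using the article's $5-per-TB on-demand rate and decimal units (a sketch for building intuition, not an official billing calculator):

```python
PRICE_PER_TB = 5.00  # on-demand rate used in this article, $ per TB scanned

def query_cost_usd(gb_scanned: float) -> float:
    """Cost of a single on-demand query that scans `gb_scanned` GB."""
    return gb_scanned / 1_000 * PRICE_PER_TB

one_run = query_cost_usd(2.5)    # about $0.0125, roughly a cent
per_month = one_run * 100 * 30   # 100 runs/day for 30 days: roughly $37.50
print(f"one run: ${one_run:.4f}, per month: ${per_month:.2f}")
```

A cent per run sounds harmless; tens of dollars a month for one repeated query is why the tips below matter.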
Never use `SELECT *` unless you absolutely need every column.
✅ Good vs. ❌ Bad
```sql
-- ❌ Bad: reads ALL columns
SELECT * FROM taxi_data;

-- ✅ Good: reads only what you need
SELECT pickup_time, dropoff_time, fare_amount
FROM taxi_data;
```
- Use partitioned tables (covered in Part 3).
- Preview before running: always check the estimated bytes.
LIMIT wisely
`LIMIT` does not reduce the amount of data scanned; the filtering happens after the table is read.
```sql
-- ❌ Still scans the whole table!
SELECT * FROM huge_table LIMIT 10;

-- ✅ Better: add a WHERE clause first
SELECT *
FROM huge_table
WHERE date = CURRENT_DATE()
LIMIT 10;
```
Caching
- BigQuery caches query results for 24 hours (free!).
- When you run the same query twice:
  - First run: scans data, incurs cost.
  - Second run: returns the cached result, FREE.
Cache is invalidated when:
- Underlying table data changes.
- 24 hours have passed.
- You disable caching in query settings.
Tags
#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing