Part 2: BigQuery Deep Dive πŸ”

Published: February 9, 2026 at 06:08 PM EST
7 min read
Source: Dev.to

What is BigQuery?

BigQuery is Google’s fully managed, serverless data warehouse in the cloud. It’s a popular choice for storing and analyzing massive datasets because it is:

  • Serverless – No servers to manage: no software installation, disk‑space worries, or maintenance. Google handles everything.
  • Fully managed – Google takes care of security, backups, scaling, and updates.
  • Petabyte‑scale – Handles truly huge datasets (1β€―PB = 1β€―000β€―TB = 1β€―000β€―000β€―GB).
  • SQL‑based – Write standard SQL queries; no new programming language to learn.

Why BigQuery is Great for Beginners 🌟

  • ☁️ No setup headaches – create a project, load data, start querying.
  • πŸ’° Free tier – 1β€―TB of queries and 10β€―GB of storage free each month.
  • πŸ“Š Familiar SQL – if you know basic SQL, you can use BigQuery.
  • πŸ”— Works with everything – Google Sheets, Data Studio, Python, R, etc.
  • πŸ€– Built‑in ML – train machine‑learning models using just SQL.

Understanding the architecture helps you write better queries and save money. Don’t worry – we’ll keep it simple!


BigQuery Architecture (High‑Level)

Traditional databases store data and process queries on the same machine.
BigQuery separates the two:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            YOUR SQL QUERY               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          DREMEL (Query Engine)         β”‚
β”‚  – Query is broken into tiny pieces    β”‚
β”‚  – Thousands of workers run in parallelβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β”‚  Jupiter Network (super‑fast!)
                      β”‚  ~1β€―Pb/s bisection bandwidth
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            COLOSSUS (Storage)          β”‚
β”‚  – Columnar storage (column-oriented)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Row‑oriented vs. Column‑oriented storage

Traditional (row‑oriented) table

  Row 1 β†’ [John, 25, Newβ€―York, $50β€―000]
  Row 2 β†’ [Jane, 30, Chicago, $60β€―000]
  Row 3 β†’ [Bob, 35, Miami, $55β€―000]

To retrieve all salaries, the engine reads every row, even though only the salary column is needed.

BigQuery (column‑oriented) table

  Names    β†’ [John, Jane, Bob]
  Ages     β†’ [25, 30, 35]
  Cities   β†’ [Newβ€―York, Chicago, Miami]
  Salaries β†’ [$50β€―000, $60β€―000, $55β€―000]

To retrieve all salaries, only the Salaries column is read β†’ far faster and cheaper.

πŸ’‘ Tip: SELECT * is expensive in BigQuery because it forces a scan of every column. Always specify only the columns you need.
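The row-versus-column difference can be sketched in a few lines of Python. This is a toy model (reusing the names and salaries from the tables above; nothing BigQuery-specific) that counts how many individual values each layout must touch to sum the salaries:

```python
# Toy comparison of row-oriented vs column-oriented layouts.
# We count how many values must be read to compute the salary total.

rows = [
    ("John", 25, "New York", 50_000),
    ("Jane", 30, "Chicago", 60_000),
    ("Bob", 35, "Miami", 55_000),
]

# Row-oriented: every field of every row is read to reach the salaries.
values_read_row = sum(len(r) for r in rows)   # 12 values touched
total_row = sum(r[3] for r in rows)

# Column-oriented: the same data stored as one list per column.
columns = {
    "names":    [r[0] for r in rows],
    "ages":     [r[1] for r in rows],
    "cities":   [r[2] for r in rows],
    "salaries": [r[3] for r in rows],
}
values_read_col = len(columns["salaries"])    # only 3 values touched
total_col = sum(columns["salaries"])

print(values_read_row, values_read_col, total_row == total_col)  # 12 3 True
```

Same answer either way, but the columnar layout reads a quarter of the data, which is exactly why scanning fewer columns in BigQuery costs less.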


Query Execution Flow

  1. Root Server receives your query.
  2. The query is broken into smaller pieces.
  3. Mixers distribute work to thousands of Leaf Nodes.
  4. Each Leaf Node processes a small data chunk in parallel.
  5. Results flow back up through Mixers to the Root Server.
  6. You receive the final result.
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   ROOT   β”‚   ← Your query arrives here
                    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                         β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β–Ό              β–Ό               β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”
      β”‚ MIXER β”‚      β”‚ MIXER β”‚       β”‚ MIXER β”‚
      β””β”€β”€β”€β”¬β”€β”€β”€β”˜      β””β”€β”€β”€β”¬β”€β”€β”€β”˜       β””β”€β”€β”€β”¬β”€β”€β”€β”˜
          β”‚              β”‚               β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”
   β–Ό      β–Ό       β–Ό β–Ό    β–Ό       β–Ό β–Ό     β–Ό       β–Ό
  [L]    [L]     [L] [L]  [L]    [L] [L]   [L]    [L]

L = Leaf nodes (thousands of them!)

Why it matters: A query that would take hours on a laptop can finish in seconds because thousands of machines work on it simultaneously.
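The scatter-gather tree above can be simulated in a few lines of Python. This is purely illustrative (the function names and the grouping are made up for the sketch, not the real Dremel protocol): leaves aggregate their own chunk, mixers combine leaf results, and the root produces the final answer.

```python
# Toy simulation of tree execution for a SUM aggregation.

def leaf(chunk):
    """Each leaf node aggregates only its own small chunk of data."""
    return sum(chunk)

def mixer(partials):
    """Mixers combine the partial results from the leaves below them."""
    return sum(partials)

def root(mixer_results):
    """The root combines the mixer results into the final answer."""
    return sum(mixer_results)

data = list(range(1, 91))  # pretend this is a huge table

# Split the data across 9 leaves, grouped 3 apiece under 3 mixers.
chunks = [data[i:i + 10] for i in range(0, len(data), 10)]
leaf_results = [leaf(c) for c in chunks]
mixer_results = [mixer(leaf_results[i:i + 3]) for i in range(0, 9, 3)]
answer = root(mixer_results)

print(answer == sum(data))  # True: the tree computes the same sum
```

Because every leaf works on an independent chunk, the leaf step parallelizes across as many machines as there are chunks; that is the whole trick behind the speedup.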


Working with Data in BigQuery

1. External Tables (data stays in Google Cloud Storage)

-- Create an external table that points to Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/taxi_data/*.parquet']
);

When to use external tables

  βœ… Save on storage costs (GCS is cheaper than BigQuery storage)
  βœ… One‑time or occasional analysis
  βœ… Source data updates frequently
  βœ… Quick exploration before committing to a load

Downsides

  ❌ Slower queries (data is read from GCS each time)
  ❌ No cost estimation before running queries
  ❌ Cannot partition or cluster (limited optimization)

2. Native (internal) Tables (data copied into BigQuery storage)

-- Load data from an external table into a native BigQuery table
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native` AS
SELECT * FROM `my-project.my_dataset.taxi_external`;

When to use native tables

  βœ… Frequently queried data
  βœ… Best query performance
  βœ… Ability to partition and cluster
  βœ… Accurate cost estimates before running queries

Downsides

  ❌ Higher storage costs
  ❌ Data duplication (exists in both GCS and BigQuery)

πŸ’‘ Pro tip: Start with external tables for exploration, then load into native tables once you know which data you actually need.


Pricing Models

  • On‑Demand – $5 per TB of data scanned. You pay only for the data your queries read. Ideal for occasional users and unpredictable workloads.
  • Flat‑Rate – e.g., ~$2β€―000/month for 100 β€œslots” (compute units). You pay for dedicated compute capacity and run unlimited queries within it. Ideal for heavy users and predictable workloads.
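Using the article’s illustrative figures ($5 per TB on‑demand, ~$2β€―000/month flat‑rate), a quick back‑of‑the‑envelope break‑even can be sketched in Python. Real prices vary by region and over time, so treat the constants as placeholders:

```python
# Rough break-even between on-demand and flat-rate pricing.
# Constants are the article's illustrative numbers, not current list prices.

ON_DEMAND_PER_TB = 5.0       # dollars per TB scanned
FLAT_RATE_MONTHLY = 2_000.0  # dollars per month for 100 slots

break_even_tb = FLAT_RATE_MONTHLY / ON_DEMAND_PER_TB
print(break_even_tb)  # 400.0
```

In other words: scan more than roughly 400β€―TB per month and the flat‑rate plan starts paying for itself.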

Before you run a query, BigQuery shows an estimate of how much data will be scanned.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Query Editor                           β”‚
β”‚ ────────────────────────────────────── β”‚
β”‚ SELECT * FROM my_table                 β”‚
β”‚ WHERE date = '2024-01-01'              β”‚
β”‚                                        β”‚
β”‚ Estimated bytes processed: 12.3β€―GB     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

TL;DR

  • BigQuery = serverless, fully managed, petabyte‑scale, SQL‑based data warehouse.
  • Architecture separates compute (Dremel) from columnar storage (Colossus).
  • External tables = cheap, flexible, slower.
  • Native tables = fast, feature‑rich, more expensive.
  • Choose on‑demand pricing for occasional use, flat‑rate for heavy, predictable workloads.

Happy querying! πŸš€

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  [This query will process 2.5β€―GB when run]     β”‚ ← Check this!
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Cost calculation

  • 2.5β€―GB = 0.0025β€―TB
  • 0.0025β€―TB Γ— $5 = $0.0125 (about 1β€―cent)

But if you run that query 100β€―times a day… the costs add up!
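The arithmetic above generalizes to a tiny helper. This sketch assumes the article’s $5/TB on‑demand rate and decimal units (1β€―TB = 1β€―000β€―GB):

```python
# Per-query cost estimate at the on-demand rate used in this article.

def on_demand_cost(gb_scanned: float, price_per_tb: float = 5.0) -> float:
    """Estimated query cost in dollars for a given scan size in GB."""
    return (gb_scanned / 1_000) * price_per_tb

print(on_demand_cost(2.5))        # β‰ˆ 0.0125 -> about a cent per run
print(on_demand_cost(2.5) * 100)  # β‰ˆ 1.25   -> ~$1.25/day at 100 runs
```

A cent per run sounds harmless; $1.25 a day, every day, across dozens of dashboards is how surprise bills happen.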

Never use SELECT * unless you absolutely need every column.


βœ… Good vs. ❌ Bad

-- ❌ Bad – reads ALL columns
SELECT * FROM taxi_data;

-- βœ… Good – reads only what you need
SELECT pickup_time, dropoff_time, fare_amount
FROM taxi_data;
More ways to save:

  • Use partitioned tables (covered in Partβ€―3).
  • Preview before running – always check the estimated bytes.

LIMIT wisely

LIMIT does not reduce the amount of data scanned; the filtering happens after the table is read.

-- ❌ Still scans the whole table!
SELECT * FROM huge_table LIMIT 10;

-- βœ… Better – add a WHERE clause first
SELECT *
FROM huge_table
WHERE date = CURRENT_DATE()
LIMIT 10;

Caching

  • BigQuery caches query results for 24β€―hours (free!).
  • When you run the same query twice:
    1. First run – scans data, incurs cost.
    2. Second run – returns cached result, FREE.

Cache is invalidated when:

  • Underlying table data changes.
  • 24β€―hours have passed.
  • You disable caching in query settings.
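Those rules can be modeled with a toy TTL cache in Python. This is purely illustrative: BigQuery’s real result cache is server‑side and automatic, and the `table_version` knob here just stands in for β€œunderlying data changed”.

```python
# Toy model of a 24-hour query-result cache with invalidation
# on data change (version bump) or expiry.

import time

CACHE_TTL = 24 * 3600  # seconds
_cache = {}            # query text -> (result, cached_at, table_version)

def run_query(sql, table, table_version, now=None):
    """Return (result, was_cached). A cache hit is 'free'; a miss
    re-runs the (billed) scan and refreshes the cache entry."""
    now = time.time() if now is None else now
    hit = _cache.get(sql)
    if hit and now - hit[1] < CACHE_TTL and hit[2] == table_version:
        return hit[0], True                # served from cache
    result = sum(table)                    # pretend this is the billed scan
    _cache[sql] = (result, now, table_version)
    return result, False

table = [1, 2, 3]
r1, cached1 = run_query("SELECT SUM(x)", table, table_version=1, now=0)
r2, cached2 = run_query("SELECT SUM(x)", table, table_version=1, now=60)
r3, cached3 = run_query("SELECT SUM(x)", table, table_version=1, now=CACHE_TTL + 1)
print(cached1, cached2, cached3)  # False True False
```

The second run hits the cache; the third misses because the 24‑hour window has passed, mirroring the invalidation rules above.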

Tags

#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
