Part 2: BigQuery Deep Dive
Source: Dev.to
What is BigQuery?
BigQuery is Google's fully managed, serverless data warehouse in the cloud. It's a popular choice for storing and analyzing massive datasets because it is:
| Feature | Description |
|---|---|
| Serverless | No servers to manage: no software installation, disk-space worries, or maintenance. Google handles everything. |
| Fully managed | Google takes care of security, backups, scaling, and updates. |
| Petabyte-scale | Handles truly huge datasets (1 PB = 1,000 TB = 1,000,000 GB). |
| SQL-based | Write standard SQL queries; no new programming language to learn. |
Why BigQuery is Great for Beginners
- **No setup headaches**: create a project, load data, start querying.
- **Free tier**: 1 TB of queries and 10 GB of storage free each month.
- **Familiar SQL**: if you know basic SQL, you can use BigQuery.
- **Works with everything**: Google Sheets, Data Studio, Python, R, etc.
- **Built-in ML**: train machine-learning models using just SQL.
Understanding the architecture helps you write better queries and save money. Don't worry, we'll keep it simple!
BigQuery Architecture (HighβLevel)
Traditional databases store data and process queries on the same machine.
BigQuery separates the two:
```
┌─────────────────────────────────────────┐
│             YOUR SQL QUERY              │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│         DREMEL (Compute Engine)         │
│  • Query is broken into tiny pieces     │
│  • Thousands of workers run in parallel │
└────────────────────┬────────────────────┘
                     │
                     │  Jupiter Network (super-fast!)
                     │  1 TB per second
                     ▼
┌─────────────────────────────────────────┐
│            COLOSSUS (Storage)           │
│  • Columnar storage (organized          │
│    by columns)                          │
└─────────────────────────────────────────┘
```
Row-oriented vs. column-oriented storage
Traditional (row-oriented) table
| Row | Data |
|---|---|
| 1 | [John, 25, New York, $50,000] |
| 2 | [Jane, 30, Chicago, $60,000] |
| 3 | [Bob, 35, Miami, $55,000] |
To retrieve all salaries, the engine reads every row, even though only the salary column is needed.
BigQuery (column-oriented) table
| Column | Values |
|---|---|
| Names | [John, Jane, Bob] |
| Ages | [25, 30, 35] |
| Cities | [New York, Chicago, Miami] |
| Salaries | [$50,000, $60,000, $55,000] |
To retrieve all salaries, only the Salaries column is read, which is far faster and cheaper.
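A toy Python sketch (not BigQuery internals, just the idea) makes the difference concrete: the row layout forces us to touch every record to pull one field, while the column layout hands us the salary list directly. The data mirrors the tables above.

```python
# Row-oriented layout: one tuple per record, as in the "Traditional" table.
rows = [
    ("John", 25, "New York", 50_000),
    ("Jane", 30, "Chicago", 60_000),
    ("Bob", 35, "Miami", 55_000),
]
# To get salaries we must visit every row (and, on disk, read every field).
salaries_from_rows = [row[3] for row in rows]

# Column-oriented layout: each column stored together, as in the BigQuery table.
columns = {
    "name":   ["John", "Jane", "Bob"],
    "age":    [25, 30, 35],
    "city":   ["New York", "Chicago", "Miami"],
    "salary": [50_000, 60_000, 55_000],
}
# To get salaries we read exactly one list and ignore the rest.
salaries_from_columns = columns["salary"]

assert salaries_from_rows == salaries_from_columns == [50_000, 60_000, 55_000]
```

The result is the same either way; what changes is how much data has to be read to produce it.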
💡 Tip: `SELECT *` is expensive in BigQuery because it forces a scan of every column. Always specify only the columns you need.
Query Execution Flow
1. The Root Server receives your query.
2. The query is broken into smaller pieces.
3. Mixers distribute work to thousands of Leaf Nodes.
4. Each Leaf Node processes a small data chunk in parallel.
5. Results flow back up through the Mixers to the Root Server.
6. You receive the final result.
```
              ┌──────────┐
              │   ROOT   │  ← Your query arrives here
              └────┬─────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
    ┌───────┐  ┌───────┐  ┌───────┐
    │ MIXER │  │ MIXER │  │ MIXER │
    └───┬───┘  └───┬───┘  └───┬───┘
        │          │          │
     ┌──┼──┐    ┌──┼──┐    ┌──┼──┐
     ▼  ▼  ▼    ▼  ▼  ▼    ▼  ▼  ▼
    [L][L][L]  [L][L][L]  [L][L][L]
```
L = Leaf nodes (thousands of them!)
Why it matters: A query that would take hours on a laptop can finish in seconds because thousands of machines work on it simultaneously.
Working with Data in BigQuery
1. External Tables (data stays in Google Cloud Storage)
```sql
-- Create an external table that points to Parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/taxi_data/*.parquet']
);
```
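Once created, an external table is queried like any other table (the project, dataset, and table names here are the illustrative ones from the snippet above):

```sql
-- Each query reads the Parquet files from GCS on the fly
SELECT COUNT(*) AS trip_count
FROM `my-project.my_dataset.taxi_external`;
```

Note that because the data lives in GCS, BigQuery cannot tell you up front how many bytes this will scan.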
When to use external tables
| ✅ | Reason |
|---|---|
| ✅ | Save on storage costs (GCS is cheaper than BigQuery storage) |
| ✅ | One-time or occasional analysis |
| ✅ | Source data updates frequently |
| ✅ | Quick exploration before committing to a load |
Downsides
| ❌ | Issue |
|---|---|
| ❌ | Slower queries (data is read from GCS each time) |
| ❌ | No cost estimation before running queries |
| ❌ | Cannot partition or cluster (limited optimization) |
2. Native (internal) Tables (data copied into BigQuery storage)
```sql
-- Load data from an external table into a native BigQuery table
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native` AS
SELECT * FROM `my-project.my_dataset.taxi_external`;
```
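If you already know which columns you need, you can trim the table at load time and get partitioning for free. A sketch (the column names are illustrative, and partitioning itself is covered properly in Part 3):

```sql
-- Load only the needed columns, partitioned by pickup date
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native_partitioned`
PARTITION BY DATE(pickup_time) AS
SELECT pickup_time, dropoff_time, fare_amount
FROM `my-project.my_dataset.taxi_external`;
```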
When to use native tables
| ✅ | Reason |
|---|---|
| ✅ | Frequently queried data |
| ✅ | Best query performance |
| ✅ | Ability to partition and cluster |
| ✅ | Accurate cost estimates before running queries |
Downsides
| ❌ | Issue |
|---|---|
| ❌ | Higher storage costs |
| ❌ | Data duplication (exists in both GCS and BigQuery) |
💡 Pro tip: Start with external tables for exploration, then load into native tables once you know which data you actually need.
Pricing Models
| Model | Description | Ideal For |
|---|---|---|
| On-Demand ($5 per TB of data scanned) | Pay only for the data your queries read. | Occasional users, unpredictable workloads. |
| Flat-Rate (e.g., ~$2,000/month for 100 "slots", i.e. compute units) | Pay for dedicated compute capacity; unlimited queries within slot capacity. | Heavy users, predictable workloads. |
Before you run a query, BigQuery shows an estimate of how much data will be scanned.
```
┌─────────────────────────────────────────┐
│  Query Editor                           │
│  ───────────────────────────────────    │
│  SELECT * FROM my_table                 │
│  WHERE date = '2024-01-01'              │
│                                         │
│  Estimated bytes processed: 12.3 GB     │
└─────────────────────────────────────────┘
```
TL;DR
- BigQuery = serverless, fully managed, petabyte-scale, SQL-based data warehouse.
- Architecture separates compute (Dremel) from columnar storage (Colossus).
- External tables = cheap, flexible, slower.
- Native tables = fast, feature-rich, more expensive.
- Choose on-demand pricing for occasional use, flat-rate for heavy, predictable workloads.
Happy querying!
Cost-Saving Tips
Before you run anything, glance at the validator message in the console, e.g. "This query will process 2.5 GB when run". Check this every time!
Cost calculation:
- 2.5 GB = 0.0025 TB
- 0.0025 TB × $5 = $0.0125 (about 1 cent)

But if you run that query 100 times a day, the costs add up!
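The same arithmetic as a tiny Python helper, using the article's $5-per-TB on-demand rate and decimal units (a sketch for building intuition, not an official billing calculator):

```python
PRICE_PER_TB = 5.00  # on-demand rate used in this article, $ per TB scanned

def query_cost_usd(gb_scanned: float) -> float:
    """Cost of a single on-demand query that scans `gb_scanned` GB."""
    return gb_scanned / 1_000 * PRICE_PER_TB

one_run = query_cost_usd(2.5)    # about $0.0125, roughly a cent
per_month = one_run * 100 * 30   # 100 runs/day for 30 days: roughly $37.50
print(f"one run: ${one_run:.4f}, per month: ${per_month:.2f}")
```

A cent per run sounds harmless; tens of dollars a month for one repeated query is why the tips below matter.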
Never use `SELECT *` unless you absolutely need every column.
✅ Good vs. ❌ Bad
```sql
-- ❌ Bad: reads ALL columns
SELECT * FROM taxi_data;

-- ✅ Good: reads only what you need
SELECT pickup_time, dropoff_time, fare_amount
FROM taxi_data;
```
- Use partitioned tables (covered in Part 3).
- Preview before running: always check the estimated bytes.
LIMIT wisely
`LIMIT` does not reduce the amount of data scanned; the filtering happens after the table is read.
```sql
-- ❌ Still scans the whole table!
SELECT * FROM huge_table LIMIT 10;

-- ✅ Better: add a WHERE clause first
SELECT *
FROM huge_table
WHERE date = CURRENT_DATE()
LIMIT 10;
```
Caching
- BigQuery caches query results for 24 hours (free!).
- When you run the same query twice:
  - First run: scans data, incurs cost.
  - Second run: returns the cached result, FREE.
Cache is invalidated when:
- Underlying table data changes.
- 24 hours have passed.
- You disable caching in query settings.
Tags
#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing