Part 2: dbt Project Structure & Building Models π
Source: Dev.to
DataEngineeringZoomcampβ―#dbtβ―#AnalyticsEngineeringβ―#DataModeling
Why Model Data? π
Raw data is messy and hard to query. Dimensional modeling organizes data into a structure that is:
- Easy to understand
- Fast to query
- Flexible for different analyses
Fact Tables vs. Dimension Tables
This is the core of dimensional modeling (also called star schema).
Fact Tables (fct_)
Contain measurements or events.
- One row per thing that happened
- Usually have many rows (millions/billions)
- Contain numeric values you want to analyze
Examples
fct_tripsβ one row per taxi tripfct_salesβ one row per salefct_ordersβ one row per order
-- Example fact table
CREATE TABLE fct_trips AS
SELECT
trip_id, -- unique identifier
pickup_datetime, -- when it happened
dropoff_datetime,
pickup_zone_id, -- foreign keys to dimensions
dropoff_zone_id,
fare_amount, -- numeric measures
tip_amount,
total_amount
FROM transformed_trips;
Dimension Tables (dim_)
Contain attributes or descriptive information.
- One row per entity
- Usually fewer rows
- Provide context for fact tables
Examples
dim_zonesβ one row per taxi zonedim_customersβ one row per customerdim_productsβ one row per product
-- Example dimension table
CREATE TABLE dim_zones AS
SELECT
location_id, -- primary key
borough, -- descriptive attributes
zone_name,
service_zone
FROM zone_lookup;
The Star Schema β
When you join facts and dimensions, you get a star shape:
ββββββββββββββββ
β dim_zones β
β (pickup) β
βββββββββ¬βββββββ
β
ββββββββββββββββ βββββββββ΄βββββββ ββββββββββββββββ
β dim_vendors ββββββ fct_trips ββββββ dim_zones β
β β β (center) β β (dropoff) β
ββββββββββββββββ βββββββββ¬βββββββ ββββββββββββββββ
β
βββββββββ΄βββββββ
β dim_payment β
β types β
ββββββββββββββββ
Why itβs powerful
-- Easy to answer business questions!
SELECT
z.borough,
COUNT(*) AS trip_count,
SUM(f.total_amount) AS total_revenue
FROM fct_trips f
JOIN dim_zones z
ON f.pickup_zone_id = z.location_id
GROUP BY z.borough
ORDER BY total_revenue DESC;
dbt Project Structure
A dbt project has a specific folder layout. Understanding it helps you navigate any project.
taxi_rides_ny/
βββ dbt_project.yml # Project configuration (most important!)
βββ profiles.yml # Database connection (often in ~/.dbt/)
βββ packages.yml # External packages to install
β
βββ models/ # β YOUR SQL MODELS LIVE HERE
β βββ staging/ # Raw data, minimally cleaned
β βββ intermediate/ # Complex transformations
β βββ marts/ # Final, businessβready tables
β
βββ seeds/ # CSV files to load as tables
βββ macros/ # Reusable SQL functions
βββ tests/ # Custom test files
βββ snapshots/ # Track data changes over time
βββ analysis/ # Adβhoc queries (not built)
The dbt_project.yml File
This is the most important fileβdbt looks for it first.
name: 'taxi_rides_ny'
version: '1.0.0'
profile: 'taxi_rides_ny' # Must match profiles.yml!
# Default configurations
models:
taxi_rides_ny:
staging:
materialized: view # Staging models become views
marts:
materialized: table # Mart models become tables
The Three Model Layers
dbt recommends organizing models into three layers.
1. Staging Layer (staging/)
Purpose: Clean copy of raw data with minimal transformations.
What happens here
- Rename columns (snake_case, clear names)
- Cast data types
- Filter obviously bad data
Keep a 1:1 relationship with the source (same rows, similar columns).
-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}
with tripdata as (
select *
from {{ source('staging', 'green_tripdata') }}
where vendorid is not null -- filter bad data
)
select
cast(vendorid as integer) as vendor_id,
cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
cast(pulocationid as integer) as pickup_location_id,
cast(dolocationid as integer) as dropoff_location_id,
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as numeric) as trip_distance,
cast(fare_amount as numeric) as fare_amount,
cast(total_amount as numeric) as total_amount
from tripdata;
2. Intermediate Layer (intermediate/)
Purpose: Complex transformations, joins, business logic.
What happens here
- Combine multiple staging models
- Apply business rules
- Heavy data manipulation
Not exposed to end users.
-- models/intermediate/int_trips_unioned.sql
with green_trips as (
select *, 'Green' as service_type
from {{ ref('stg_green_tripdata') }}
),
yellow_trips as (
select *, 'Yellow' as service_type
from {{ ref('stg_yellow_tripdata') }}
)
select * from green_trips
union all
select * from yellow_trips;
3. Marts Layer (marts/)
Purpose: Final, businessβready tables for end users.
(Continue building your mart models here.)
What happens here:
- Final fact and dimension tables
- Ready for dashboards and reports
Only these should be exposed to BI tools!
-- models/marts/fct_trips.sql
{{ config(materialized='table') }}
SELECT
t.trip_id,
t.service_type,
t.pickup_datetime,
t.dropoff_datetime,
t.pickup_location_id,
t.dropoff_location_id,
z_pickup.zone AS pickup_zone,
z_dropoff.zone AS dropoff_zone,
t.passenger_count,
t.trip_distance,
t.fare_amount,
t.total_amount
FROM {{ ref('int_trips_unioned') }} t
LEFT JOIN {{ ref('dim_zones') }} z_pickup
ON t.pickup_location_id = z_pickup.location_id
LEFT JOIN {{ ref('dim_zones') }} z_dropoff
ON t.dropoff_location_id = z_dropoff.location_id;
Sources and the source() Function π₯
What are Sources?
Sources tell dbt where your raw data lives in the warehouse. Theyβre defined in YAML files:
# models/staging/sources.yml
version: 2
sources:
- name: staging # Logical name (you choose)
database: my_project # Your GCP project or database
schema: nytaxi # BigQuery dataset or schema
tables:
- name: green_tripdata
- name: yellow_tripdata
Using the source() Function
Instead of hardβcoding table names, use source():
-- β Bad β hardcoded path
SELECT * FROM my_project.nytaxi.green_tripdata;
-- β
Good β using source()
SELECT * FROM {{ source('staging', 'green_tripdata') }};
Benefits
- Change database/schema in one place (YAML file)
- dbt tracks dependencies automatically
- Can add freshness tests on sources
The ref() Function β Building Dependencies π
This is the most important dbt function!
| Function | Use When | Example |
|---|---|---|
source() | Reading raw/external data | {{ source('staging', 'green_tripdata') }} |
ref() | Reading another dbt model | {{ ref('stg_green_tripdata') }} |
How ref() Works
-- models/marts/fct_trips.sql
SELECT *
FROM {{ ref('int_trips_unioned') }} -- References the int_trips_unioned model
What ref() does
- β Resolves to the correct schema/table name
- β Builds the dependency graph automatically
- β Ensures models run in the correct order
The DAG (Directed Acyclic Graph)
dbt builds a dependency graph from your ref() calls:
ββββββββββββββββββββ ββββββββββββββββββββ
β stg_green_trips β β stg_yellow_trips β
ββββββββββ¬ββββββββββ ββββββββββ¬ββββββββββ
β β
ββββββββββββ¬ββββββββββββββ
β
βΌ
ββββββββββββββββββββ
β int_trips_unionedβ
ββββββββββ¬ββββββββββ
β
βΌ
ββββββββββββββββββββ
β fct_trips β
ββββββββββββββββββββ
When you run dbt build, models run in dependency order automatically!
Seeds β Loading CSV Files π±
Seeds let you load small CSV files into your warehouse as tables.
When to Use Seeds
Good use cases
- Lookup tables (zone names, country codes)
- Static mappings (vendor ID β vendor name)
- Small reference data that rarely changes
Not good for
- Large datasets (use proper data loading)
- Frequently changing data
How to Use Seeds
Put CSV files in the seeds/ folder:
seeds/
βββ taxi_zone_lookup.csv
Example CSV (taxi_zone_lookup.csv):
locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
...
Run the seed command:
dbt seed
Reference the seed in models using ref():
-- models/marts/dim_zones.sql
SELECT
locationid AS location_id,
borough,
zone,
service_zone
FROM {{ ref('taxi_zone_lookup') }};