Part 2: dbt Project Structure & Building Models 📁

Published: 2 days ago (February 16, 2026 at 05:45 PM EST)

6 min read

Source: Dev.to

Source: Dev.to

DataEngineeringZoomcamp #dbt #AnalyticsEngineering #DataModeling

Why Model Data? 📐

Raw data is messy and hard to query. Dimensional modeling organizes data into a structure that is:

Easy to understand
Fast to query
Flexible for different analyses

Fact Tables vs. Dimension Tables

This is the core of dimensional modeling (also called star schema).

Fact Tables (`fct_`)

Contain measurements or events.

One row per thing that happened
Usually have many rows (millions/billions)
Contain numeric values you want to analyze

Examples

fct_trips – one row per taxi trip
fct_sales – one row per sale
fct_orders – one row per order

-- Example fact table
CREATE TABLE fct_trips AS
SELECT
    trip_id,            -- unique identifier
    pickup_datetime,    -- when it happened
    dropoff_datetime,
    pickup_zone_id,     -- foreign keys to dimensions
    dropoff_zone_id,
    fare_amount,        -- numeric measures
    tip_amount,
    total_amount
FROM transformed_trips;

Dimension Tables (`dim_`)

Contain attributes or descriptive information.

One row per entity
Usually fewer rows
Provide context for fact tables

Examples

dim_zones – one row per taxi zone
dim_customers – one row per customer
dim_products – one row per product

-- Example dimension table
CREATE TABLE dim_zones AS
SELECT
    location_id,   -- primary key
    borough,       -- descriptive attributes
    zone_name,
    service_zone
FROM zone_lookup;

The Star Schema ⭐

When you join facts and dimensions, you get a star shape:

                    ┌──────────────┐
                    │  dim_zones   │
                    │  (pickup)    │
                    └───────┬──────┘
                            │
┌──────────────┐    ┌───────┴──────┐    ┌──────────────┐
│  dim_vendors │────│  fct_trips   │────│  dim_zones   │
│              │    │  (center)   │    │  (dropoff)   │
└──────────────┘    └───────┬──────┘    └──────────────┘
                            │
                    ┌───────┴──────┐
                    │ dim_payment  │
                    │    types     │
                    └──────────────┘

Why it’s powerful

-- Easy to answer business questions!
SELECT 
    z.borough,
    COUNT(*) AS trip_count,
    SUM(f.total_amount) AS total_revenue
FROM fct_trips f
JOIN dim_zones z
  ON f.pickup_zone_id = z.location_id
GROUP BY z.borough
ORDER BY total_revenue DESC;

dbt Project Structure

A dbt project has a specific folder layout. Understanding it helps you navigate any project.

taxi_rides_ny/
├── dbt_project.yml      # Project configuration (most important!)
├── profiles.yml         # Database connection (often in ~/.dbt/)
├── packages.yml         # External packages to install
│
├── models/              # ⭐ YOUR SQL MODELS LIVE HERE
│   ├── staging/         # Raw data, minimally cleaned
│   ├── intermediate/   # Complex transformations
│   └── marts/           # Final, business‑ready tables
│
├── seeds/               # CSV files to load as tables
├── macros/              # Reusable SQL functions
├── tests/               # Custom test files
├── snapshots/           # Track data changes over time
└── analysis/            # Ad‑hoc queries (not built)

The `dbt_project.yml` File

This is the most important file—dbt looks for it first.

name: 'taxi_rides_ny'
version: '1.0.0'
profile: 'taxi_rides_ny'  # Must match profiles.yml!

# Default configurations
models:
  taxi_rides_ny:
    staging:
      materialized: view   # Staging models become views
    marts:
      materialized: table  # Mart models become tables

The Three Model Layers

dbt recommends organizing models into three layers.

1. Staging Layer (`staging/`)

Purpose: Clean copy of raw data with minimal transformations.

What happens here

Rename columns (snake_case, clear names)
Cast data types
Filter obviously bad data

Keep a 1:1 relationship with the source (same rows, similar columns).

-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

with tripdata as (
    select *
    from {{ source('staging', 'green_tripdata') }}
    where vendorid is not null  -- filter bad data
)

select
    cast(vendorid as integer)               as vendor_id,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    cast(pulocationid as integer)           as pickup_location_id,
    cast(dolocationid as integer)           as dropoff_location_id,
    cast(passenger_count as integer)        as passenger_count,
    cast(trip_distance as numeric)          as trip_distance,
    cast(fare_amount as numeric)            as fare_amount,
    cast(total_amount as numeric)           as total_amount
from tripdata;

2. Intermediate Layer (`intermediate/`)

Purpose: Complex transformations, joins, business logic.

What happens here

Combine multiple staging models
Apply business rules
Heavy data manipulation

Not exposed to end users.

-- models/intermediate/int_trips_unioned.sql
with green_trips as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips;

3. Marts Layer (`marts/`)

Purpose: Final, business‑ready tables for end users.

(Continue building your mart models here.)

What happens here:

Final fact and dimension tables
Ready for dashboards and reports

Only these should be exposed to BI tools!

-- models/marts/fct_trips.sql
{{ config(materialized='table') }}

SELECT
    t.trip_id,
    t.service_type,
    t.pickup_datetime,
    t.dropoff_datetime,
    t.pickup_location_id,
    t.dropoff_location_id,
    z_pickup.zone   AS pickup_zone,
    z_dropoff.zone  AS dropoff_zone,
    t.passenger_count,
    t.trip_distance,
    t.fare_amount,
    t.total_amount
FROM {{ ref('int_trips_unioned') }} t
LEFT JOIN {{ ref('dim_zones') }} z_pickup
    ON t.pickup_location_id = z_pickup.location_id
LEFT JOIN {{ ref('dim_zones') }} z_dropoff
    ON t.dropoff_location_id = z_dropoff.location_id;

Sources and the `source()` Function 📥

What are Sources?

Sources tell dbt where your raw data lives in the warehouse. They’re defined in YAML files:

# models/staging/sources.yml
version: 2

sources:
  - name: staging           # Logical name (you choose)
    database: my_project    # Your GCP project or database
    schema: nytaxi          # BigQuery dataset or schema
    tables:
      - name: green_tripdata
      - name: yellow_tripdata

Using the `source()` Function

Instead of hard‑coding table names, use source():

-- ❌ Bad – hardcoded path
SELECT * FROM my_project.nytaxi.green_tripdata;

-- ✅ Good – using source()
SELECT * FROM {{ source('staging', 'green_tripdata') }};

Benefits

Change database/schema in one place (YAML file)
dbt tracks dependencies automatically
Can add freshness tests on sources

The `ref()` Function – Building Dependencies 🔗

This is the most important dbt function!

Function	Use When	Example
`source()`	Reading raw/external data	`{{ source('staging', 'green_tripdata') }}`
`ref()`	Reading another dbt model	`{{ ref('stg_green_tripdata') }}`

How `ref()` Works

-- models/marts/fct_trips.sql
SELECT *
FROM {{ ref('int_trips_unioned') }}  -- References the int_trips_unioned model

What ref() does

✅ Resolves to the correct schema/table name
✅ Builds the dependency graph automatically
✅ Ensures models run in the correct order

The DAG (Directed Acyclic Graph)

dbt builds a dependency graph from your ref() calls:

┌──────────────────┐     ┌──────────────────┐
│ stg_green_trips  │     │ stg_yellow_trips │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ int_trips_unioned│
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │    fct_trips     │
         └──────────────────┘

When you run dbt build, models run in dependency order automatically!

Seeds – Loading CSV Files 🌱

Seeds let you load small CSV files into your warehouse as tables.

When to Use Seeds

Good use cases

Lookup tables (zone names, country codes)
Static mappings (vendor ID → vendor name)
Small reference data that rarely changes

Not good for

Large datasets (use proper data loading)
Frequently changing data

How to Use Seeds

Put CSV files in the seeds/ folder:

seeds/
└── taxi_zone_lookup.csv

Example CSV (taxi_zone_lookup.csv):

locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
...

Run the seed command:

dbt seed

Reference the seed in models using ref():

-- models/marts/dim_zones.sql
SELECT
    locationid   AS location_id,
    borough,
    zone,
    service_zone
FROM {{ ref('taxi_zone_lookup') }};

Part 2: dbt Project Structure & Building Models 📁

Why Model Data? 📐

Fact Tables vs. Dimension Tables

Fact Tables (`fct_`)

Dimension Tables (`dim_`)

The Star Schema ⭐

Why it’s powerful

dbt Project Structure

The `dbt_project.yml` File

The Three Model Layers

1. Staging Layer (`staging/`)

2. Intermediate Layer (`intermediate/`)

3. Marts Layer (`marts/`)

Sources and the `source()` Function 📥

What are Sources?

Using the `source()` Function

The `ref()` Function – Building Dependencies 🔗

How `ref()` Works

The DAG (Directed Acyclic Graph)

Seeds – Loading CSV Files 🌱

When to Use Seeds

How to Use Seeds

Related posts

Part 3: Testing, Documentation & Deployment 🚀

Star Schema vs. Snowflake Schema: When to Use Each

SCHEMAS & DATA MODELLING IN POWER BI

Arquitetura de Alta Performance: O 'Sob o Capô' da Modern Data Stack

Why Model Data? 📐

Fact Tables vs. Dimension Tables

Fact Tables (fct_)

Dimension Tables (dim_)

The Star Schema ⭐

Why it’s powerful

dbt Project Structure

The dbt_project.yml File

The Three Model Layers

1. Staging Layer (staging/)

2. Intermediate Layer (intermediate/)

3. Marts Layer (marts/)

Sources and the source() Function 📥

What are Sources?

Using the source() Function

The ref() Function – Building Dependencies 🔗

How ref() Works

The DAG (Directed Acyclic Graph)

Seeds – Loading CSV Files 🌱

When to Use Seeds

How to Use Seeds

Related posts

Part 3: Testing, Documentation & Deployment 🚀

Star Schema vs. Snowflake Schema: When to Use Each

SCHEMAS & DATA MODELLING IN POWER BI

Arquitetura de Alta Performance: O 'Sob o Capô' da Modern Data Stack

Fact Tables (`fct_`)

Dimension Tables (`dim_`)

The `dbt_project.yml` File

1. Staging Layer (`staging/`)

2. Intermediate Layer (`intermediate/`)

3. Marts Layer (`marts/`)

Sources and the `source()` Function 📥

Using the `source()` Function

The `ref()` Function – Building Dependencies 🔗

How `ref()` Works