Part 3: Testing, Documentation & Deployment
Source: Dev.to
Data Engineering Zoomcamp
#dbt #AnalyticsEngineering #DataModeling
Macros – Reusable SQL Functions
Without macros (repeated code)
-- Repeated everywhere (anti-pattern)
CASE
WHEN payment_type = 1 THEN 'Credit card'
WHEN payment_type = 2 THEN 'Cash'
WHEN payment_type = 3 THEN 'No charge'
WHEN payment_type = 4 THEN 'Dispute'
WHEN payment_type = 5 THEN 'Unknown'
ELSE 'Unknown'
END AS payment_type_description
With macros (write once)
-- macros/get_payment_type_description.sql
{% macro get_payment_type_description(payment_type) %}
CASE {{ payment_type }}
WHEN 1 THEN 'Credit card'
WHEN 2 THEN 'Cash'
WHEN 3 THEN 'No charge'
WHEN 4 THEN 'Dispute'
WHEN 5 THEN 'Unknown'
ELSE 'Unknown'
END
{% endmacro %}
Use it in any model
-- models/staging/stg_green_tripdata.sql
SELECT
payment_type,
{{ get_payment_type_description('payment_type') }} AS payment_type_description
FROM {{ source('staging', 'green_tripdata') }}
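When dbt compiles the model, the macro body is inlined into the query. The compiled SQL in `target/compiled/` looks roughly like this (exact table quoting and project names depend on your warehouse; the ones below are placeholders):

```sql
-- Approximate compiled output; project/dataset names are placeholders
SELECT
    payment_type,
    CASE payment_type
        WHEN 1 THEN 'Credit card'
        WHEN 2 THEN 'Cash'
        WHEN 3 THEN 'No charge'
        WHEN 4 THEN 'Dispute'
        WHEN 5 THEN 'Unknown'
        ELSE 'Unknown'
    END AS payment_type_description
FROM `my-project`.`staging`.`green_tripdata`
```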
Jinja Syntax in dbt
| Syntax | Purpose | Example |
|---|---|---|
| `{{ }}` | Output expression | `{{ ref('my_model') }}` |
| `{% %}` | Logic / control flow | `{% if is_incremental() %}` |
| `{# #}` | Comment | `{# This is a comment #}` |
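As a sketch of how the three syntaxes combine, here is a hypothetical model that loops over the known payment types and generates one count column per type (the column names are made up for illustration):

```sql
{# One count column per payment type, generated at compile time #}
{% set payment_types = [1, 2, 3, 4, 5] %}
SELECT
    {% for pt in payment_types %}
    SUM(CASE WHEN payment_type = {{ pt }} THEN 1 ELSE 0 END)
        AS payment_type_{{ pt }}_count{% if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ source('staging', 'green_tripdata') }}
```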
Packages – Reuse macros & models built by others
| Package | What it Does |
|---|---|
| dbt_utils | Common SQL helpers (surrogate keys, pivot, etc.) |
| dbt_codegen | Auto-generate YAML and SQL |
| dbt_expectations | Great Expectations-style tests |
| dbt_audit_helper | Compare model outputs when refactoring |
Create packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
Install packages
dbt deps
Use a macro from a package
-- Using dbt_utils to generate surrogate keys
SELECT
{{ dbt_utils.generate_surrogate_key(['vendorid', 'pickup_datetime']) }} AS trip_id,
*
FROM {{ source('staging', 'green_tripdata') }}
Tests – Ensure data meets expectations
1. Generic Tests (most common)
Add them in a schema YAML file:
# models/staging/schema.yml
version: 2
models:
  - name: stg_green_tripdata
    columns:
      - name: trip_id
        tests:
          - unique      # No duplicate values
          - not_null    # No null values
      - name: payment_type
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]  # Allowed values only
      - name: pickup_location_id
        tests:
          - relationships:          # Referential integrity
              to: ref('dim_zones')
              field: location_id
| Test | Description |
|---|---|
| `unique` | No duplicate values in column |
| `not_null` | No NULL values in column |
| `accepted_values` | Values must be in the specified list |
| `relationships` | Values must exist in another table |
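You can also define your own generic tests as `{% test %}` macros and then reference them from YAML exactly like the built-ins. A minimal sketch (the test name `is_positive` is made up for illustration):

```sql
-- tests/generic/is_positive.sql (hypothetical file)
{% test is_positive(model, column_name) %}
-- Test fails if any rows are returned, i.e. if any value is negative
SELECT {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0
{% endtest %}
```

In a schema YAML file you would then add `- is_positive` under a column's `tests:` list.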
2. Singular (custom) Tests
Place a .sql file in the tests/ folder:
-- tests/assert_positive_fare_amount.sql
-- Test FAILS if any rows are returned
SELECT
trip_id,
fare_amount
FROM {{ ref('fct_trips') }}
WHERE fare_amount < 0
Documentation – Describe models in YAML
Descriptions added in your schema YAML files show up on the generated docs site. Example description for fct_trips:
"Fact table containing all taxi trips (yellow and green). One row per trip with fare details and zone information."
Generate & serve docs
dbt docs generate # Build the site
dbt docs serve # Open in a browser
The site includes:
- Model descriptions
- Column definitions
- Dependency graph (visual DAG)
- Source information
Common dbt Commands
| Command | What it Does |
|---|---|
| `dbt run` | Build all models (create views/tables) |
| `dbt test` | Run all tests |
| `dbt build` | Run models and tests together (recommended) |
| `dbt compile` | Generate SQL without executing |
| `dbt debug` | Check connection & project configuration |
| `dbt seed` | Load seed CSV files |
| `dbt deps` | Install packages |
| `dbt docs generate` | Build documentation |
| `dbt docs serve` | Serve docs locally |
| `dbt retry` | Retry failed models |
Selecting specific models and targets
# Single model
dbt run --select stg_green_tripdata
# Model + all upstream dependencies
dbt run --select +fct_trips
# Model + all downstream models
dbt run --select stg_green_tripdata+
# Both directions
dbt run --select +fct_trips+
# All models in a folder
dbt run --select staging.*
# Multiple models
dbt run --select stg_green_tripdata stg_yellow_tripdata
# Development (default target)
dbt run
# Production target
dbt run --target prod
Materializations – How dbt persists models
| Type | What it Creates | Typical Use Case |
|---|---|---|
| `view` | SQL view (query stored, runs on access) | Staging models, frequently changing logic |
| `table` | Physical table (data stored) | Final marts, large datasets, performance-critical queries |
| `incremental` | Appends new data only | Very large tables, event-style data |
| `ephemeral` | Not created (inlined as a CTE downstream) | Helper models, intermediate steps |
Set materialization in a model file
{{ config(materialized='table') }}
SELECT *
FROM {{ ref('stg_trips') }}
Or project-wide in dbt_project.yml
models:
  my_project:
    staging:
      materialized: view
    marts:
      materialized: table
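For the incremental materialization, the common pattern is to filter to new rows only on incremental runs. A minimal sketch, assuming a `pickup_datetime` column exists in the staging model:

```sql
{{ config(materialized='incremental') }}

SELECT *
FROM {{ ref('stg_green_tripdata') }}
{% if is_incremental() %}
-- Only rows newer than what is already in this model's table;
-- {{ this }} refers to the existing target table
WHERE pickup_datetime > (SELECT MAX(pickup_datetime) FROM {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`) the `if is_incremental()` block is skipped and the whole table is built.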
Quick Decision Helper
"Should I use a view or a table?"

Use a view when the underlying logic changes often or the dataset is small enough that recomputing on each query is cheap.
Use a table when you need persisted data for performance, downstream consumption, or when the dataset is large and expensive to recompute.

Decision Flow

Is the query expensive?
├─ Yes → TABLE
└─ No  → VIEW
Use VIEW when
- Staging models (simple transformations)
- Logic changes frequently
- Storage cost is a concern
Use TABLE when
- Final marts are queried often
- Complex joins / aggregations
- Query performance matters
Project Overview – NYC Taxi Data

RAW DATA
  green_tripdata (GCS/BigQuery) · yellow_tripdata (GCS/BigQuery)
            │
            ▼
STAGING LAYER
  stg_green_tripdata · stg_yellow_tripdata (cleaned, renamed)
            │
            ▼
INTERMEDIATE LAYER
  int_trips_unioned (green + yellow combined)
            │
            ▼
MARTS LAYER
  dim_zones (dimension) · fct_trips (fact) · fct_monthly_zone_revenue (report)
Model Catalog
| Model | Type | Description |
|---|---|---|
| `stg_green_tripdata` | Staging | Cleaned green taxi data |
| `stg_yellow_tripdata` | Staging | Cleaned yellow taxi data |
| `int_trips_unioned` | Intermediate | Combined yellow + green trips |
| `dim_zones` | Dimension | Zone lookup table |
| `fct_trips` | Fact | One row per trip |
| `fct_monthly_zone_revenue` | Report | Monthly revenue by zone |
Local Development (DuckDB)
Pros: Free, no cloud account needed
Cons: Limited to your machine's resources
# 1. Install dbt with the DuckDB adapter
pip install dbt-duckdb
# 2. Clone the project
git clone https://github.com/DataTalksClub/data-engineering-zoomcamp
cd data-engineering-zoomcamp/04-analytics-engineering/taxi_rides_ny
# 3. Create `profiles.yml` in `~/.dbt/`
# 4. Test the connection
dbt debug
# 5. Build the project
dbt build --target prod
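For step 3, a minimal `profiles.yml` for the DuckDB adapter might look like the sketch below. The profile name and database file names are assumptions – the top-level key must match the `profile:` setting in your `dbt_project.yml`:

```yaml
# ~/.dbt/profiles.yml
taxi_rides_ny:              # must match `profile:` in dbt_project.yml (assumed name)
  target: dev
  outputs:
    dev:
      type: duckdb
      path: taxi_rides_dev.duckdb   # local database file (hypothetical name)
    prod:
      type: duckdb
      path: taxi_rides_prod.duckdb
```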
Cloud Development (dbt Cloud + BigQuery)
Pros: Powerful, team collaboration, scheduler
Cons: Requires a GCP account (free tier available)
- Create a dbt Cloud account (free).
- Connect it to your BigQuery project.
- Clone the repo in the dbt Cloud IDE.
- Run:
dbt build --target prod
Troubleshooting Common Issues
"Profile not found"
- Verify that the profile name in `dbt_project.yml` matches the one in `profiles.yml`.
- Ensure `profiles.yml` resides in `~/.dbt/`.
"Source not found"
- Check database/schema names in `sources.yml`.
- Confirm the data is loaded in the warehouse.
- Look for typos in `ref()` calls and make sure the referenced model exists.
Optional memory tweak (profiles.yml):
settings:
  memory_limit: '2GB'
Key Concepts
- Analytics Engineering bridges data engineering and data analysis.
- dbt brings software-engineering best practices to SQL transformations.
- Dimensional modeling organizes data into facts (events) and dimensions (attributes).
- Three layers: staging (raw copy), intermediate (transformations), marts (final consumable tables).
- `ref()` and `source()` are the primary functions for building model dependencies.
- Testing ensures data quality: use `unique`, `not_null`, `accepted_values`, `relationships`.
- Documentation is auto-generated from YAML descriptions.
- `dbt build` runs and tests everything in dependency order.
Additional Resources
- dbt Documentation
- dbt Fundamentals Course (free)
- SQL Refresher for Window Functions
- dbt Community Slack
Happy modeling!