第2部分：dbt 项目结构与构建模型 📁

发布: 2天前 (2026年2月17日 GMT+8 06:45)

8 分钟阅读

原文: Dev.to

Source: Dev.to

数据工程Zoomcamp #dbt #AnalyticsEngineering #DataModeling

为什么要建模数据？ 📐

原始数据杂乱且难以查询。维度建模将数据组织成以下结构：

易于理解
查询快速
灵活，适用于不同的分析

Source: …

事实表 vs. 维度表

这是维度建模（也称为 星型模式）的核心。

事实表 (`fct_`)

包含度量或事件。

每个发生的事物对应一行
通常拥有大量行（数百万/数十亿）
包含你想要分析的数值

示例

fct_trips – 每次出租车行程对应一行
fct_sales – 每笔销售对应一行
fct_orders – 每个订单对应一行

-- Example fact table
CREATE TABLE fct_trips AS
SELECT
    trip_id,            -- unique identifier
    pickup_datetime,    -- when it happened
    dropoff_datetime,
    pickup_zone_id,     -- foreign keys to dimensions
    dropoff_zone_id,
    fare_amount,        -- numeric measures
    tip_amount,
    total_amount
FROM transformed_trips;

维度表 (`dim_`)

包含属性或 描述性信息。

每个实体对应一行
通常行数较少
为事实表提供上下文

示例

dim_zones – 每个出租车区域对应一行
dim_customers – 每位客户对应一行
dim_products – 每个产品对应一行

-- Example dimension table
CREATE TABLE dim_zones AS
SELECT
    location_id,   -- primary key
    borough,       -- descriptive attributes
    zone_name,
    service_zone
FROM zone_lookup;

星形模式 ⭐

当你将事实表和维度表连接时，会得到一个星形结构：

                    ┌──────────────┐
                    │  dim_zones   │
                    │  (pickup)    │
                    └───────┬──────┘
                            │
┌──────────────┐    ┌───────┴──────┐    ┌──────────────┐
│  dim_vendors │────│  fct_trips   │────│  dim_zones   │
│              │    │  (center)   │    │  (dropoff)   │
└──────────────┘    └───────┬──────┘    └──────────────┘
                            │
                    ┌───────┴──────┐
                    │ dim_payment  │
                    │    types     │
                    └──────────────┘

为什么它如此强大

-- Easy to answer business questions!
SELECT 
    z.borough,
    COUNT(*) AS trip_count,
    SUM(f.total_amount) AS total_revenue
FROM fct_trips f
JOIN dim_zones z
  ON f.pickup_zone_id = z.location_id
GROUP BY z.borough
ORDER BY total_revenue DESC;

dbt 项目结构

一个 dbt 项目拥有特定的文件夹布局。了解它有助于你在任何项目中快速定位。

taxi_rides_ny/
├── dbt_project.yml      # Project configuration (most important!)
├── profiles.yml         # Database connection (often in ~/.dbt/)
├── packages.yml         # External packages to install
│
├── models/              # ⭐ YOUR SQL MODELS LIVE HERE
│   ├── staging/         # Raw data, minimally cleaned
│   ├── intermediate/   # Complex transformations
│   └── marts/           # Final, business‑ready tables
│
├── seeds/               # CSV files to load as tables
├── macros/              # Reusable SQL functions
├── tests/               # Custom test files
├── snapshots/           # Track data changes over time
└── analysis/            # Ad‑hoc queries (not built)

`dbt_project.yml` 文件

这是 最重要的文件——dbt 会首先查找它。

name: 'taxi_rides_ny'
version: '1.0.0'
profile: 'taxi_rides_ny'  # Must match profiles.yml!

# Default configurations
models:
  taxi_rides_ny:
    staging:
      materialized: view   # Staging models become views
    marts:
      materialized: table  # Mart models become tables

三层模型

dbt 推荐将模型组织为三层。

1. Staging 层 (`staging/`)

目的： 原始数据的干净副本，进行最小的转换。

这里发生的事情

重命名列（snake_case，清晰的名称）
转换数据类型
过滤明显的错误数据

保持与源的 1:1 关系（相同的行，类似的列）。

-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

with tripdata as (
    select *
    from {{ source('staging', 'green_tripdata') }}
    where vendorid is not null  -- filter bad data
)

select
    cast(vendorid as integer)               as vendor_id,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    cast(pulocationid as integer)           as pickup_location_id,
    cast(dolocationid as integer)           as dropoff_location_id,
    cast(passenger_count as integer)        as passenger_count,
    cast(trip_distance as numeric)          as trip_distance,
    cast(fare_amount as numeric)            as fare_amount,
    cast(total_amount as numeric)           as total_amount
from tripdata;

2. Intermediate 层 (`intermediate/`)

目的： 复杂转换、连接、业务逻辑。

这里发生的事情

合并多个 staging 模型
应用业务规则
大量数据操作

不向最终用户暴露。

-- models/intermediate/int_trips_unioned.sql
with green_trips as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips;

3. Marts 层 (`marts/`)

目的： 为最终用户准备的业务就绪表。

(继续在此构建你的 mart 模型。)

这里发生的事情：

最终的事实表和维度表
可用于仪表盘和报告

这些表才应该暴露给 BI 工具！

-- models/marts/fct_trips.sql
{{ config(materialized='table') }}

SELECT
    t.trip_id,
    t.service_type,
    t.pickup_datetime,
    t.dropoff_datetime,
    t.pickup_location_id,
    t.dropoff_location_id,
    z_pickup.zone   AS pickup_zone,
    z_dropoff.zone  AS dropoff_zone,
    t.passenger_count,
    t.trip_distance,
    t.fare_amount,
    t.total_amount
FROM {{ ref('int_trips_unioned') }} t
LEFT JOIN {{ ref('dim_zones') }} z_pickup
    ON t.pickup_location_id = z_pickup.location_id
LEFT JOIN {{ ref('dim_zones') }} z_dropoff
    ON t.dropoff_location_id = z_dropoff.location_id;

数据源与 `source()` 函数 📥

什么是数据源？

数据源告诉 dbt 原始数据在仓库中的位置。它们在 YAML 文件中定义：

# models/staging/sources.yml
version: 2

sources:
  - name: staging           # 逻辑名称（由你自行决定）
    database: my_project    # 你的 GCP 项目或数据库
    schema: nytaxi          # BigQuery 数据集或模式
    tables:
      - name: green_tripdata
      - name: yellow_tripdata

使用 `source()` 函数

不要硬编码表名，而是使用 source()：

-- ❌ 错误 – 硬编码路径
SELECT * FROM my_project.nytaxi.green_tripdata;

-- ✅ 正确 – 使用 source()
SELECT * FROM {{ source('staging', 'green_tripdata') }};

优势

只需在一个位置（YAML 文件）更改数据库/模式
dbt 自动跟踪依赖关系
可以为数据源添加新鲜度测试

`ref()` 函数 – 构建依赖关系 🔗

这是 最重要的 dbt 函数！

功能	何时使用	示例
`source()`	读取原始/外部数据	`{{ source('staging', 'green_tripdata') }}`
`ref()`	读取另一个 dbt 模型	`{{ ref('stg_green_tripdata') }}`

`ref()` 的工作原理

-- models/marts/fct_trips.sql
SELECT *
FROM {{ ref('int_trips_unioned') }}  -- References the int_trips_unioned model

ref() 的作用

✅ 解析为正确的 schema/表名
✅ 自动构建依赖图
✅ 确保模型按正确顺序运行

DAG（有向无环图）

dbt 根据你的 ref() 调用构建 依赖图：

┌──────────────────┐     ┌──────────────────┐
│ stg_green_trips  │     │ stg_yellow_trips │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ int_trips_unioned│
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │    fct_trips     │
         └──────────────────┘

当你运行 dbt build 时，模型会自动按依赖顺序运行！

Seeds – 加载 CSV 文件 🌱

Seeds 让你可以将小型 CSV 文件加载到数据仓库中，作为表使用。

何时使用 Seeds

适用场景

查找表（区域名称、国家代码）
静态映射（供应商 ID → 供应商名称）
很少变动的小型参考数据

不适用场景

大规模数据集（请使用正式的数据加载方式）
经常变更的数据

如何使用 Seeds

将 CSV 文件放入 seeds/ 目录：

seeds/
└── taxi_zone_lookup.csv

CSV 示例（taxi_zone_lookup.csv）：

locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
...

运行 seed 命令：

dbt seed

在模型中使用 ref() 引用 seed：

-- models/marts/dim_zones.sql
SELECT
    locationid   AS location_id,
    borough,
    zone,
    service_zone
FROM {{ ref('taxi_zone_lookup') }};

第2部分：dbt 项目结构与构建模型 📁

为什么要建模数据？ 📐

事实表 vs. 维度表

事实表 (`fct_`)

维度表 (`dim_`)

星形模式 ⭐

为什么它如此强大

dbt 项目结构

`dbt_project.yml` 文件

三层模型

1. Staging 层 (`staging/`)

2. Intermediate 层 (`intermediate/`)

3. Marts 层 (`marts/`)

数据源与 `source()` 函数 📥

什么是数据源？

使用 `source()` 函数

`ref()` 函数 – 构建依赖关系 🔗

`ref()` 的工作原理

DAG（有向无环图）

Seeds – 加载 CSV 文件 🌱

何时使用 Seeds

如何使用 Seeds

相关文章

第3部分：测试、文档与部署 🚀

Star Schema vs. Snowflake Schema：何时使用每种

Power BI 中的模式与数据建模

两个改变你编写查询方式的 ClickHouse 内部机制

为什么要建模数据？ 📐

事实表 vs. 维度表

事实表 (fct_)

维度表 (dim_)

星形模式 ⭐

为什么它如此强大

dbt 项目结构

dbt_project.yml 文件

三层模型

1. Staging 层 (staging/)

2. Intermediate 层 (intermediate/)

3. Marts 层 (marts/)

数据源与 source() 函数 📥

什么是数据源？

使用 source() 函数

ref() 函数 – 构建依赖关系 🔗

ref() 的工作原理

DAG（有向无环图）

Seeds – 加载 CSV 文件 🌱

何时使用 Seeds

如何使用 Seeds

相关文章

第3部分：测试、文档与部署 🚀

Star Schema vs. Snowflake Schema：何时使用每种

Power BI 中的模式与数据建模

两个改变你编写查询方式的 ClickHouse 内部机制

事实表 (`fct_`)

维度表 (`dim_`)

`dbt_project.yml` 文件

1. Staging 层 (`staging/`)

2. Intermediate 层 (`intermediate/`)

3. Marts 层 (`marts/`)

数据源与 `source()` 函数 📥

使用 `source()` 函数

`ref()` 函数 – 构建依赖关系 🔗

`ref()` 的工作原理