파트 2: dbt 프로젝트 구조 및 모델 구축 📁

발행: 2일 전 (2026년 2월 17일 오전 07:45 GMT+9)

9 분 소요

원문: Dev.to

Source: Dev.to

데이터 엔지니어링 줌캠프 #dbt #AnalyticsEngineering #DataModeling

왜 모델 데이터를 사용하나요? 📐

원시 데이터는 지저분하고 쿼리하기 어렵습니다. 차원 모델링은 데이터를 다음과 같은 구조로 조직합니다:

이해하기 쉬움
쿼리 속도가 빠름
다양한 분석에 유연함

사실 테이블 vs. 차원 테이블

이것은 차원 모델링(별 스키마라고도 함)의 핵심입니다.

사실 테이블 (`fct_`)

측정값 또는 이벤트를 포함합니다.

발생한 각 항목당 한 행
보통 행 수가 많습니다(수백만/수십억)
분석하려는 숫자 값을 포함합니다

예시

fct_trips – 택시 여행당 한 행
fct_sales – 판매당 한 행
fct_orders – 주문당 한 행

-- Example fact table
CREATE TABLE fct_trips AS
SELECT
    trip_id,            -- unique identifier
    pickup_datetime,    -- when it happened
    dropoff_datetime,
    pickup_zone_id,     -- foreign keys to dimensions
    dropoff_zone_id,
    fare_amount,        -- numeric measures
    tip_amount,
    total_amount
FROM transformed_trips;

차원 테이블 (`dim_`)

속성 또는 설명 정보를 포함합니다.

각 엔터티당 한 행
보통 행 수가 적습니다
사실 테이블에 대한 컨텍스트를 제공합니다

예시

dim_zones – 택시 구역당 한 행
dim_customers – 고객당 한 행
dim_products – 제품당 한 행

-- Example dimension table
CREATE TABLE dim_zones AS
SELECT
    location_id,   -- primary key
    borough,       -- descriptive attributes
    zone_name,
    service_zone
FROM zone_lookup;

스타 스키마 ⭐

사실 테이블과 차원 테이블을 조인하면 별 모양이 됩니다:

                    ┌──────────────┐
                    │  dim_zones   │
                    │  (pickup)    │
                    └───────┬──────┘
                            │
┌──────────────┐    ┌───────┴──────┐    ┌──────────────┐
│  dim_vendors │────│  fct_trips   │────│  dim_zones   │
│              │    │  (center)   │    │  (dropoff)   │
└──────────────┘    └───────┬──────┘    └──────────────┘
                            │
                    ┌───────┴──────┐
                    │ dim_payment  │
                    │    types     │
                    └──────────────┘

왜 강력한가

-- 비즈니스 질문에 쉽게 답할 수 있습니다!
SELECT 
    z.borough,
    COUNT(*) AS trip_count,
    SUM(f.total_amount) AS total_revenue
FROM fct_trips f
JOIN dim_zones z
  ON f.pickup_zone_id = z.location_id
GROUP BY z.borough
ORDER BY total_revenue DESC;

dbt 프로젝트 구조

dbt 프로젝트는 특정 폴더 레이아웃을 가지고 있습니다. 이를 이해하면 어떤 프로젝트든 쉽게 탐색할 수 있습니다.

taxi_rides_ny/
├── dbt_project.yml      # Project configuration (most important!)
├── profiles.yml         # Database connection (often in ~/.dbt/)
├── packages.yml         # External packages to install
│
├── models/              # ⭐ YOUR SQL MODELS LIVE HERE
│   ├── staging/         # Raw data, minimally cleaned
│   ├── intermediate/   # Complex transformations
│   └── marts/           # Final, business‑ready tables
│
├── seeds/               # CSV files to load as tables
├── macros/              # Reusable SQL functions
├── tests/               # Custom test files
├── snapshots/           # Track data changes over time
└── analysis/            # Ad‑hoc queries (not built)

`dbt_project.yml` 파일

이 파일은 가장 중요한 파일이며—dbt가 가장 먼저 찾습니다.

name: 'taxi_rides_ny'
version: '1.0.0'
profile: 'taxi_rides_ny'  # Must match profiles.yml!

# Default configurations
models:
  taxi_rides_ny:
    staging:
      materialized: view   # Staging models become views
    marts:
      materialized: table  # Mart models become tables

세 가지 모델 레이어

dbt는 모델을 세 레이어로 구성할 것을 권장합니다.

1. 스테이징 레이어 (`staging/`)

목적: 최소한의 변환으로 원시 데이터를 깨끗하게 복사합니다.

여기서 일어나는 일

컬럼 이름 변경 (snake_case, 명확한 이름)
데이터 타입 캐스팅
명백히 잘못된 데이터 필터링

소스와 1:1 관계를 유지합니다 (동일한 행, 유사한 컬럼).

-- models/staging/stg_green_tripdata.sql
{{ config(materialized='view') }}

with tripdata as (
    select *
    from {{ source('staging', 'green_tripdata') }}
    where vendorid is not null  -- filter bad data
)

select
    cast(vendorid as integer)               as vendor_id,
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    cast(pulocationid as integer)           as pickup_location_id,
    cast(dolocationid as integer)           as dropoff_location_id,
    cast(passenger_count as integer)        as passenger_count,
    cast(trip_distance as numeric)          as trip_distance,
    cast(fare_amount as numeric)            as fare_amount,
    cast(total_amount as numeric)           as total_amount
from tripdata;

2. 중간 레이어 (`intermediate/`)

목적: 복잡한 변환, 조인, 비즈니스 로직.

여기서 일어나는 일

여러 스테이징 모델 결합
비즈니스 규칙 적용
대규모 데이터 조작

엔드 유저에게는 노출되지 않음.

-- models/intermediate/int_trips_unioned.sql
with green_trips as (
    select *, 'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select *, 'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips;

3. 마트 레이어 (`marts/`)

목적: 최종 비즈니스용 테이블, 엔드 유저용.

(여기서 마트 모델을 계속 구축하세요.)

여기서 일어나는 일:

최종 사실 및 차원 테이블
대시보드와 보고서를 위한 준비

이들만 BI 도구에 노출되어야 합니다!

-- models/marts/fct_trips.sql
{{ config(materialized='table') }}

SELECT
    t.trip_id,
    t.service_type,
    t.pickup_datetime,
    t.dropoff_datetime,
    t.pickup_location_id,
    t.dropoff_location_id,
    z_pickup.zone   AS pickup_zone,
    z_dropoff.zone  AS dropoff_zone,
    t.passenger_count,
    t.trip_distance,
    t.fare_amount,
    t.total_amount
FROM {{ ref('int_trips_unioned') }} t
LEFT JOIN {{ ref('dim_zones') }} z_pickup
    ON t.pickup_location_id = z_pickup.location_id
LEFT JOIN {{ ref('dim_zones') }} z_dropoff
    ON t.dropoff_location_id = z_dropoff.location_id;

Source: …

Sources와 `source()` 함수 📥

Sources란?

Sources는 dbt에게 원시 데이터가 데이터 웨어하우스의 어디에 있는지 알려줍니다. YAML 파일에 정의합니다:

# models/staging/sources.yml
version: 2

sources:
  - name: staging           # 논리적 이름 (사용자가 지정)
    database: my_project    # GCP 프로젝트 또는 데이터베이스
    schema: nytaxi          # BigQuery 데이터셋 또는 스키마
    tables:
      - name: green_tripdata
      - name: yellow_tripdata

`source()` 함수 사용하기

테이블 이름을 하드코딩하는 대신 source()를 사용합니다:

-- ❌ 나쁨 – 하드코딩된 경로
SELECT * FROM my_project.nytaxi.green_tripdata;

-- ✅ 좋음 – source() 사용
SELECT * FROM {{ source('staging', 'green_tripdata') }};

장점

데이터베이스/스키마를 한 곳(YAML 파일)에서 변경 가능
dbt가 의존성을 자동으로 추적
Sources에 신선도 테스트를 추가할 수 있음

`ref()` 함수 – 의존성 구축 🔗

이것은 가장 중요한 dbt 함수입니다!

Function	Use When	Example
`source()`	원시/외부 데이터 읽기	`{{ source('staging', 'green_tripdata') }}`
`ref()`	다른 dbt 모델 읽기	`{{ ref('stg_green_tripdata') }}`

`ref()` 작동 방식

-- models/marts/fct_trips.sql
SELECT *
FROM {{ ref('int_trips_unioned') }}  -- References the int_trips_unioned model

ref()가 하는 일

✅ 올바른 스키마/테이블 이름으로 해결
✅ 의존성 그래프를 자동으로 구축
✅ 모델이 올바른 순서대로 실행되도록 보장

DAG (Directed Acyclic Graph)

dbt는 ref() 호출을 기반으로 의존성 그래프를 구축합니다:

┌──────────────────┐     ┌──────────────────┐
│ stg_green_trips  │     │ stg_yellow_trips │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │ int_trips_unioned│
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │    fct_trips     │
         └──────────────────┘

dbt build를 실행하면 모델이 의존성 순서대로 자동으로 실행됩니다!

Source: …

Seeds – CSV 파일 로드 🌱

Seeds를 사용하면 작은 CSV 파일을 데이터 웨어하우스에 테이블 형태로 로드할 수 있습니다.

Seeds를 사용해야 할 때

적합한 사용 사례

조회 테이블 (예: 구역 이름, 국가 코드)
정적 매핑 (벤더 ID → 벤더 이름)
거의 변경되지 않는 작은 참조 데이터

부적합한 경우

대용량 데이터셋 (정식 데이터 로딩 사용)
자주 변경되는 데이터

Seeds 사용 방법

CSV 파일을 seeds/ 폴더에 넣으세요:

seeds/
└── taxi_zone_lookup.csv

예시 CSV (taxi_zone_lookup.csv):

locationid,borough,zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
...

seed 명령을 실행합니다:

dbt seed

모델에서 ref()를 사용해 seed를 참조합니다:

-- models/marts/dim_zones.sql
SELECT
    locationid   AS location_id,
    borough,
    zone,
    service_zone
FROM {{ ref('taxi_zone_lookup') }};

파트 2: dbt 프로젝트 구조 및 모델 구축 📁

왜 모델 데이터를 사용하나요? 📐

사실 테이블 vs. 차원 테이블

사실 테이블 (`fct_`)

차원 테이블 (`dim_`)

스타 스키마 ⭐

왜 강력한가

dbt 프로젝트 구조

`dbt_project.yml` 파일

세 가지 모델 레이어

1. 스테이징 레이어 (`staging/`)

2. 중간 레이어 (`intermediate/`)

3. 마트 레이어 (`marts/`)

Sources와 `source()` 함수 📥

Sources란?

`source()` 함수 사용하기

`ref()` 함수 – 의존성 구축 🔗

`ref()` 작동 방식

DAG (Directed Acyclic Graph)

Seeds – CSV 파일 로드 🌱

Seeds를 사용해야 할 때

Seeds 사용 방법

관련 글

파트 3: Testing, Documentation & Deployment 🚀

Star Schema vs. Snowflake Schema: 각각 언제 사용해야 할까

Power BI에서 스키마 및 데이터 모델링

쿼리 작성 방식을 바꾸는 두 가지 ClickHouse 내부 구조

왜 모델 데이터를 사용하나요? 📐

사실 테이블 vs. 차원 테이블

사실 테이블 (fct_)

차원 테이블 (dim_)

스타 스키마 ⭐

왜 강력한가

dbt 프로젝트 구조

dbt_project.yml 파일

세 가지 모델 레이어

1. 스테이징 레이어 (staging/)

2. 중간 레이어 (intermediate/)

3. 마트 레이어 (marts/)

Sources와 source() 함수 📥

Sources란?

source() 함수 사용하기

ref() 함수 – 의존성 구축 🔗

ref() 작동 방식

DAG (Directed Acyclic Graph)

Seeds – CSV 파일 로드 🌱

Seeds를 사용해야 할 때

Seeds 사용 방법

관련 글

파트 3: Testing, Documentation & Deployment 🚀

Star Schema vs. Snowflake Schema: 각각 언제 사용해야 할까

Power BI에서 스키마 및 데이터 모델링

쿼리 작성 방식을 바꾸는 두 가지 ClickHouse 내부 구조

사실 테이블 (`fct_`)

차원 테이블 (`dim_`)

`dbt_project.yml` 파일

1. 스테이징 레이어 (`staging/`)

2. 중간 레이어 (`intermediate/`)

3. 마트 레이어 (`marts/`)

Sources와 `source()` 함수 📥

`source()` 함수 사용하기

`ref()` 함수 – 의존성 구축 🔗

`ref()` 작동 방식