A Beginner's Guide to Apache Airflow 3
Source: Dev.to
What Is Data Orchestration?
In DataOps (Data Operations), orchestration is the underlying system that manages data workflows (such as ETL pipelines) to ensure tasks run at the right time and in the correct sequence.
What Is a DAG?
A DAG (Directed Acyclic Graph) is a model that contains all the tasks to be run.
| Term | Meaning |
|---|---|
| Directed | Tasks have a specific direction (e.g., extraction → transformation). |
| Acyclic | No circular dependencies – extraction cannot depend on transformation if transformation depends on extraction. |
| Graph | A collection of tasks (nodes) connected by dependencies (edges). |
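To make these three terms concrete, here is a minimal, illustrative sketch of a three-task chain (the DAG id, task ids, and commands are made up; the BashOperator import mirrors the examples later in this guide):

```python
from airflow import DAG
from airflow.providers.standard.operators.bash import BashOperator
from datetime import datetime

with DAG(dag_id="dag_shape_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    extract = BashOperator(task_id="extract", bash_command='echo "extracting"')
    transform = BashOperator(task_id="transform", bash_command='echo "transforming"')
    load = BashOperator(task_id="load", bash_command='echo "loading"')

    # Directed: extract runs before transform, which runs before load.
    # Acyclic: adding load >> extract would create a cycle, and Airflow would reject the DAG.
    extract >> transform >> load
```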
What Is a Task?
A task is a single unit of work inside a DAG.
Analogy: Think of the DAG as an orchestra conductor and the tasks as the instruments.
Core Airflow Components
| Component | Description |
|---|---|
| Scheduler | Submits tasks to the executor and triggers scheduled workflows. |
| DAG Processor | Reads DAG files and organizes them in the metadata database. |
| Webserver | The Airflow UI for inspecting, triggering, and debugging DAGs and tasks. |
| DAG Folder | A dedicated folder of DAG files that the scheduler reads to decide which tasks to run and when. |
| Metadata Database | Stores the state of tasks, DAGs, and variables. |
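As a quick illustration of how these pieces fit together, the DAG folder is simply a configurable path (the `dags_folder` setting in the `[core]` section of `airflow.cfg`); on a working installation you can ask Airflow where it is:

```bash
airflow config get-value core dags_folder
# e.g. /home/airflow/dags  (illustrative output; the default is <AIRFLOW_HOME>/dags)
```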
Why Not Just Use Cron Jobs?
Cron jobs are like an alarm clock – they run a script at a set time with no regard for dependencies.
Example:
- extract.py – scheduled for 12:00 AM
- transform.py – scheduled for 1:30 AM
If extraction takes 40 minutes, cron will still trigger the transformation at 1:30 AM, potentially causing corrupted data or a crash.
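In crontab syntax, that schedule is nothing more than two unrelated entries (the script paths are illustrative); nothing ties the second line to the outcome of the first:

```
# m  h  dom mon dow  command
0   0  *   *   *    python /path/to/extract.py     # 12:00 AM
30  1  *   *   *    python /path/to/transform.py   # 1:30 AM, whether or not extraction finished
```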
Airflow, on the other hand, acts like a project manager: it respects task dependencies and only runs downstream tasks when upstream tasks succeed.
Example: Defining a Simple DAG (Python API)
from airflow import DAG
from airflow.providers.standard.operators.python import PythonOperator
from airflow.providers.standard.operators.bash import BashOperator
from datetime import datetime, timedelta
# Step 1: Define your Python function
def my_function():
    # Your logic here
    pass

# Step 2: Set default arguments
default_args = {
    'owner': 'your_name',
    'depends_on_past': False,          # don't wait for previous DAG runs
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'retries': 1,                      # retry once if it fails
    'retry_delay': timedelta(minutes=5)
}

# Step 3: Create DAG object
with DAG(
    dag_id='template_dag',             # unique DAG identifier
    default_args=default_args,
    description='Template for new DAGs',
    schedule='@daily',                 # frequency of execution (Airflow 3 uses schedule, not schedule_interval)
    catchup=False,                     # don't run for past dates
    max_active_runs=1                  # run one instance at a time
) as dag:

    # Step 4: Define tasks
    task1 = PythonOperator(
        task_id='python_task',
        python_callable=my_function
    )

    task2 = BashOperator(
        task_id='bash_task',
        bash_command='echo "Hello World"'
    )

    # Step 5: Set dependencies
    task1 >> task2
These instructions are interpreted by the orchestration engine, which runs the tasks in dependency order on the available resources. This is what data engineers call Workflow‑as‑Code.
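Once this file sits in the DAG folder, you can also exercise a single run of it locally from the CLI before handing it over to the scheduler (the date below is just an example logical date):

```bash
airflow dags test template_dag 2024-01-01
```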
Example: TaskFlow API (Decorator‑Based)
import json
from airflow.decorators import dag, task
from pendulum import datetime
# 1. Define the DAG using the @dag decorator
@dag(
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    tags=["example", "taskflow"],
)
def taskflow_etl_pipeline():

    # 2. Extract: returns a dictionary
    @task()
    def extract():
        data_string = '{"1001": 30.5, "1002": 28.2, "1003": 31.1}'
        return json.loads(data_string)

    # 3. Transform: receives data from the upstream task
    @task()
    def transform(raw_data: dict):
        total_value = sum(raw_data.values())
        return {"total": total_value, "count": len(raw_data)}

    # 4. Load: final task to "load" or print the data
    @task()
    def load(processed_data: dict):
        print(f"Loading data: Total value is {processed_data['total']}")

    # 5. Define dependencies by calling the functions
    raw_data = extract()
    summary = transform(raw_data)
    load(summary)

# Instantiate the DAG
taskflow_etl_pipeline()
How to Verify Your DAG Is Running
- CLI:

  airflow dags list                 # shows parsed DAGs
  airflow dags list-import-errors   # shows syntax errors, if any

- Web UI: Visit http://localhost:8080 to inspect DAG status, logs, and trigger runs manually (a CLI alternative for triggering runs is sketched below).
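If you prefer the terminal, the same manual trigger is available from the CLI (shown here with the template DAG id from the earlier example; exact flags can vary slightly between Airflow versions):

```bash
airflow dags trigger template_dag        # queue a new run of the DAG
airflow dags list-runs -d template_dag   # inspect the state of recent runs
```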
Best Practices for Scalable Workflows
- Idempotency – A task should produce the same outcome when run multiple times for the same execution date.
- Atomicity – Each task should perform one well‑defined operation. If the transformation phase fails, you only need to retry that specific task instead of re‑fetching all raw data. (Figure: left – a monolithic task; right – modular tasks.)
- Encapsulation – Define the DAG structure at the top level only. Heavy data processing, API calls, or DB queries should not be placed in the global scope of the file, because the DAG processor executes that code every time it parses the file, potentially crashing the Airflow instance. See the sketch after this list.
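As a minimal sketch of the encapsulation rule (the DAG id and function names are made up for illustration), compare work done at module level with work done inside a task:

```python
from airflow.decorators import dag, task
from pendulum import datetime

# BAD (illustrative): anything at module level runs on every parse of this file.
# rows = run_expensive_query()   # <- would hit the database each time the file is parsed

@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def encapsulated_pipeline():
    @task()
    def fetch_rows():
        # GOOD: heavy processing, API calls, or DB queries live inside the task,
        # so they only run when the task itself is executed.
        return [1, 2, 3]  # placeholder for a real query result

    fetch_rows()

encapsulated_pipeline()
```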
Takeaway
Apache Airflow may seem intimidating at first, but at its core it is simply a tool designed to bring order to chaos. By embracing orchestration, you transform isolated, manually‑run scripts into reliable, automated data pipelines.
Happy orchestrating!
Data Orchestration
- Data Orchestration is essential to data pipelines; it ensures your data tasks run in the right sequence and at the right time.
DAGs
- DAGs are the blueprint; they provide a map of your tasks and dependencies, ensuring no task runs out of order.
Airflow
- Airflow does the heavy lifting by handling the logistics of executing and monitoring your tasks so you can focus on the logic.
Workflow as Code
- Whether you use traditional operators or the modern, Pythonic TaskFlow API, you have the flexibility to define complex pipelines.