Building an Incremental Zoho Desk to BigQuery Pipeline: Lessons from the Trenches

Published: February 28, 2026 at 05:35 PM EST
8 min read
Source: Dev.to

When my company decided to centralise customer‑support analytics, the task landed on my plate: pull data from Zoho Desk, land it in BigQuery, transform with dbt, done. The plan looked clean on paper.

What followed was a master‑class in why production data engineering is never as simple as the happy path suggests.

This is the story of building that pipeline—the architecture decisions, the walls I hit, and the lessons I’ll carry into every pipeline I build from here.

The Starting Point: A Full Load That Took Forever

The first working version of the pipeline was blunt but functional. Every day it:

  1. Pulled every ticket ever created from the Zoho Desk API.
  2. Overwrote the BigQuery table.

Simple. Predictable. And for tickets, completely untenable at scale because the table had over a million rows and was growing daily.

I needed incremental loads. But before I could get there, I had a different problem to solve: how do I get all the historical data into BigQuery in the first place without running an API job for days?

The Bootstrapping Problem: When the API Is Too Slow for the Initial Load

Loading years of historical data through a paginated API isn’t a pipeline problem—it’s a waiting problem. For a table with hundreds of thousands of rows, even a well‑optimised API pull can take hours or days just for the seed load.

Solution: Sidestep the API entirely for the initial load. Zoho Desk has a built‑in data backup feature that exports your entire account data as CSV files. I used this to export a full snapshot of tickets, threads, contacts, and calls, then loaded each CSV directly into BigQuery via the BQ Console UI.

UI Load Process

| Setting | Value |
| --- | --- |
| Format | CSV |
| Schema | Defined manually (not auto‑detect — more on why this matters later) |
| Skip leading rows | 1 (header row) |
| Allow quoted newlines | Yes (critical for fields like ticket descriptions that contain line breaks) |
| Allow jagged rows | Yes (API responses sometimes omit optional fields) |
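The same load can also be scripted instead of clicked through. A sketch using the bq CLI, mirroring the UI settings above (the project, dataset, bucket, and schema-file names here are hypothetical):

```shell
# Hypothetical names throughout; flags mirror the UI settings above.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  my_project:zoho_desk.tickets \
  gs://my-bucket/zoho_backup/tickets.csv \
  ./tickets_schema.json
```

Passing a local schema JSON file as the last argument keeps the schema explicit rather than auto-detected.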

Once the historical snapshot was in BigQuery, I set the incremental pipeline’s start_date to the backup date. The first scheduled run picks up any changes from that day forward—no gap, no overlap.

Lesson: For large initial loads, don’t fight the API. Use native export features if they exist. The pipeline is for keeping data fresh; getting the history in is a one‑time bootstrapping problem that deserves its own solution.

The Architecture: Code Generation Over Copy‑Paste

Instead of writing a separate Airflow DAG for every Zoho Desk endpoint, I built a code‑generation system:

  • Custom Airflow operator (ZohoDeskToGCSOperator) – handles all API extraction logic (pagination, OAuth, concurrent detail fetching, incremental search).
  • Jinja template – defines the DAG structure once.
  • YAML config files – one per endpoint; each defines schedule, columns, schema, endpoint type.
  • Generator script – renders YAML + template to a DAG Python file.
The resulting data flow (Mermaid source):

flowchart TD
    A[Zoho Desk API] --> B[ZohoDeskToGCSOperator]
    B --> C[GCS (staging CSV)]
    C --> D[GCSToBigQueryOperator]
    D --> E[BigQuery (_staging table)]
    E --> F[BigQueryInsertJobOperator (MERGE into main table)]

  • For large transactional tables (tickets, contacts, threads, calls), data lands in a _staging table first, then gets merged into the main table (updating existing rows and inserting new ones).
  • For small reference tables (agents, teams, departments), a daily WRITE_TRUNCATE is sufficient.
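For illustration, a per-endpoint config might look like this (a sketch with hypothetical field names, not the exact keys my generator uses):

```yaml
# dags/configs/tickets.yaml (hypothetical field names)
endpoint: tickets
endpoint_type: incremental   # supports modifiedTimeRange
schedule: "0 3 * * *"
load_strategy: merge         # land in _staging, then MERGE on the key
merge_key: id
columns:
  - {name: id, type: STRING}
  - {name: subject, type: STRING}
  - {name: modifiedTime, type: TIMESTAMP}
```

The generator script reads each such file and renders the Jinja template into a complete DAG module, so adding an endpoint means adding one YAML file rather than copying a DAG.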

Challenge 1: Not All APIs Are Created Equal

  • The ticket and contact endpoints support Zoho’s modifiedTimeRange parameter, allowing incremental loads by specifying a start and end timestamp.
  • The /calls endpoint does not. Supplying modifiedTimeRange returns a 422 error.

Work‑around: Sort by createdTime descending and stop paginating as soon as the oldest record on the current page predates the window. Since calls are append‑only in practice, this is equivalent.

for rec in records:
    # Records arrive sorted by createdTime descending, so the first one
    # older than the window start means every remaining record is too old.
    if rec["createdTime"] < data_interval_start:
        done = True
        break
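Extended into a full loop, the stop-early logic might look like this (a sketch: fetch_page is a hypothetical helper returning one page of records sorted by createdTime descending, and timestamps are assumed to be ISO‑8601 strings):

```python
from datetime import datetime, timezone

def fetch_recent_calls(fetch_page, data_interval_start):
    """Paginate /calls sorted by createdTime descending; stop as soon as a
    record predates the window start. fetch_page(page) is a hypothetical
    helper returning a list of dicts (empty when pages run out)."""
    results = []
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        for rec in records:
            created = datetime.fromisoformat(rec["createdTime"])
            if created < data_interval_start:
                # Everything after this record is older still; stop here.
                return results
            results.append(rec)
        page += 1
    return results
```

Because the sort order is guaranteed, a single early return replaces a full scan of the history.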

Lesson: Don’t assume API feature parity across endpoints from the same vendor. Test every endpoint independently before writing a single line of pipeline code.

Challenge 2: A Reserved Word Hiding in Your Column Names

The threads table has a column called to (the recipient of a thread). Unfortunately, TO is a reserved keyword in BigQuery SQL.

The generated MERGE statement originally looked like:

INSERT (`from`, to, `subject`, ...)
VALUES (S.`from`, S.to, S.`subject`, ...)

Because to was unquoted, BigQuery’s parser threw:

Syntax error: Unexpected keyword TO at [40:130]

Fix: Back‑tick‑quote the column name (and, in fact, quote all identifiers) in the generated MERGE SQL.

Lesson: When generating SQL programmatically, always quote all identifiers. You won’t always know which column names will collide with reserved words.
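A small helper makes the quoting systematic (a sketch, following BigQuery's quoted-identifier syntax, where backslash and backtick are escaped with a backslash):

```python
def quote_ident(name: str) -> str:
    """Backtick-quote a BigQuery identifier, escaping backslashes and
    backticks so any column name is safe in generated SQL."""
    return "`" + name.replace("\\", "\\\\").replace("`", "\\`") + "`"

def insert_clause(columns):
    """Render the INSERT/VALUES portion of a generated MERGE statement,
    quoting every identifier unconditionally."""
    cols = ", ".join(quote_ident(c) for c in columns)
    vals = ", ".join("S." + quote_ident(c) for c in columns)
    return f"INSERT ({cols})\nVALUES ({vals})"
```

Quoting everything unconditionally is cheaper than maintaining a reserved-word list that will inevitably drift out of date.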

TL;DR Takeaways

  1. Use native data‑export features for the initial bulk load; treat it as a separate bootstrapping step.
  2. Build a code‑generation pipeline (YAML + Jinja + custom operator) to avoid copy‑paste and keep the architecture DRY.
  3. Test each API endpoint individually—features like modifiedTimeRange are not guaranteed across the board.
  4. Always quote identifiers in generated SQL to avoid hidden reserved‑word conflicts.
  5. For large tables, stage in a temporary table and MERGE; for small reference tables, WRITE_TRUNCATE is fine.

Challenge 3: The MERGE That Couldn’t Match Rows

After fixing the syntax error, the threads MERGE hit a different wall:

UPDATE/MERGE must match at most one source row for each target row

The threads endpoint works like this:

  1. Search for tickets modified in the time window.
  2. Fetch all threads for each of those tickets.

Zoho’s modifiedTimeRange search is paginated, and the same ticket can appear on multiple pages if the result set shifts between requests. When that happens, threads are fetched for the same ticket twice, causing duplicate thread IDs in the staging table. BigQuery’s MERGE refuses to update a target row when multiple source rows match it.

Fixes

In the operator (Python) – deduplicate ticket IDs before fetching threads:

ticket_ids = list(dict.fromkeys(ticket_ids))

dict.fromkeys removes duplicates while preserving insertion order, something a round trip through a set doesn’t guarantee.

In the MERGE SQL (template) – add a deduplication guard in the USING clause:

USING (
    SELECT *
    FROM `staging_table`
    QUALIFY ROW_NUMBER() OVER (PARTITION BY `id`) = 1
) S

The Python fix prevents the problem from occurring; the SQL guard is a safety net for edge cases you haven’t anticipated.

Lesson: For MERGE pipelines, always add a QUALIFY ROW_NUMBER() dedup guard in the staging select. Even if your source looks clean, it defends against unexpected duplicates.

Challenge 4: Auto‑Detect Might Tell Lies

When I first loaded the initial backup CSVs into BigQuery, I let auto‑detect infer the schema rather than defining it explicitly; the manually defined schema in the load settings above is the result of the lesson that followed. Auto‑detect was fast and convenient, but it caused failures.

Auto‑detect made choices that didn’t match what the incremental pipeline expected:

| Column | Expected Type | Auto‑detect Result |
| --- | --- | --- |
| onholdTime | TIMESTAMP | STRING |
| tagCount | STRING | INT64 (correct) |
| isEscalated | BOOLEAN | STRING (values "true"/"false") |

The first MERGE tried to assign a TIMESTAMP value to a STRING column, and BigQuery’s type enforcement at MERGE time is strict.

Fix

Query INFORMATION_SCHEMA.COLUMNS on the main table and reconcile every column type against the YAML schema:

SELECT column_name, data_type
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'zoho_desk_tickets'
ORDER BY ordinal_position;
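Once both sides are in hand, the reconciliation itself is easy to automate (a pure-Python sketch; actual and expected are column-to-type dicts built from the query results and the YAML config):

```python
def schema_mismatches(actual: dict, expected: dict) -> list:
    """Return (column, expected_type, actual_type) for every column whose
    BigQuery type disagrees with the YAML schema; a missing column shows
    up with an actual type of None."""
    return [
        (col, exp, actual.get(col))
        for col, exp in expected.items()
        if actual.get(col) != exp
    ]
```

Running this check right after any manual bootstrap load surfaces type drift before the first MERGE fails.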

Lesson

The initial load is the source of truth—not your assumptions about what the types should be. Always verify INFORMATION_SCHEMA before writing a schema config for any table that was bootstrapped manually.

The dbt Layer: Fixing What the Raw Layer Can’t

With the raw pipeline stable, dbt handles the transformation into clean, analyst‑ready tables. This keeps the raw layer as a faithful copy of the source, while the transformation layer handles type normalisation.
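As an illustration, a staging model that normalises the types from Challenge 4 might look like this (a sketch; the model and source names are hypothetical):

```sql
-- models/staging/stg_zoho_desk__tickets.sql (hypothetical model name)
select
    id,
    safe_cast(onholdTime as timestamp) as onhold_time,
    safe_cast(isEscalated as bool) as is_escalated
from {{ source('zoho_desk', 'tickets') }}
```

safe_cast returns NULL instead of erroring on bad values, which keeps a single malformed row from failing the whole model.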

What I’d Do Differently

  • Use native export for the initial load – don’t fight the API for historical data. Export a full backup, load via the UI, then let the pipeline handle increments from that point.
  • Never use auto‑detect for a table that an incremental pipeline will MERGE into – define the schema explicitly, and verify with INFORMATION_SCHEMA immediately after loading.
  • Add the QUALIFY ROW_NUMBER() dedup guard from day one – it costs nothing and saves you from mysterious MERGE failures later.
  • Test every API endpoint’s query‑parameter support independently – don’t inherit assumptions from one endpoint to another.
  • Backtick‑quote all identifiers in generated SQL – reserved‑word collisions are unpredictable and the fix is trivial.
  • Keep the raw and transformation layers separate – landing raw data in BigQuery with minimal transformation, then using a separate dbt layer for typing and renaming, makes debugging far easier. You can always re‑run dbt without re‑hitting the API.

Wrapping Up

None of the challenges described here were exotic: reserved words, duplicate rows, type mismatches, API inconsistencies, etc., are bread‑and‑butter data‑engineering problems. What made them feel hard was hitting them one at a time in production, under pressure, on a pipeline that analysts were waiting on.

The pipeline is now running reliably in production: tickets, contacts, threads, agents, teams, departments, and accounts are incrementally loaded daily.

If you’re building something similar — whether with Zoho, Salesforce, HubSpot, or any SaaS API — I hope these lessons save you a few hours of head‑scratching.

The pipeline is built with Apache Airflow, Google Cloud Storage, BigQuery, and dbt. The custom operator pattern and code‑generation approach described here are applicable to any REST API integration.
