Two ClickHouse Internals That Change How You Write Queries

Published: February 17, 2026
5 min read
Source: Dev.to

Most ClickHouse mistakes don’t come from SQL syntax.

They come from using the wrong mental model.

ClickHouse looks like a familiar analytical SQL database, but internally it behaves very differently from traditional row‑based systems. If you approach it with PostgreSQL or MySQL instincts, you’ll eventually hit confusing behavior: wrong results, strange duplicates, unexpected merges, or queries that don’t scale the way you expected.

Two internals in particular completely changed how I design tables and write queries:

  • AggregatingMergeTree stores aggregation states, not final values
  • argMax solves group‑wise max problems the way ClickHouse expects you to

These aren’t tricks. They’re core to how the engine works.

1. AggregatingMergeTree Is About States, Not Values

One of the most misunderstood engines in ClickHouse is AggregatingMergeTree.

At first glance, it looks like a way to store pre‑aggregated results. So it’s tempting to do something like this:

CREATE TABLE daily_metrics
(
    day Date,
    avg_delay Float32
)
ENGINE = AggregatingMergeTree()
ORDER BY day;

Then insert data like:

INSERT INTO daily_metrics
SELECT
    toDate(event_time) AS day,
    avg(arrival_delay) AS avg_delay
FROM flights
GROUP BY day;

And later query it with:

SELECT day, avg_delay FROM daily_metrics;

This looks reasonable. But conceptually, it’s wrong.

Why This Fails

AggregatingMergeTree does not store final aggregated values. It stores aggregation states – intermediate, mergeable representations of computations.

ClickHouse merges parts in the background. As new data arrives:

  • Rows are appended to new parts
  • Background merges combine parts asynchronously
  • Aggregation states are merged during these part merges

If you store plain values like avg() or sum(), ClickHouse has nothing mergeable to work with. When background merges collapse rows with the same sorting key, an ordinary numeric column just keeps one of the values – the rest are silently discarded. You’ve effectively disabled the engine’s intended behavior.

The Correct Mental Model

Think in states, not values.

  • Instead of storing avg(), store avgState().
  • Instead of storing sum(), store sumState().
  • Instead of storing uniq(), store uniqState().

Define your table like this:

CREATE TABLE daily_metrics
(
    day Date,
    avg_delay AggregateFunction(avg, Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY day;

Insert data like this:

INSERT INTO daily_metrics
SELECT
    toDate(event_time) AS day,
    avgState(arrival_delay) AS avg_delay
FROM flights
GROUP BY day;

Now you are storing mergeable states.

When querying, finalize them:

SELECT
    day,
    avgMerge(avg_delay) AS avg_delay
FROM daily_metrics
GROUP BY day;

Why GROUP BY Is Still Required

Another common confusion is:

“If I already aggregated before inserting, why do I still need GROUP BY when querying?”

Because merges are asynchronous. You may have multiple parts containing states for the same day. The final aggregation only happens when you explicitly merge those states using functions like avgMerge(). GROUP BY ensures that all states for the same key are combined correctly at query time.
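To make this concrete, here is a sketch of the situation (assuming the flights and daily_metrics tables defined above): two separate inserts that cover the same day land in two different parts, each holding its own state for that day.

```sql
-- Two inserts create two parts; each part holds its own avg state
-- for the same day until a background merge happens to combine them.
INSERT INTO daily_metrics
SELECT toDate(event_time) AS day, avgState(arrival_delay) AS avg_delay
FROM flights WHERE flight_number < 1000
GROUP BY day;

INSERT INTO daily_metrics
SELECT toDate(event_time) AS day, avgState(arrival_delay) AS avg_delay
FROM flights WHERE flight_number >= 1000
GROUP BY day;

-- GROUP BY + avgMerge combines all states for a day, regardless of
-- how many unmerged parts they currently live in.
SELECT day, avgMerge(avg_delay) AS avg_delay
FROM daily_metrics
GROUP BY day;
```

Without the final GROUP BY, you would see one row per unmerged part rather than one row per day.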

Once you understand that AggregatingMergeTree is designed for incremental, mergeable aggregation, everything becomes predictable:

  • Materialized views make sense
  • Backfills behave correctly
  • Background merges are no longer mysterious
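This is exactly the pattern materialized views rely on. A sketch, assuming the flights and daily_metrics tables above: every insert into flights is aggregated into states and appended to daily_metrics automatically.

```sql
-- On each insert into flights, this view computes avg states
-- for the new rows and appends them to daily_metrics.
CREATE MATERIALIZED VIEW daily_metrics_mv
TO daily_metrics
AS SELECT
    toDate(event_time) AS day,
    avgState(arrival_delay) AS avg_delay
FROM flights
GROUP BY day;
```

The view only sees newly inserted blocks, which is why storing states (not final values) is what makes incremental aggregation safe.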

The engine isn’t broken. The mental model was.

2. Solving Group‑Wise Max Correctly with argMax

Another common analytical requirement looks simple on the surface:

“For each group, return the row with the maximum value.”

For example: “For each airline, return the flight with the worst arrival delay.”

A naive approach might look like this:

SELECT
    airline,
    flight_number,
    max(arrival_delay)
FROM flights
GROUP BY airline, flight_number;

This computes the maximum delay per (airline, flight_number) combination – not the worst flight per airline. It also produces multiple rows per airline, which means extra filtering or subqueries later.

The ClickHouse‑Native Solution

SELECT
    airline,
    argMax(flight_number, arrival_delay) AS flight_number,
    max(arrival_delay) AS max_delay
FROM flights
GROUP BY airline;

argMax(value, weight) means: “Return the value associated with the maximum weight.”

In this case:

  • Track the maximum arrival_delay per airline
  • Return the flight_number corresponding to that maximum
  • Produce exactly one row per airline

No subqueries, no joins, no row explosion.
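argMax also accepts a tuple when you need several columns from the winning row. A sketch, assuming flights additionally has hypothetical origin and dest columns:

```sql
-- Return several attributes of the worst flight per airline in one pass.
SELECT
    airline,
    argMax((flight_number, origin, dest), arrival_delay) AS worst_flight,
    max(arrival_delay) AS max_delay
FROM flights
GROUP BY airline;
```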

Why This Matters

argMax aligns with how ClickHouse performs aggregation internally. It combines value selection and aggregation into a single pass. Instead of thinking in terms of relational rewrites, you’re using an aggregate function that expresses the intent directly. This is both faster and clearer.

Once you start using argMax, many complex “best per group” queries collapse into a single SELECT.

The Bigger Pattern

Both of these examples point to the same lesson:

ClickHouse is not just “Postgres, but faster.”
It is designed around:

  • Immutable parts
  • Background merges
  • Mergeable aggregation states
  • Columnar execution
  • Specialized aggregate functions

If you treat it like a traditional row‑based database, you’ll keep fighting it. If you align your thinking with how it stores and merges data, entire classes of bugs and inefficiencies disappear.

Thoughts

Understanding AggregatingMergeTree as a state‑based engine—and using functions like argMax for group‑wise logic—changes how you model data in ClickHouse.

These aren’t edge cases. They show up in real production systems:

  • Pre‑aggregated materialized views
  • Streaming ingestion pipelines
  • Analytical dashboards
  • Per‑tenant metrics
  • Best/worst reporting

ClickHouse rewards engineers who understand its internals.

Once you stop thinking in values and start thinking in states—and once you use the aggregate functions the engine was built around—query design becomes simpler and more predictable.

And performance follows naturally.
