ClickHouse Performance Tuning: OLAP Best Practices for High-Throughput Analytics
Concrete ClickHouse tuning patterns and query optimizations to get predictable OLAP performance — practical examples, benchmarks and 2026 trends.
Why your analytics queries are unpredictable — and how to stop guessing
You built an OLAP pipeline on ClickHouse and it runs fast… sometimes. During spikes, complex GROUP BYs or JOINs suddenly blow up latency or trigger OOM kills. Your dashboards show noisy 95th percentiles and SLAs you can’t promise. This article gives concrete, repeatable tuning patterns and query-level optimizations to make ClickHouse OLAP performance predictable at scale in 2026.
Quick context (2026): Why ClickHouse matters now
ClickHouse adoption exploded across modern analytics stacks in 2023–2025 and continued in 2026 as enterprises seek sub-second analytics at lower cost than cloud-only warehouses. Investment and ecosystem growth (notably a large funding round in late 2025) accelerated core features, cloud offerings and operational integrations.
That means teams can expect new functionality (improved projections, richer compression, more robust externalization-to-disk strategies) — but it also means operational complexity increases. To get stable production performance you need patterns, not guesswork.
What predictable OLAP performance looks like
- Stable latency percentiles across representative workloads (p50/p95/p99)
- Linear throughput scaling as nodes are added to a cluster
- Few operational surprises (no memory spikes, manageable merges, bounded disk use)
Core tuning patterns (rules of thumb)
- Design partitions for natural time windows and low cardinality.
- Choose ORDER BY (sorting key) to match most selective filters and join keys.
- Use PREWHERE aggressively to reduce I/O for common predicates.
- Apply projections or materialized aggregated tables for hot rollups.
- Set sensible limits (max_threads, max_memory_usage) and enable external grouping/joins to avoid OOM.
- Benchmark repeatedly using representative datasets; track system.profile_events and system.metrics.
Pattern 1 — Partitioning that reduces scan variance
Partitioning controls how much data the engine skips during query execution. In high-throughput OLAP, you want partitions that are large enough to reduce metadata overhead but small enough to prune effectively.
- Partition by month for long retention (6–24 months). Use day partitions for very large daily volumes.
- Avoid high-cardinality partitions (user_id, order_id). They create many tiny parts and hurt merge performance.
CREATE TABLE events (
event_date Date,
user_id UInt64,
event_type String,
payload JSON
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date) -- monthly partitions
ORDER BY (event_type, user_id, event_date);
Pattern 2 — ORDER BY (sorting key): the most impactful design choice
ClickHouse stores data sorted by the ORDER BY expression. This ordering enables range reads and efficient GROUP BY / join locality.
- Place the most selective and frequently filtered columns first.
- Include join keys and clustering columns used together in queries.
- Accept trade-offs: wider ORDER BY means better pruning but larger merges on insert.
Example: analytics queries often filter by event_type and then by user_id. Order by those columns.
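As a sketch of that advice (schema mirrors the Pattern 1 table; the LowCardinality wrapper is an optional refinement for low-cardinality string filters):

```sql
-- Sorting key matched to the dominant filter pattern: event_type first,
-- then user_id, then time. LowCardinality speeds up equality filters.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    payload    String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, user_id, event_date);

-- This query can read a narrow range via the primary index
-- instead of scanning whole partitions:
SELECT user_id, count()
FROM events
WHERE event_type = 'purchase' AND user_id IN (42, 43)
GROUP BY user_id;
```

If most queries filtered by time range first instead, the key would start with event_date; the point is to match the leading columns to real query shapes, not to copy this order verbatim.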
Pattern 3 — Index granularity and sparse indexes
MergeTree's primary index is sparse: it stores one mark per index_granularity rows rather than indexing every row. Tuning the granularity trades read IO against index size and metadata overhead.
- Smaller index_granularity (e.g., 4096) improves pruning and small-range queries at the cost of more index marks and slightly larger memory requirements.
- Larger index_granularity (e.g., 16384) reduces mark count and improves write throughput but increases scanned data for range queries.
CREATE TABLE events (
...
) ENGINE = MergeTree()
ORDER BY (event_type, user_id, event_date)
SETTINGS index_granularity = 4096;
Predictable query latency is about minimizing variance — smaller partitions, correct ordering, and aggressive predicate pushdown narrow the performance distribution.
Pattern 4 — Use PREWHERE and projections for I/O reduction
PREWHERE reads only the columns referenced in its condition, filters rows, and only then reads the remaining columns for the surviving rows; use it for highly selective predicates on low-cardinality columns.
SELECT event_type, count() FROM events
PREWHERE event_type = 'purchase'
WHERE event_date >= '2026-01-01'
GROUP BY event_type;
Projections (precomputed physical structures inside a table) or materialized views provide deterministic performance for recurring aggregations. In 2026, projections matured and are a recommended replacement for ad-hoc rollup tables in many cases.
ALTER TABLE events ADD PROJECTION monthly_users
(SELECT toYYYYMM(event_date) AS ym, user_id, count() AS cnt
GROUP BY ym, user_id);
-- Existing parts are not rewritten automatically; backfill them explicitly:
ALTER TABLE events MATERIALIZE PROJECTION monthly_users;
Pattern 5 — Joins and GROUP BY: avoid surprises
JOINs are frequent sources of spikes — control memory and spill-to-disk behavior.
- Prefer small lookup tables as dictionary objects when applicable (they’re memory-optimized and fast).
- For large joins, set externalization limits to allow disk-based hash tables rather than OOM:
SET max_bytes_before_external_group_by = 3000000000;
SET max_bytes_before_external_join = 3000000000;
Use JOIN hinting carefully: hash joins are fast for equality, but if memory is tight, allow block-wise/merge joins or break joins into staged subqueries.
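A minimal sketch of the dictionary approach for small lookup tables (all names here are hypothetical; it assumes a small countries source table and a country_id column on events):

```sql
-- Hypothetical in-memory dictionary backed by a small lookup table.
CREATE DICTIONARY country_dict
(
    country_id UInt64,
    name       String
)
PRIMARY KEY country_id
SOURCE(CLICKHOUSE(TABLE 'countries'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- dictGet replaces a JOIN against the lookup table with an O(1) hash lookup:
SELECT dictGet('country_dict', 'name', country_id) AS country, count()
FROM events
GROUP BY country;
```

Because the dictionary is held in memory and refreshed on its LIFETIME schedule, it avoids building a hash table per query, which is where JOIN memory spikes usually come from.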
Pattern 6 — Compression codecs: choose columns wisely
Per-column codecs reduce storage and IO. Use stronger compression for infrequently-read columns and CPU-friendly codecs for hot columns.
- Numeric telemetry columns: CODEC(ZSTD(1)) through CODEC(ZSTD(3)) often balances CPU/IO.
- High cardinality strings used in filters: CODEC(LZ4) for faster decompression.
- Binary blobs or JSON: CODEC(ZSTD(9)) if rarely queried.
CREATE TABLE events (
event_date Date,
user_id UInt64,
event_type String CODEC(LZ4),
payload String CODEC(ZSTD(5))
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, user_id);
Pattern 7 — Merges, parts, and operational hygiene
ClickHouse compacts small parts into larger parts in background merges. Too many small parts cause IO storms and slow queries. Keep part count low per partition.
- Monitor system.parts and aim for hundreds (not thousands) of parts per shard.
- Use OPTIMIZE TABLE ... FINAL for urgent consolidation during maintenance windows.
- Avoid frequent small inserts — batch them or use buffer tables / asynchronous inserts.
-- Consolidate a partition during low-traffic window
OPTIMIZE TABLE events PARTITION 202601 FINAL; -- toYYYYMM partition values are numeric
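To keep an eye on part counts before they become a problem, a query against system.parts such as the following works (the table name is the running example):

```sql
-- Active part count per partition; large numbers signal merge pressure.
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active AND table = 'events'
GROUP BY database, table, partition
ORDER BY parts DESC
LIMIT 20;
```

Alert on this rather than waiting for slow queries: part explosions show up here first.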
Pattern 8 — Resource limits and safe defaults
Hitting OOM kills queries and ruins SLAs. Configure global and per-user settings to bound resource consumption.
- Set per-user limits in users.xml or profiles: max_memory_usage, max_execution_time.
- Limit concurrency: max_concurrent_queries and max_threads tuned to CPU cores.
-- Example runtime overrides for a high concurrency analytics user
SET max_threads = 16;
SET max_memory_usage = 8000000000; -- 8 GB
SET max_execution_time = 60; -- seconds
Concrete query-level optimizations
1. Push down predicates and use PREWHERE
Move the most selective predicates into PREWHERE. It reduces the rows read and the number of columns loaded.
2. Select only needed columns
ClickHouse reads columns independently. Avoid SELECT * in production dashboards. Narrow column projection reduces IO and decompression cost.
3. Use LIMIT as an optimization when applicable
When combined with ORDER BY on the sorting key, LIMIT lets ClickHouse stop early and return results faster. For pagination over large sorts, consider using seek-based pagination (WHERE sorting_key > last_val LIMIT N) to avoid deep offsets.
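A sketch of seek-based pagination over the sorting key (the last-seen values in the WHERE tuple are placeholders carried over from the previous page):

```sql
-- Page 1: first N rows in sorting-key order; ClickHouse can stop early.
SELECT event_type, user_id, event_date
FROM events
ORDER BY event_type, user_id, event_date
LIMIT 100;

-- Next page: seek past the last row of the previous page instead of OFFSET,
-- so cost stays constant regardless of how deep the page is.
SELECT event_type, user_id, event_date
FROM events
WHERE (event_type, user_id, event_date) > ('purchase', 12345, '2026-01-15')
ORDER BY event_type, user_id, event_date
LIMIT 100;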
4. Prefer arrayJoin and functions carefully
arrayJoin can explode the number of rows. Use it only when needed and combined with strict predicate pruning.
5. Break complex queries into steps
For huge joins and aggregations, break the work into deterministic stages and persist intermediate results (temporary tables or MATERIALIZED VIEW). This reduces variance and makes retries predictable.
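One way to stage such a query, persisting the small filtered intermediate before the join (the users dimension table and segment column are illustrative):

```sql
-- Stage 1: persist the filtered, pre-aggregated fact slice.
CREATE TEMPORARY TABLE recent_purchases AS
SELECT user_id, count() AS purchases
FROM events
PREWHERE event_type = 'purchase'
WHERE event_date >= '2026-01-01'
GROUP BY user_id;

-- Stage 2: join the much smaller intermediate against the dimension table.
SELECT u.segment, sum(r.purchases) AS total_purchases
FROM recent_purchases AS r
INNER JOIN users AS u ON u.user_id = r.user_id
GROUP BY u.segment;
```

Each stage has a bounded, inspectable cost, so a failure or retry restarts from the intermediate rather than from the raw fact table.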
Benchmarking: a reproducible methodology
Optimizations without measurement are speculation. Use a disciplined benchmarking plan.
Benchmark plan (repeatable)
- Generate synthetic datasets that mirror cardinality, event rates and query shapes. Load them with clickhouse-client from CSV/Parquet, or adapt one of the public ClickHouse example datasets.
- Define representative query templates: point lookups, time-series rollups, large GROUP BY, high-cardinality joins.
- Use clickhouse-benchmark or a custom driver to run steady-state mixes (RPS, concurrency, and warm/cold cache phases).
- Collect metrics: system.query_log, system.metrics, iostat, and per-node CPU/memory. Track p50/p95/p99 latencies, throughput, and disk IO.
- Compare variants: partitioning, index_granularity, ORDER BY changes, codec changes, and projection vs materialized view.
Example benchmark script (skeleton)
# generate data (pseudo)
python gen_events.py --rows 100_000_000 --out events.parquet
clickhouse-client --query="INSERT INTO events FORMAT Parquet" < events.parquet
# benchmark template: 16 concurrent clients, 1000 iterations
clickhouse-benchmark --concurrency 16 --iterations 1000 \
  --query "SELECT event_type, count() FROM events PREWHERE event_date >= '2026-01-01' GROUP BY event_type"
Interpreting results
- High p95/p99 relative to p50 → high variance caused by occasional large reads or merges.
- Throughput plateaus despite adding nodes → check data skew, partitioning, or network bottlenecks.
- Memory spikes during JOINs → increase external join thresholds or rewrite query plan.
Illustrative benchmark findings (hypothetical example)
We ran a synthetic analytics workload on a 4-node NVMe cluster (16 vCPU, 64 GB RAM each) with 100M rows. Key findings:
- Baseline (monthly partition, order by user_id): p95 = 850 ms for typical rollup queries.
- Changed ORDER BY to (event_type, user_id) and enabled PREWHERE on event_type: p95 dropped to 210 ms.
- Added a projection for monthly rollups: p95 dropped to 30–60 ms and CPU usage fell by ~40%.
- Smaller index_granularity (4096 → 2048) improved point lookup latency by ~25% but increased memory index footprint by ~18%.
These numbers are illustrative — your results depend on data shape, hardware and query mix. The point: try design changes in a reproducible benchmark before production rollout.
Operational checklist before production
- Run a representative benchmark and capture p50/p95/p99.
- Set user profiles with resource caps (max_memory, concurrency, timeouts).
- Schedule regular OPTIMIZE and monitor system.parts to avoid part explosion.
- Use partitions aligned to retention and make TTLs explicit for data lifecycle.
- Enable query logging and set up alerts on spikes in query time or part counts.
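The TTL item in the checklist can be made explicit with, for example (the hot/cold volume names are assumptions and require a storage policy configured on the server; intervals are illustrative):

```sql
-- Explicit data lifecycle: move cold partitions to a cheaper volume,
-- then delete them entirely at end of retention.
ALTER TABLE events
    MODIFY TTL event_date + INTERVAL 6 MONTH TO VOLUME 'cold',
               event_date + INTERVAL 24 MONTH DELETE;
```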
2026 trends and what to watch
- Projections and built-in pre-aggregations will continue to replace many hand-built rollup tables. Use projections where deterministic latency matters.
- Cloud-managed ClickHouse services now offer autoscaling and tiered storage; architect for hot/cold splits — hot NVMe for current partitions and cheaper HDD/Object for historical data.
- Better externalized joins and disk-backed algorithms are becoming standard. Tuning externalization thresholds reduces OOMs without sacrificing latency drastically.
- Observability improvements: more detailed profile_events and query tracing make root cause analysis faster in 2026 than earlier years.
Common pitfalls and how to avoid them
- Too many tiny partitions: consolidate them, and use a buffer layer (Buffer tables or async inserts) for small events.
- Using ORDER BY with wrong leading column: match most common filter patterns.
- Relying on LIMIT instead of correct keys: yields brittle pagination and inconsistent performance.
- Using mutations (ALTER ... UPDATE/DELETE) for frequent updates: prefer append-and-deduplicate patterns (e.g., ReplacingMergeTree) or TTL-based expiry for append-heavy workloads.
Actionable takeaways (cheat sheet)
- Partition by time windows (month/day) and keep cardinality low.
- Order by the most selective filter and then the join keys.
- Use PREWHERE to cut IO early; select only needed columns.
- Use projections/materialized views for hot rollups and predictable latency.
- Benchmark changes with clickhouse-benchmark and system metrics.
- Protect the cluster with per-user resource caps and externalization thresholds.
Where to start this week
- Identify the top 10 slowest queries from system.query_log and group by patterns.
- For each, check whether a PREWHERE, projection, or different ORDER BY would reduce scanned rows.
- Run a controlled benchmark against a snapshot of your data and compare two versions: current vs tuned.
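To pull those top offenders from system.query_log, a query along these lines works (the 7-day window and limits are illustrative):

```sql
-- Slowest successful SELECTs over the past week, grouped by query shape
-- and ranked by total time consumed.
SELECT
    normalizedQueryHash(query) AS qhash,
    any(query) AS sample_query,
    count() AS runs,
    quantile(0.95)(query_duration_ms) AS p95_ms,
    sum(query_duration_ms) AS total_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Select'
  AND event_time >= now() - INTERVAL 7 DAY
GROUP BY qhash
ORDER BY total_ms DESC
LIMIT 10;
```

Ranking by total time (not just p95) surfaces the queries whose optimization buys back the most cluster capacity.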
Final thoughts
ClickHouse gives you extraordinary OLAP performance, but only if the data layout and query shapes are aligned. In 2026, with richer features and managed services, the emphasis shifts from raw capability to predictable operational behavior. Use the tuning patterns above as a playbook: design for low variance, guard resources, measure continuously, and prefer precomputation for hot workloads.
Call to action
Ready to reduce p99 surprises? Start with a single high-variance query: run the benchmark plan above, try a projection or an ORDER BY change, and compare p95/p99. If you want, download a ready-to-run benchmark repo (contains data generator, queries, and dashboards) from our toolkit and run it in a staging cluster. Share your results with the ClickHouse community or hire experts to harden your cluster for production SLAs.