# ClickHouse Query Optimization Skill

Guidelines for optimal ClickHouse query performance and schema design. **MUST** rules are critical for performance; **SHOULD** rules are strong recommendations; **CAN** rules are optional optimizations.

## 1 — Columnar Storage Fundamentals

- **COL-1 (MUST)**: Never use `SELECT *` — only select columns you need
- **COL-2 (MUST)**: Understand that each column is stored as a separate file on disk
- **COL-3 (SHOULD)**: Design queries to read minimal columns for maximum performance

**Key insight**: "The less data you read, the faster your queries run."

## 2 — Granules: The Fundamental Unit

- **GR-1 (MUST)**: Understand granules are 8,192 rows by default (configurable)
- **GR-2 (MUST)**: Know that granules are indivisible — if one row matches, all 8,192 rows are read
- **GR-3 (SHOULD)**: Design primary keys to maximize granule skipping

**Key insight**: The sparse primary index stores one entry per granule (the primary key values of that granule's first row), so roughly one entry per 8,192 rows. ClickHouse binary-searches this index to identify candidate granules, then reads only those granules.
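A DDL sketch showing where the granule size is configured (the table is hypothetical; 8,192 is the default and is written out only for illustration):

```sql
CREATE TABLE events
(
    tenant_id  UInt32,
    event_time DateTime,
    user_id    UInt64,
    event_type LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (tenant_id, event_time)
SETTINGS index_granularity = 8192;  -- default granule size
```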

## 3 — Primary Key Design (ORDER BY)

**This is the single most important optimization — proper ORDER BY selection can improve performance by 100× or more.**

- **PK-1 (MUST)**: Include columns that appear most frequently in WHERE clauses
- **PK-2 (MUST)**: Order columns by cardinality — **lower cardinality first**
- **PK-3 (MUST)**: Include time components for time-series workloads
- **PK-4 (SHOULD)**: Avoid high-cardinality columns (UUIDs) as first column

### Cardinality Ordering Rationale

| Position | Search Algorithm | Cardinality Impact |
|----------|-----------------|-------------------|
| First column | Binary search O(log₂ n) | Works efficiently regardless of cardinality |
| Secondary columns | Exclusion search | Effectiveness depends on predecessor cardinality |

- **Bad**: `ORDER BY (user_uuid, timestamp)` — a UUID first column randomizes secondary column values across granules
- **Good**: `ORDER BY (tenant_id, timestamp, user_uuid)` — low cardinality first enables effective pruning
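A DDL sketch of the "good" ordering (the schema is hypothetical):

```sql
-- Low-cardinality tenant_id leads, time second, high-cardinality UUID last.
CREATE TABLE user_events
(
    tenant_id UInt16,
    ts        DateTime,
    user_uuid UUID,
    payload   String
)
ENGINE = MergeTree
ORDER BY (tenant_id, ts, user_uuid);
```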

### Compound Key Query Support

A key of `(user_id, timestamp, event_type)` supports these query shapes with varying efficiency:
- `WHERE user_id = X` — highly efficient (binary search on the first column)
- `WHERE user_id = X AND timestamp > Y` — efficient (uses the first two columns)
- `WHERE timestamp > Y` — less efficient (cannot bind the first column, so most granules may be scanned)

## 4 — Query Optimization Techniques

### PREWHERE Clause

- **PRE-1 (SHOULD)**: Use PREWHERE for highly selective filter conditions
- **PRE-2 (SHOULD)**: Use PREWHERE when filter columns are smaller than SELECT columns

PREWHERE reads only the filter-condition columns first, evaluates the condition, then reads the remaining columns only for matching rows. Note that ClickHouse moves WHERE conditions to PREWHERE automatically (`optimize_move_to_prewhere`, on by default); explicit PREWHERE is for when you know better than the heuristic which predicate to evaluate first.
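A sketch, assuming a small, selective `status` column and a large `payload` column:

```sql
-- Only `status` is read for all candidate granules; `payload` is
-- fetched just for rows that survive the PREWHERE filter.
SELECT payload
FROM events
PREWHERE status = 'error'
WHERE event_date >= '2024-01-01';
```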

### Approximate Functions

- **APX-1 (SHOULD)**: Use `uniq()` instead of `COUNT(DISTINCT)` when approximate results are acceptable
- **APX-2 (CAN)**: Use HyperLogLog-based functions for 10-100× performance improvement

Approximate functions use a small, fixed amount of memory (on the order of kilobytes), while exact distinct counts need memory proportional to the number of distinct values.
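A sketch of the trade-off (the table is hypothetical):

```sql
-- Approximate distinct count: bounded memory, small relative error.
SELECT uniq(user_id) FROM events;

-- Exact distinct count: memory grows with the number of distinct values.
-- count(DISTINCT user_id) maps to uniqExact by default.
SELECT uniqExact(user_id) FROM events;
```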

### Hashing Long Strings

- **HASH-1 (SHOULD)**: When grouping by long strings, hash to fixed-size integers (5-10× faster)
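A sketch using the built-in `cityHash64` function (the `page_views` table is hypothetical):

```sql
-- Groups by a fixed-size UInt64 hash instead of a long string;
-- any(url) recovers a representative URL per group. Collisions
-- are possible but rare at 64 bits.
SELECT
    cityHash64(url) AS url_hash,
    any(url)        AS sample_url,
    count()         AS hits
FROM page_views
GROUP BY url_hash
ORDER BY hits DESC
LIMIT 10;
```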

## 5 — Data Type Optimization

- **DT-1 (MUST)**: Avoid Nullable columns when possible (requires additional UInt8 null-mask column)
- **DT-2 (MUST)**: Use `LowCardinality(String)` for columns with <10,000 unique values
- **DT-3 (SHOULD)**: Choose smallest appropriate types (UInt8 vs Int64 saves 7 bytes per row)

### LowCardinality Benefits

- Dictionary encoding applied automatically
- Significant compression and query performance gains
- Ideal for status fields, country codes, category names
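A DDL sketch combining these rules (the schema is hypothetical):

```sql
CREATE TABLE orders
(
    order_id   UInt64,
    status     LowCardinality(String),  -- dictionary-encoded, not Nullable
    country    LowCardinality(String),
    quantity   UInt8,                   -- smallest type that fits the range
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (status, created_at);
```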

## 6 — Denormalization

- **DN-1 (SHOULD)**: Prefer denormalized data over JOINs
- **DN-2 (SHOULD)**: Accept increased storage for eliminated JOIN overhead

**Key insight**: ClickHouse performs best with denormalized data. Columnar compression mitigates storage increase while eliminating hash table construction overhead.
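A sketch of the trade (the schema is hypothetical): customer attributes are copied into the fact table at insert time, so queries never join to a `customers` table.

```sql
CREATE TABLE orders_denorm
(
    order_id         UInt64,
    customer_id      UInt64,
    customer_tier    LowCardinality(String),  -- copied from customers
    customer_country LowCardinality(String),  -- compression keeps copies cheap
    amount           Decimal(18, 2),
    created_at       DateTime
)
ENGINE = MergeTree
ORDER BY (customer_tier, created_at);
```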

## 7 — Advanced Optimizations

### Materialized Views

- **MV-1 (SHOULD)**: Use materialized views for pre-computed aggregates
- **MV-2 (CAN)**: Leverage incremental updates at insert time rather than query time
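A sketch of an insert-time aggregate (all names are hypothetical): each insert into `events` appends partial daily counts, which background merges later sum.

```sql
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree
ORDER BY (event_date, event_type)
AS SELECT
    toDate(event_time) AS event_date,
    event_type,
    count() AS events
FROM events
GROUP BY event_date, event_type;
```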

### Projections

- **PROJ-1 (CAN)**: Maintain multiple physical sort orders for different query patterns
- **PROJ-2 (CAN)**: Use projections for pre-computed aggregates
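A sketch (table and projection names are hypothetical):

```sql
-- Second physical sort order for queries filtering on user_id.
ALTER TABLE events
    ADD PROJECTION by_user (SELECT * ORDER BY user_id);

-- Build the projection for parts that already exist.
ALTER TABLE events MATERIALIZE PROJECTION by_user;
```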

### Skip Indexes

- **SKIP-1 (SHOULD)**: Use skip indexes as last resort after primary key and projection optimization
- **SKIP-2 (CAN)**: Consider bloom filters for high-cardinality column filtering
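A sketch of SKIP-2 (names are hypothetical; 0.01 is the target false-positive rate):

```sql
ALTER TABLE events
    ADD INDEX idx_url url TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for existing parts.
ALTER TABLE events MATERIALIZE INDEX idx_url;
```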

### Dictionaries

- **DICT-1 (CAN)**: Use in-memory dictionaries for O(1) lookup times
- **DICT-2 (CAN)**: Replace JOINs with dictionary lookups where applicable
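A sketch (the dictionary name, source table, and refresh window are hypothetical):

```sql
CREATE DICTIONARY customer_dict
(
    customer_id UInt64,
    tier        String
)
PRIMARY KEY customer_id
SOURCE(CLICKHOUSE(TABLE 'customers'))
LAYOUT(HASHED())            -- fully in memory, O(1) lookups
LIFETIME(MIN 300 MAX 600);  -- refresh every 5-10 minutes

-- Dictionary lookup instead of a JOIN against customers.
SELECT order_id, dictGet('customer_dict', 'tier', customer_id) AS tier
FROM orders;
```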

## 8 — Monitoring & Debugging

- **MON-1 (MUST)**: Use `system.query_log` for query performance analysis
- **MON-2 (MUST)**: Use `EXPLAIN indexes = 1` to verify granule selection
- **MON-3 (SHOULD)**: Monitor granule ratio (granules selected / total granules)
- **MON-4 (SHOULD)**: Measure, change one variable, measure again
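A sketch of MON-1 and MON-2 (the `events` table in the EXPLAIN is hypothetical):

```sql
-- Slowest recent queries with rows/bytes read and memory usage.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS read_size,
    formatReadableSize(memory_usage) AS mem,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Verify granule selection; e.g. "Granules: 32/4096" means good pruning.
EXPLAIN indexes = 1
SELECT count() FROM events WHERE tenant_id = 42;
```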

### Key Metrics to Monitor

| Metric | Source | What to Look For |
|--------|--------|-----------------|
| Rows read | query_log | Should be close to result set size |
| Granules selected | EXPLAIN | Lower ratio = better pruning |
| Memory usage | query_log | Watch for excessive allocation |
| Execution time | query_log | Track query patterns |

## 9 — Query Execution

- **EX-1 (SHOULD)**: Understand ClickHouse uses all CPU cores by default
- **EX-2 (CAN)**: Rely on parallel scans — each thread processes an independent range of granules
- **EX-3 (CAN)**: Enable external (spill-to-disk) aggregation for very large GROUP BY operations via `max_bytes_before_external_group_by`

### Execution Pipeline

1. **Parsing and Analysis** — Create execution plan
2. **Optimization** — Prune unnecessary data via sparse index
3. **Pipeline Execution** — Parallel processing across CPU cores
4. **Final Processing** — Merge results
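To inspect stage 3, `EXPLAIN PIPELINE` prints the processor graph, including how many threads each stage runs on (the query and table are illustrative):

```sql
EXPLAIN PIPELINE
SELECT count() FROM events WHERE tenant_id = 42;
```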

## Performance Comparison Example

### Without Optimization (wrong ORDER BY)

```sql
-- ORDER BY (postcode, address), querying WHERE town = 'LONDON'
-- Rows processed: 27.64 million (full table scan)
-- Time: 44ms
-- Data read: 44.21 MB
```

### With Optimization (correct ORDER BY)

```sql
-- ORDER BY (town, price)
-- Rows processed: 81,920 (selected granules only)
-- Time: 5ms (8.8× faster)
-- Data read: 13.03 MB (3.4× reduction)
```

## Review Checklist

When reviewing ClickHouse schemas and queries:

1. **Column Selection**: Are only necessary columns selected?
2. **ORDER BY Design**: Is cardinality ordering correct (low to high)?
3. **Time Component**: Does ORDER BY include timestamp for time-series data?
4. **Data Types**: Are Nullable columns avoided? Is LowCardinality used appropriately?
5. **PREWHERE**: Would PREWHERE benefit selective filters?
6. **Granule Efficiency**: Does EXPLAIN show good granule skip ratio?
7. **Denormalization**: Are JOINs avoided where possible?
8. **Approximate Functions**: Could approximate functions replace exact counts?
