clickhouse · skill by canopy-network
ClickHouse query optimization and schema design. Use when writing, optimizing, or reviewing ClickHouse queries and table schemas.
```sh
npx ai-builder add skill canopy-network/clickhouse
```

Installs to `.claude/skills/clickhouse/`.
# ClickHouse Query Optimization Skill

Guidelines for optimal ClickHouse query performance and schema design. **MUST** rules are critical for performance; **SHOULD** rules are strong recommendations; **CAN** rules are optional optimizations.

## 1 — Columnar Storage Fundamentals

- **COL-1 (MUST)**: Never use `SELECT *` — only select the columns you need
- **COL-2 (MUST)**: Understand that each column is stored as a separate file on disk
- **COL-3 (SHOULD)**: Design queries to read minimal columns for maximum performance

**Key insight**: "The less data you read, the faster your queries run."

## 2 — Granules: The Fundamental Unit

- **GR-1 (MUST)**: Understand granules are 8,192 rows by default (configurable)
- **GR-2 (MUST)**: Know that granules are indivisible — if one row matches, all 8,192 rows are read
- **GR-3 (SHOULD)**: Design primary keys to maximize granule skipping

**Key insight**: Sparse indexes store the first value of every 8,192nd row. ClickHouse performs binary search to identify matching granules, then reads only those granules.

## 3 — Primary Key Design (ORDER BY)

**This is the single most important optimization — proper ORDER BY selection can improve performance by 100× or more.**

- **PK-1 (MUST)**: Include columns that appear most frequently in WHERE clauses
- **PK-2 (MUST)**: Order columns by cardinality — **lower cardinality first**
- **PK-3 (MUST)**: Include time components for time-series workloads
- **PK-4 (SHOULD)**: Avoid high-cardinality columns (UUIDs) as the first column

### Cardinality Ordering Rationale

| Position | Search Algorithm | Cardinality Impact |
|----------|------------------|--------------------|
| First column | Binary search O(log₂ n) | Works efficiently regardless of cardinality |
| Secondary columns | Exclusion search | Effectiveness depends on predecessor cardinality |

**Bad**: `ORDER BY (user_uuid, timestamp)` — a UUID as the first column randomizes secondary column values across granules

**Good**: `ORDER BY (tenant_id, timestamp, user_uuid)` — low cardinality first enables effective pruning

### Compound Key Query Support

Key `(user_id, timestamp, event_type)` efficiently supports:

- `WHERE user_id = X` — highly efficient (uses first column)
- `WHERE user_id = X AND timestamp > Y` — efficient (uses first two columns)
- `WHERE timestamp > Y` — less efficient (skips first column)

## 4 — Query Optimization Techniques

### PREWHERE Clause

- **PRE-1 (SHOULD)**: Use PREWHERE for highly selective filter conditions
- **PRE-2 (SHOULD)**: Use PREWHERE when filter columns are smaller than SELECT columns

PREWHERE reads only the filter condition columns first, evaluates the condition, then reads the remaining columns only for matching rows.

### Approximate Functions

- **APX-1 (SHOULD)**: Use `uniq()` instead of `COUNT(DISTINCT)` when approximate results are acceptable
- **APX-2 (CAN)**: Use HyperLogLog-based functions for 10-100× performance improvement

Approximate functions use ~12 KB of fixed memory vs O(n) for exact counts.

### Hashing Long Strings

- **HASH-1 (SHOULD)**: When grouping by long strings, hash them to fixed-size integers (5-10× faster)
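To make the ORDER BY rules from section 3 concrete, here is a minimal schema sketch. The `events` table and its columns are hypothetical, chosen to mirror the "Good" example above; the `LowCardinality` column anticipates the data-type rules in section 5.

```sql
-- Hypothetical events table: low-cardinality tenant_id leads the key (PK-2),
-- the time component follows (PK-3), and the high-cardinality UUID comes last (PK-4).
CREATE TABLE events
(
    tenant_id  UInt32,
    timestamp  DateTime,
    user_uuid  UUID,
    event_type LowCardinality(String),  -- dictionary-encoded per DT-2 (section 5)
    url        String
)
ENGINE = MergeTree
ORDER BY (tenant_id, timestamp, user_uuid);
```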
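Against that hypothetical table, PRE-1 and APX-1 could combine as follows; the filter values are placeholders.

```sql
-- PREWHERE reads only the small event_type column first, then fetches the
-- remaining columns for matching rows (PRE-1/PRE-2); uniq() trades exact
-- counting for ~12 KB of fixed memory (APX-1).
SELECT uniq(user_uuid) AS approx_buyers
FROM events
PREWHERE event_type = 'purchase'
WHERE tenant_id = 42
  AND timestamp >= now() - INTERVAL 1 DAY;
```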
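For HASH-1, a sketch of grouping by a 64-bit hash instead of the long string itself; `cityHash64` is a built-in non-cryptographic hash, and the `url` column is again hypothetical.

```sql
-- Group by a fixed-size hash instead of the long string itself (HASH-1);
-- any(url) keeps one readable representative per group.
SELECT
    cityHash64(url) AS url_hash,
    any(url)        AS sample_url,
    count()         AS hits
FROM events
GROUP BY url_hash
ORDER BY hits DESC
LIMIT 20;
```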
## 5 — Data Type Optimization

- **DT-1 (MUST)**: Avoid Nullable columns when possible (they require an additional UInt8 null-mask column)
- **DT-2 (MUST)**: Use `LowCardinality(String)` for columns with <10,000 unique values
- **DT-3 (SHOULD)**: Choose the smallest appropriate types (UInt8 vs Int64 saves 7 bytes per row)

### LowCardinality Benefits

- Dictionary encoding applied automatically
- Significant compression and query performance gains
- Ideal for status fields, country codes, category names

## 6 — Denormalization

- **DN-1 (SHOULD)**: Prefer denormalized data over JOINs
- **DN-2 (SHOULD)**: Accept increased storage for eliminated JOIN overhead

**Key insight**: ClickHouse performs best with denormalized data. Columnar compression mitigates the storage increase while eliminating hash table construction overhead.

## 7 — Advanced Optimizations

### Materialized Views

- **MV-1 (SHOULD)**: Use materialized views for pre-computed aggregates (see the sketch after the performance comparison below)
- **MV-2 (CAN)**: Leverage incremental updates at insert time rather than query time

### Projections

- **PROJ-1 (CAN)**: Maintain multiple physical sort orders for different query patterns
- **PROJ-2 (CAN)**: Use projections for pre-computed aggregates

### Skip Indexes

- **SKIP-1 (SHOULD)**: Use skip indexes as a last resort, after primary key and projection optimization
- **SKIP-2 (CAN)**: Consider bloom filters for high-cardinality column filtering

### Dictionaries

- **DICT-1 (CAN)**: Use in-memory dictionaries for O(1) lookup times
- **DICT-2 (CAN)**: Replace JOINs with dictionary lookups where applicable (see the sketch after the review checklist)

## 8 — Monitoring & Debugging

- **MON-1 (MUST)**: Use `system.query_log` for query performance analysis
- **MON-2 (MUST)**: Use `EXPLAIN indexes = 1` to verify granule selection
- **MON-3 (SHOULD)**: Monitor the granule ratio (granules selected / total granules)
- **MON-4 (SHOULD)**: Measure, change one variable, measure again

### Key Metrics to Monitor

| Metric | Source | What to Look For |
|--------|--------|------------------|
| Rows read | query_log | Should be close to result set size |
| Granules selected | EXPLAIN | Lower ratio = better pruning |
| Memory usage | query_log | Watch for excessive allocation |
| Execution time | query_log | Track query patterns |

## 9 — Query Execution

- **EX-1 (SHOULD)**: Understand that ClickHouse uses all CPU cores by default
- **EX-2 (CAN)**: Each thread processes independent data ranges in parallel
- **EX-3 (CAN)**: Use multi-pass grouping for large GROUP BY operations

### Execution Pipeline

1. **Parsing and Analysis** — Create the execution plan
2. **Optimization** — Prune unnecessary data via the sparse index
3. **Pipeline Execution** — Parallel processing across CPU cores
4. **Final Processing** — Merge results

## Performance Comparison Example

### Without Optimization (wrong ORDER BY)

```sql
-- ORDER BY (postcode, address), querying WHERE town = 'LONDON'
-- Rows processed: 27.64 million (full table scan)
-- Time: 44ms
-- Data read: 44.21 MB
```

### With Optimization (correct ORDER BY)

```sql
-- ORDER BY (town, price)
-- Rows processed: 81,920 (selected granules only)
-- Time: 5ms (8.8× faster)
-- Data read: 13.03 MB (3.4× reduction)
```
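As a sketch of MV-1 and MV-2, the following pre-aggregates daily counts for the hypothetical `events` table from earlier; the view name and engine choice are illustrative.

```sql
-- Incrementally maintained daily rollup: rows are aggregated at insert
-- time (MV-2), and SummingMergeTree collapses partial sums on merge.
CREATE MATERIALIZED VIEW events_per_day_mv
ENGINE = SummingMergeTree
ORDER BY (tenant_id, day)
AS SELECT
    tenant_id,
    toDate(timestamp) AS day,
    count() AS events
FROM events
GROUP BY tenant_id, day;
```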
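In support of MON-1 and MON-2, two diagnostic queries; the column names come from the standard `system.query_log` schema, and the filter value is a placeholder.

```sql
-- MON-2: inspect which granules survive pruning; a low
-- "selected / total" ratio in the Indexes section means good skipping.
EXPLAIN indexes = 1
SELECT count() FROM events WHERE tenant_id = 42;

-- MON-1: recent finished queries, largest readers first.
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS data_read,
    formatReadableSize(memory_usage) AS peak_memory
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY read_bytes DESC
LIMIT 10;
```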
## Review Checklist

When reviewing ClickHouse schemas and queries:

1. **Column Selection**: Are only necessary columns selected?
2. **ORDER BY Design**: Is the cardinality ordering correct (low to high)?
3. **Time Component**: Does ORDER BY include a timestamp for time-series data?
4. **Data Types**: Are Nullable columns avoided? Is LowCardinality used appropriately?
5. **PREWHERE**: Would PREWHERE benefit selective filters?
6. **Granule Efficiency**: Does EXPLAIN show a good granule skip ratio?
7. **Denormalization**: Are JOINs avoided where possible?
8. **Approximate Functions**: Could approximate functions replace exact counts?
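Finally, for checklist item 7 and DICT-1/DICT-2, a hedged dictionary-lookup sketch; the `tenants` source table is hypothetical, and the `FLAT` layout requires a `UInt64` key, hence the cast.

```sql
-- Hypothetical in-memory dictionary, periodically refreshed from a source table.
CREATE DICTIONARY tenant_names
(
    tenant_id UInt64,
    name      String
)
PRIMARY KEY tenant_id
SOURCE(CLICKHOUSE(TABLE 'tenants'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(FLAT());

-- O(1) lookup replaces a hash JOIN against the tenants table (DICT-2).
SELECT
    dictGet('tenant_names', 'name', toUInt64(tenant_id)) AS tenant,
    count() AS events
FROM events
GROUP BY tenant;
```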