Hidden Bottleneck in ClickHouse Query Planning Identified by Cloudflare

Summary
Cloudflare traced a major slowdown in its billing pipeline to hidden lock contention inside ClickHouse’s query planning layer after a partitioning change altered execution behavior at scale. Traditional database metrics did not clearly expose the issue, which was ultimately identified using flame graph analysis. The fix involved replacing an exclusive lock with a shared lock in the query planner to reduce contention under heavy concurrency.
Key Updates
- Cloudflare operates ClickHouse across infrastructure managing more than one hundred petabytes of data.
- A partitioning change unexpectedly introduced severe lock contention inside the query planning layer.
- Standard operational metrics did not clearly identify the root cause of the slowdown.
- Engineers used flame graph analysis to isolate contention within the planner.
- Replacing an exclusive lock with a shared lock reduced planner contention and restored pipeline performance.
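The write-up does not include the patch itself, but the general pattern behind the fix is familiar: planner state that is read by many concurrent queries and mutated only rarely can be guarded by a reader-writer lock instead of a plain exclusive mutex. The C++ sketch below is illustrative only; PlannerCatalog, snapshot_, and the method names are hypothetical stand-ins, not ClickHouse internals.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical planner-side catalog: many queries read partition metadata
// during planning; only occasional partition/DDL changes mutate it.
class PlannerCatalog {
public:
    // With an exclusive lock, every planning-time lookup serializes behind
    // every other one. A shared (reader-writer) lock lets readers proceed
    // in parallel and reserves exclusive access for the rare writers.
    std::string lookupPartition(const std::string& table) const {
        std::shared_lock lock(mutex_);   // shared: many concurrent readers
        auto it = snapshot_.find(table);
        return it == snapshot_.end() ? std::string{} : it->second;
    }

    void updatePartition(const std::string& table, const std::string& part) {
        std::unique_lock lock(mutex_);   // exclusive: infrequent writes
        snapshot_[table] = part;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, std::string> snapshot_;
};

int main() {
    PlannerCatalog catalog;
    catalog.updatePartition("billing_events", "2024-01");
    return catalog.lookupPartition("billing_events").empty() ? 1 : 0;
}
```

The trade-off mirrors the one described in the source: under heavy query concurrency, read-mostly planner state pays the full cost of exclusive locking unless readers are allowed to share.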
Why It Matters
The deeper signal is not the specific ClickHouse bug, but the growing complexity of diagnosing coordination failures inside large-scale distributed systems.
As infrastructure scales, bottlenecks increasingly emerge from orchestration layers, scheduling behavior, lock contention, and internal coordination overhead rather than obvious resource exhaustion. These failure modes are often difficult to detect through conventional dashboards because high-level metrics may appear healthy while execution stalls internally.
Cloudflare’s investigation also reinforces an important operational lesson: scaling changes that appear isolated — such as partitioning adjustments — can alter concurrency behavior across deeper execution paths in unexpected ways.
The reliance on flame graphs is also telling: even in a heavily instrumented modern infrastructure stack, it was low-level execution visibility, not dashboards, that ultimately exposed the problem.
Builder Takeaway
Builders operating large data systems should not assume standard telemetry is sufficient for diagnosing performance regressions under concurrency.
Partitioning changes, planner behavior, and synchronization primitives can become hidden operational risk surfaces at scale. Teams running analytical infrastructure should evaluate whether their observability tooling provides enough visibility into execution paths, lock contention, and coordination overhead — not just CPU, memory, and query latency metrics.
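One concrete way to build that visibility, sketched below under stated assumptions (the LockWaitStats type, InstrumentedRWLock wrapper, and counter names are hypothetical, not any particular library's API), is to instrument lock acquisition itself so that lock-wait time becomes a first-class metric alongside CPU, memory, and query latency.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical counters: how long callers waited to acquire the lock.
struct LockWaitStats {
    std::atomic<uint64_t> reader_wait_us{0};
    std::atomic<uint64_t> writer_wait_us{0};
};

// Wraps a shared_mutex so contention shows up as a measurable number
// rather than only as mysteriously slow queries.
class InstrumentedRWLock {
public:
    explicit InstrumentedRWLock(LockWaitStats& stats) : stats_(stats) {}

    std::shared_lock<std::shared_mutex> lockShared() {
        auto start = std::chrono::steady_clock::now();
        std::shared_lock<std::shared_mutex> lock(mutex_);
        stats_.reader_wait_us += elapsedUs(start);
        return lock;
    }

    std::unique_lock<std::shared_mutex> lockExclusive() {
        auto start = std::chrono::steady_clock::now();
        std::unique_lock<std::shared_mutex> lock(mutex_);
        stats_.writer_wait_us += elapsedUs(start);
        return lock;
    }

private:
    static uint64_t elapsedUs(std::chrono::steady_clock::time_point start) {
        return std::chrono::duration_cast<std::chrono::microseconds>(
                   std::chrono::steady_clock::now() - start).count();
    }

    std::shared_mutex mutex_;
    LockWaitStats& stats_;
};

int main() {
    LockWaitStats stats;
    InstrumentedRWLock lock(stats);

    // Simulate many concurrent "planner" readers plus one occasional writer.
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) {
        workers.emplace_back([&] {
            for (int j = 0; j < 1000; ++j) { auto guard = lock.lockShared(); }
        });
    }
    workers.emplace_back([&] {
        for (int j = 0; j < 100; ++j) { auto guard = lock.lockExclusive(); }
    });
    for (auto& t : workers) t.join();

    std::printf("reader wait: %llu us, writer wait: %llu us\n",
                static_cast<unsigned long long>(stats.reader_wait_us.load()),
                static_cast<unsigned long long>(stats.writer_wait_us.load()));
    return 0;
}
```

Exported as a time series, counters like these make contention regressions visible the moment a deployment or partitioning change shifts concurrency behavior, rather than after a pipeline has already slowed down.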