Performance Analysis of ClickHouse Queries

Performance Analysis
of ClickHouse Queries

What is ClickHouse?

Open-source distributed analytical DBMS.

https://github.com/yandex/ClickHouse/

Used in: CARTO, Mercadona, Zara, Idealista, Corunet, ... Cloudflare, Spotify, Amadeus, Bloomberg, Cisco, Deutsche Bank, Tencent, ByteDance...

Query performance analysis

Trivial facts

... and common truths.

What hardware resources do the system use?

— CPU;

— RAM;

— disk;

— network.

... also the performance can be bounded by "sleep".

What hardware resources do the system use?

Resource utilization — is an aggregated value,
that usually cannot show the full picture.

If the utilization is "100%"
— it doesn't necessarily mean that you cannot make performance better.

If the utilization is less than "100%"
— it doesn't necessarily mean, that it's possible to do better.

Example 1, CPU

Look at CPU utilization in top, htop, etc.

If it shows that CPU core is loaded 100% — it means that CPU is processing instructions, but:

— CPU can process different number of instructions per clock (IPC);

— CPU can spent time waiting for data to be loaded from memory or cache,
     but this time is also accounted as CPU utilization (perf record);

— CPU can work with lowered clock frequency (turbostat, dmesg);

Example 1, CPU

If it show that CPU (sum for all cores) is loaded for 50%:

— it's possible that it is enough,
     if the Hyper-Threading does not make sense for your code;

— it's possible that you see some average metric for time interval, but
     CPU is fully utilized at smaller intervals processing quick-response interactive queries.

Time is spent for what? What is the bottleneck for the query?

Example 2, disks

... and SSD.

A program spend all its time for disk read,
but iostat shows utilization less than 100%:
— you use RAID, but the program performs reads with not big enough
     buffer size with O_DIRECT or without readahead — disks
     are used sequentialy rather than in parallel;
— you use RAID-10, but it has near, rather than far layout in mdadm;
— you use RAID-5/6, but have wrong stripe cache size;
— you do random reads with not enough level of paralellism.
      (for SSD always use AIO (io_submit) to perform random reads).

And finally: always compress data stored on disk.

Example 3, RAM

Linux shows that there is almost no "free" RAM:
— keep calm and don't look at "free" memory: https://www.linuxatemyram.com/.

Top show that the process allocated more memory than exists in the system:
— don't look at VIRT memory (large size of address space
— it is fine, better to look at RSS).

Memory usage graph shows that the process doesn't release memory:
— modern memory allocators (almost) don't release memory back to OS directly.

And finally: always disable swap;

Example 4, network

A program spent all its time transferring data over the network,
but network is not utilized enough:

— probably you create a new TCP connection
for every data transfer;

— probably you use a single TCP connection
to transfer files over the network with high packet loss;

— probably you send data by small chunks
with TCP_NODELAY enabled.

And finally: always compress data transferred over the network.

 

Basic toolset

top, htop: CPU, memory by process;

dstat: IO, network, ...;

iostat: IO by devices;

iotop: IO by process;

iftop: network by hosts;

perf top: CPU by functions inside processes;

Example: top by threads

Example: clickhouse-benchmark + perf top

Resource usage observability
inside ClickHouse

What is happening right now?

SHOW PROCESSLIST

Или SELECT * FROM system.processes

SELECT * FROM system.merges

«clickhouse top»:
watch -n1 'clickhouse-client --query="SHOW PROCESSLIST"'

What had happened before?

system.query_log
— can be enabled globally, for user, for session or for query;
— a setting: log_queries = 1;
— queries are logged twice: at the beginning and at the end of execution.

system.part_log
— can be enabled globally in config.xml;
— all operations on MergeTree data parts are logged.

System tables with logs are updated asynchronously.
(each 7 seconds). SYSTEM FLUSH LOGS to force write.

What can you do with query_log?

— select a list of queries
and perform load testing with clickhouse-benchmark:

clickhouse-client --query=" SELECT query FROM system.query_log WHERE type = 2 AND event_date = today() " > queries.tsv

clickhouse-benchmark < queries.tsv

— find the first query (root cause) that was lead to blow-up;

— create a report of resource usage by users.

Query tracing

SET send_logs_level = 'trace'

— logs will be sent from all nodes
participated in query processing
(distributed tracing);

Resource metrics

Global:
— system.events;
— system.metrics;
— system.asynchronous_metrics.

For each query:
— system.processes;
— system.query_log.

For each query execution thread:
— system.query_thread_log.

Resource metrics

User-space:
— FileOpen;
— NetworkReceiveElapsedMicroseconds;
— ReadBufferFromFileDescriptorReadBytes...

Kernel-space:
— OSReadChars;
— OSReadBytes;
— OSIOWaitMicroseconds;
— OSCPUWaitMicroseconds;
— UserTimeMicroseconds;
— SoftPageFaults...

Per-query resource metrics

MetricFirst runSecond run
Total time6.7 sec1.3 sec
...FileDescriptorReadBytes2.56 GiB2.56 GiB
ReadChars2.56 GiB2.56 GiB
ReadBytes3.23 GiB3.13 MiB
IOWait87.3 sec0.023 sec
CPUWait0.027 sec0.045 sec
UserTime9.9 sec13.4 sec
SystemTime2.4 sec2.2 sec

.

Web site: https://clickhouse.yandex/

GitHub: https://github.com/yandex/ClickHouse/ (now with 7200 stars)

Twitter: https://twitter.com/ClickHouseDB

Telegram chat: https://telegram.me/clickhouse_en and https://telegram.me/clickhouse_ru (now with over 2000 members)

Google groups: https://groups.google.com/forum/#!forum/clickhouse

Maillist: clickhouse-feedback@yandex-team.com

+ meetups. Moscow, Saint-Petersburg, Novosibirsk, Ekaterinburg, Minsk, Nizhny Novgorod, Berlin, Palo Alto, Beijing, Sunnyvale, San Francisco, Paris, Amsterdam, Madrid, Limassol, Shenzhen, Shanghai...