ClickHouse — distributed analytical column-oriented DBMS
Here is how "traditional" row-oriented system work:
Here is how column-oriented systems work:
— column-oriented DBMSs from Germany.
Initially we developed ClickHouse for Yandex.Metrica
Yandex.Metrica — №1 web analytics system in Russia, №2 in the world.
(~25 bln. events/day, we must store all of them
in non-aggregated form for customizable reports)
"Evolution of Data Structures in Yandex Metrica"
Nothing was suitable. Then we developed ClickHouse.
2010 — start of development
2012 — first production usages
2015 — launch of Metrica 2
2016 — open-source
In Yandex, ClickHouse has proven to be easy of use.
For example, we have detailed documentation from the beginning.
In 2012..2016, ClickHouse gained a widespread inside Yandex:
Mail, Comparison Shopping, Ads, Webmaster Tools, Infrastructure, Business Analytics...
Yandex analysts installed ClickHouse on their own VMs without any assistance.
* If you want to try ClickHouse, single server will be enough.
We think ClickHouse is too good to be used solely by Yandex.
We made it open-source. License: Apache 2.0.
For well structured, clean, immutable events.
Click stream. Web analytics. Adv. networks. RTB. E-commerce.
Analytics for online games. Sensor and monitoring data. Telecom data.
ClickHouse doesn't have UPDATE statement and full-featured transactions.
If you want high load of small single-row queries, please use another system.
Blob-store, document oriented
ClickHouse is intended for vast amount of fine-grained data.
Better to make up single wide fact table with pre-joined dimensions.
More than 100 companies are already using ClickHouse.
Examples: Cloudflare, Vertamedia, Mail.ru, Carto, Lifestreet...
How do we analyze over O(100B) DNS requests daily.
"ClickHouse enables us and our customers to explore the the dataset in real time to get operational insights. Due to many of the optimizations built into ClickHouse we are able to store the data for a long time allowing us to look events is perspective and at historical trends."
Bioinformatics — evolutionary genetics:
"We are exploring evolution of novel genes in genomes because if seems that genomes are far from being static as previously believed and what actually happens is that new genes are constantly being added and old genes are lost."
Search engine and analytics for Bitcoin transactions:
"We have quite large tables on just single server and everything works really fast — with any filters and sorting everything is processed just instantly."
CREATE TABLE table ON CLUSTER cluster ...
Now ClickHouse works with Tableau.
Limited support for table aliases and qualified identifiers.
SELECT t.column FROM table AS t
Support for aliases of functions and data types.
SELECT CAST(123 AS INT)
Allow both `MySQL` style and "ANSI" style quotes for identifiers.
SQL style escaping in string literals.
Type conversions in INSERT SELECT
Automatic failover for lagged replicas:
SET max_replica_delay_for_distributed_queries = 300,
fallback_to_stale_replicas_for_distributed_queries = 1;
Failover for replicas where table doesn't exist.
Manually disable replica from leader election:
Synchronous INSERTs into Distributed tables:
SET insert_distributed_sync = 1;
Allow to load parts of configuration from ZooKeeper:
<remote_servers from_zk="/path/to/node" />
Isolation of users in ZooKeeper (chroot, authentication, ACL).
SYSTEM RELOAD DICTIONARY 'my_dict';
Portable clickhouse binary:
milovidov@milovidov-xenial:~$ scp ./clickhouse milovidov-centos:~
Dictionary database and tables
CREATE DATABASE dictionaries ENGINE = Dictionary
Invalidate query for dictionaries
SELECT update_time FROM dictionary_source where id = 1
ip_trie type of dictionary (contributed by CloudFlare)
MySQL dictionary over socket;
over SSL connection (contributed by Mail.ru)
... WHERE pointInPolygon((lat, lon),
[(x1, y1), (x2, y2), (x3, y3)...])
substring with dynamic and negative arguments
concat for arrays, arraySlice
arrayPushBack, arrayPushFront, arrayPopBack, arrayPopFront
ALTER table CLEAR COLUMN IN PARTITION
— clear old data for specific column;
OPTIMIZE TABLE ... PARTITION ... DEDUPLICATE
— cleanup accidentially duplicated data
Limited support for ALTER of primary key columns:
Enum, Date <-> UInt16, DateTime <-> UInt32
Sessions in HTTP interface
curl 'http://host:port/?session_id=123&session_timeout=60' \
-d 'CREATE TEMPORARY TABLE t'
Allow to send progress info in HTTP interface
Support for HTTPS in clickhouse-server
On the way: Support for SSL in native protocol
WITH ... SELECT
Type conversions for Merge tables
Allow to skip errors in INSERT data: input_format_allow_errors_*
- dictionary for ranges of IP-addresses (CloudFlare)
- infer index use with pk subexpression (CloudFlare)
- sumMap aggregate function (CloudFlare)
- topK aggregate function (CloudFlare)
- fix for shuffle in choosing ZK server (CloudFlare)
- fix of UNION ALL (Unknown Chinese Company)
- SSL for MySQL dictionaries (Mail.ru)
- UUID data type and functions (Guillaume Tassery)
- functions for MAC addresses (Dmitry Luhtionov)
- findClusterIndex, findClusterValue (mfridental)
- ForEach combinator for aggregate functions (mfridental)
- 'none' compression method (Sematext Group, Inc.)
- update zstd (SAKURA Internet Inc.)
Almost ready :)
Nov .. Dec 2017
In design stage.
Resource pools (CPU, disk IO, network bandwidth)
Web site: https://clickhouse.yandex/
Google groups: https://groups.google.com/forum/#!forum/clickhouse
Telegram channel: https://telegram.me/clickhouse_en
and https://telegram.me/clickhouse_ru (now with 983 members)
Python, PHP, Perl, Go,
Node.js, Ruby, C++, .NET, Scala, R, Julia, Rust
Tabix (tabix.io) — specifically developed for ClickHouse.
Grafana, Redash, Apache Zeppelin,