Improve query performance with ClickHouse Data Skipping Index
13 June 2022
7 min read

At IBM Instana™, we process and store every single call collected by Instana tracers with no sampling, retaining the data for the last seven days. Instana’s Unbounded Analytics feature allows for the filtering and grouping of calls by arbitrary tags to gain insights into the unsampled, high-cardinality tracing data. We are able to provide accurate metrics—such as call count, latency percentiles or error rate—and display the details of every single call.

For many of our large clients, over 1 billion calls are stored every day. This number now reaches 18 billion for our largest client and keeps growing. Calls are stored in a single table in ClickHouse and each call tag is stored in a column. Filtering this large number of calls, aggregating the metrics and returning the result within a reasonable time has always been a challenge.

Previously, we created materialized views to pre-aggregate calls by some frequently used tags, such as application/service/endpoint names or HTTP status code. However, we can’t include all tags in the view—especially those with high cardinalities—because it would significantly increase the number of rows in the materialized view and, therefore, slow down the queries.

Filtering on high-cardinality tags not included in the materialized view still requires a full scan of the calls table within the selected time frame, which could take over a minute.

Optimize filtering on HTTP URL

Filtering on HTTP URL is a very frequent use case. The cardinality of HTTP URLs can be very high since we could have randomly generated URL path segments, such as /api/product/{id}.
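For illustration, a typical filter of this kind might look like the sketch below (the calls table and the timestamp and http_url column names are simplified here and may not match the real schema):

SELECT count() AS call_count
FROM calls
WHERE timestamp >= now() - INTERVAL 1 DAY
  AND http_url LIKE '%/api/product/%'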

Choose the data skipping index

The ClickHouse MergeTree table engine provides a few data skipping indexes, which make queries faster by skipping data granules and reducing the amount of data to read from disk. A granule is the smallest indivisible data set that ClickHouse reads when selecting data. ngrambf_v1 and tokenbf_v1 are two interesting indexes that use bloom filters for optimizing the filtering of strings. A bloom filter is a space-efficient probabilistic data structure that allows you to test whether an element is a member of a set.

ngrambf_v1

A string is split into substrings of “n” characters. For example, the n=3 ngrams (trigrams) of 'hello world' are ['hel', 'ell', 'llo', 'lo ', 'o w', …]. The ngrams of each column value are stored in the bloom filter.

When searching with a filter such as column LIKE '%hello%', the string in the filter is also split into ngrams (['hel', 'ell', 'llo']) and a lookup is done for each value in the bloom filter. If all of them are present in the bloom filter, the block may contain the searched string and needs to be read.

Functions with a constant argument that’s shorter than the ngram size can’t be used by ngrambf_v1 for query optimization. For example, searching for 'hi' won’t trigger an ngrambf_v1 index with n=3. A small “n” allows you to support shorter searched strings. But a small “n” also produces more ngram values, which means more hashing and eventually more false positives. A false positive means reading a block that doesn’t contain any rows matching the searched string.

Since false positive matches are possible in bloom filters, the index can’t be used when filtering with negative operators, such as column_name != 'value' or column_name NOT LIKE '%hello%'.

tokenbf_v1

tokenbf_v1 splits the string into tokens separated by non-alphanumeric characters and stores the tokens in the bloom filter. 'Hello world' is split into two tokens: ['hello', 'world'].

In addition to the limitation of not supporting negative operators, the searched string must contain at least one complete token. In the previous example, searching for 'hel' won’t trigger the index.

Once we understand how each index behaves, tokenbf_v1 turns out to be the better fit for indexing HTTP URLs, because HTTP URLs are typically path segments separated by /. Each path segment is stored as a token. Splitting the URLs into ngrams would lead to many more substrings to store, so the index would need to be larger and the lookup would be less efficient.

Configure the index

The tokenbf_v1  index needs to be configured with a few parameters. First, the index granularity specifies how many granules of data will be indexed together in a single block using a bloom filter. The entire block will be skipped or not depending on whether the searched value appears in the block. The number of rows in each granule is defined by the index_granularity setting of the table. Increasing the granularity would make the index lookup faster, but more data might need to be read because fewer blocks will be skipped.

We also need to estimate the number of tokens in each granule of data. In our case, the number of tokens corresponds to the number of distinct path segments. Then we can use a bloom filter calculator: after fixing “n” (the number of token values), “p” (the false positive rate) and “k” (the number of hash functions), it gives us the required size of the bloom filter.
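As a rough sketch of that estimation (the calls table, the timestamp column and the figure of 10,000 tokens per block are illustrative assumptions, not our production values), both numbers can be checked directly in ClickHouse:

-- approximate number of distinct tokens (path segments) in one hour of data
SELECT uniq(token) AS distinct_tokens
FROM
(
    SELECT arrayJoin(tokens(lowerUTF8(http_url))) AS token
    FROM calls
    WHERE timestamp >= now() - INTERVAL 1 HOUR
)

-- expected false positive rate p ≈ (1 - e^(-k*n/m))^k for a 10,240-byte filter
-- (m = 81,920 bits), k = 3 hash functions and an assumed n = 10,000 tokens per block
SELECT pow(1 - exp(-3 * 10000 / 81920), 3) AS false_positive_rate  -- ≈ 0.029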

The index can be created on a column or on an expression if we apply some functions to the column in the query. In our case, searching for HTTP URLs isn’t case sensitive so we have created the index on lowerUTF8(http_url).

The final index creation statement looks something like this:

ADD INDEX IF NOT EXISTS tokenbf_http_url_index lowerUTF8(http_url) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
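Spelled out in full, and assuming the table is simply named calls (the real table name is omitted above), this becomes:

ALTER TABLE calls
    ADD INDEX IF NOT EXISTS tokenbf_http_url_index lowerUTF8(http_url) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

The three tokenbf_v1 parameters are the size of the bloom filter in bytes, the number of hash functions and the random seed of the hash functions.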
Result

Index size

The size of the tokenbf_v1  index before compression can be calculated as follows:

index_size = bloom_filter_size x number_of_blocks
number_of_blocks = number_of_rows / (table_index_granularity x tokenbf_index_granularity)
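As a worked example with round, illustrative numbers (not our production figures), take a partition of 1 billion rows, the default table index_granularity of 8192 and the tokenbf index granularity of 4 chosen above:

number_of_blocks = 1,000,000,000 / (8192 x 4) ≈ 30,518
index_size ≈ 30,518 x 10,240 bytes ≈ 312 MB before compression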

You can check the size of the index file in the directory of the partition in the file system. The file is named skp_idx_{index_name}.idx.
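On recent ClickHouse versions, the index size can also be queried from the system.data_skipping_indices table; the exact columns depend on the server version, but something like the following should work (table name assumed to be calls):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.data_skipping_indices
WHERE table = 'calls' AND name = 'tokenbf_http_url_index'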

In our case, the size of the index on the HTTP URL column is only 0.1% of the disk size of all data in that partition.

Query speed

The query speed depends on two factors: the index lookup and how many blocks can be skipped thanks to the index.

According to our testing, the index lookup time isn’t negligible. For example, it can take up to a few seconds on our data set if the index granularity is set to 1. We decided to set the index granularity to 4 to get the index lookup time down to within a second on our data set.

The number of blocks that can be skipped depends on how frequently the searched data occurs and how it’s distributed in the table. Our calls table is sorted by timestamp, so if the searched call occurs very regularly in almost every block, then we’ll barely see any performance improvement because no data is skipped. On the contrary, if the call matching the query only appears in a few blocks, a very small amount of data needs to be read, which makes the query much faster.
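A convenient way to see both effects on a specific query is to inspect the query plan with skip-index details enabled (available in recent ClickHouse versions; the query below is only a sketch with assumed table and column names):

EXPLAIN indexes = 1
SELECT count()
FROM calls
WHERE timestamp >= now() - INTERVAL 1 DAY
  AND lowerUTF8(http_url) LIKE '%/api/product/%'

For each skip index, the plan reports how many parts and granules remain after filtering, which shows directly how much data the index allowed ClickHouse to skip.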

Optimize filtering on an HTTP header

Now that we’ve looked at how to use the ClickHouse Data Skipping Index to optimize query filtering on a simple string tag with high cardinality, let’s examine how to optimize filtering on an HTTP header, which is a more advanced tag, consisting of both a key and a value.

In ClickHouse, key-value pair tags are stored in two Array(LowCardinality(String)) columns. For example, given a call with Accept=application/json and User-Agent=Chrome headers, we store [Accept, User-Agent] in the http_headers.key column and [application/json, Chrome] in the http_headers.value column.
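As a minimal sketch of that layout (simplified; the real table contains many more columns and a different sorting key):

CREATE TABLE calls
(
    timestamp DateTime,
    http_url String,
    `http_headers.key` Array(LowCardinality(String)),
    `http_headers.value` Array(LowCardinality(String))
)
ENGINE = MergeTree
ORDER BY timestamp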

When filtering by a key-value pair tag, the key must be specified and we support filtering the value with different operators, such as EQUALS, CONTAINS or STARTS_WITH (e.g., call.http.headers.Accept EQUALS application/json). This filter is translated into a ClickHouse expression:

arrayExists((k, v) -> lowerUTF8(k) = 'accept' AND lowerUTF8(v) = 'application/json', http_headers.key, http_headers.value)
Choose and configure the index

We can add indexes to both the key and the value column. tokenbf_v1 and ngrambf_v1 indexes don’t support Array columns. The bloom_filter index looks to be the best candidate since it supports array functions, such as IN or has. The limitation of the bloom_filter index is that it only supports filtering values with the EQUALS operator, which matches a complete string.

The bloom_filter index requires less configuration. Its only parameter, false_positive, is optional and defaults to 0.025. Reducing the false positive rate will increase the bloom filter size. Since filtering on key-value pair tags is also case insensitive, the indexes are created on the lowercased expressions:

ADD INDEX bloom_filter_http_headers_key_index arrayMap(v -> lowerUTF8(v), http_headers.key) TYPE bloom_filter GRANULARITY 4,
ADD INDEX bloom_filter_http_headers_value_index arrayMap(v -> lowerUTF8(v), http_headers.value) TYPE bloom_filter GRANULARITY 4,

This way, the indexes will be triggered when filtering with an expression such as has(arrayMap(v -> lowerUTF8(v), http_headers.key), 'accept').

When filtering on both key and value, such as call.http.header.accept=application/json, it would be more efficient to trigger the index on the value column because it has higher cardinality. The index on the key column can be used when filtering only on the key (e.g., if call.http.header.accept is present).
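One way to express such a combined filter so that the value index can still prune blocks (a sketch of the idea, not necessarily our exact production query) is to keep the precise arrayExists condition and add a logically redundant has() clause written in the same form the index was built on:

SELECT count()
FROM calls
WHERE has(arrayMap(v -> lowerUTF8(v), http_headers.value), 'application/json')
  AND arrayExists((k, v) -> lowerUTF8(k) = 'accept' AND lowerUTF8(v) = 'application/json',
                  http_headers.key, http_headers.value)

The has() clause doesn’t change the result set, but because it matches the indexed expression exactly, blocks whose value column can’t contain 'application/json' are skipped before the more expensive arrayExists is evaluated.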

Our pragmatic ClickHouse rollout

Adding an index can be easily done with the ALTER TABLE ADD INDEX statement. After the index is added, only new incoming data will get indexed. ClickHouse provides the ALTER TABLE [db.]table MATERIALIZE INDEX name IN PARTITION partition_name statement to rebuild the index in an existing partition. But this process would generate additional load on the cluster, which may degrade the performance of writing and querying data. We decided not to do it and just wait seven days until all our calls data gets indexed.
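For reference, materializing the index for a single existing partition would look something like this (the partition expression here is hypothetical and depends on the table’s partition key):

ALTER TABLE calls MATERIALIZE INDEX tokenbf_http_url_index IN PARTITION '2022-06-13';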

Try index skipping yourself

We have spent quite some time testing the best configuration for the data-skipping indexes. But once we understood how they work and which one is better suited to our data and use case, we could easily apply the same approach to many other columns.

The bloom_filter index and its two variants, ngrambf_v1 and tokenbf_v1, all have some limitations. They don’t support filtering with all operators. The performance improvement depends on how frequently the searched data occurs and how it’s spread across the whole data set, so it’s not guaranteed for all queries.

Ultimately, I recommend you try the data-skipping index yourself to improve the performance of your ClickHouse queries, especially since it’s relatively inexpensive to put in place. It only takes a bit more disk space, depending on the configuration, and it could speed up the query by four to five times, depending on the amount of data that can be skipped. But test it to make sure that it works well for your own data. If it works for you—great. If not, pull it back or adjust the configuration. We also hope ClickHouse continuously improves these indexes and provides a means to get more insight into their efficiency, for example, by adding the index lookup time and the number of granules dropped to the query log.

How IBM Instana can help optimize the ClickHouse Data Skipping Index

In conclusion, optimizing query performance with the ClickHouse Data Skipping Index can be a complex and time-consuming process. However, with the help of the IBM Instana AI-powered monitoring and observability platform, the process can be simplified and streamlined. IBM Instana provides real-time visibility into the performance of your ClickHouse clusters, including detailed metrics on query execution times and resource utilization. This visibility enables you to identify and resolve performance issues quickly and efficiently and optimize your data infrastructure for maximum efficiency.

Moreover, the IBM Instana automatic root cause analysis and machine learning algorithms can help you identify the root cause of any performance issues related to ClickHouse Data Skipping Index optimization. These features reduce the time and effort required to achieve optimal query performance and enable you to focus on more strategic activities. With IBM Instana, you can take full advantage of the ClickHouse Data Skipping Index, achieve optimal query performance and drive better business outcomes. So, if you want to maximize the performance of your ClickHouse data infrastructure, IBM Instana is the ideal solution for you.

Author
Yiquan Zhou Senior Software Engineer, Instana