Clickhouse: how to use `Data Skipping Indexes` and `Manipulations With Data Skipping Indices` features in clickhouse?
Asked Answered
A

2

7

I'm using the Data Skipping Indexes feature in clickhouse and i got confused about its usage. If i add a data skip index when i create the table like this:

CREATE TABLE MyTable
(
    ...
    INDEX index_time TimeStamp TYPE minmax GRANULARITY 1
)
ENGINE =MergeTree()
...

When i query with TimeStamp filter condition the 'index_time' works. But if i didn't add index when creating table, alternatively, i added the index with Manipulations With Data Skipping Indicesfeature like this:

ALTER TABLEE MyTable ADD INDEX index_time TimeStamp TYPE minmax GRANULARITY 1

Then the index 'index_time' doesn't work.

My database is running on production so i can't recreate the table. I have to use the second way. Can anyone explain why it does not work or i used the feature in a wrong way?

Alost answered 29/8, 2019 at 10:54 Comment(1)
i found the solution. check this github.com/yandex/ClickHouse/issues/6561Alost
M
7

The reason your queries don't use the index after an ALTER TABLE ADD INDEX is because the index does not exist yet. (!)

Any new data will be properly indexed, which is why your index works when you put it in CREATE TABLE. ClickHouse builds the index as you load data. If you created the table, ran ALTER TABLE ADD INDEX, and loaded data you would see the same behavior.

When the data already exist, things are different. ALTER TABLE updates the metadata for the table, but at this point all your data have been written to parts in the table. ClickHouse does not rewrite parts automatically to implement new indexes. However, you should be able to force rewriting to include the index by running:

OPTIMIZE TABLE MyTable FINAL 

See the Github issue https://github.com/yandex/ClickHouse/issues/6561 referenced by Ruijang for more information.

Maraschino answered 3/9, 2019 at 5:43 Comment(0)
S
4

It's quite right that

OPTIMIZE TABLE my_table_name FINAL;

does recreate the indexes set to the table. But there's some scenarios in a columnar DB where you want to avoid rewriting EVERYTHING. If you just add a single index to an already existing table with lots of data when you just rebuilt the new index which includes two steps:

Step 1 - Define the index

Creating the INDEX itself just defines what the index should do, which reflects in Clickhouse as metadata that's added to the table. Thus there is no index build up really, thus nothing will be faster. It's also a lightweight operation as it won't change data or build up any structures beside the table metadata.

It's important to understand any new incoming data will be indexed on insert, but any pre existing data is not included!

ALTER TABLE my_table_name ADD INDEX my_index(my_expression) TYPE minmax GRANULARITY 1

Note Clickhouse can index expressions, so it could simply be the column name as in the question or a more complex expression (e.g. my_index(price * sold_items * revshare)). The index will work on that expression only of course.

Step 2 - Build up (materialize) the index

After creation of the metadata the index for existing data need to be build up. This action is called materialize and needs to be explicitly triggered. Good thing is you can do this individually for any index that was added or changed. This is a heavy operation as it'll trigger work on the database.

ALTER TABLE my_table_name MATERIALIZE INDEX my_index;

Also have a look at the Clickhouse docs for Manipulating Data Skipping Indices

Suck answered 19/1, 2022 at 10:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.