ClickHouse: How to store JSON data the right way?

Asked 30/9, 2020 at 6:19 Answered 17/4, 2022 at 1:1

Solved sql clickhouse yandex yandex-metrika

I'm going to migrate data from PostgreSQL database to Yandex's ClickHouse. One of the fields in a source table is of type JSON - called additional_data. So, PostgreSQL allows me to access json attributes during e.g. SELECT ... queries with ->> and -> and so on.

I need the same behavior to persist in my resulting table in ClickHouse storage. (i.e. the ability to parse JSON during select queries and/or when using filtering and aggregation clauses)

Here is what I've done during CREATE TABLE ... in ClickHouse client:

create table if not exists analytics.events
(
    uuid UUID,
    ...,
    created_at DateTime,
    updated_at DateTime,
    additional_data Nested (
        message Nullable(String),
        eventValue Nullable(String),
        rating Nullable(String),
        focalLength Nullable(Float64)
        )
)
engine = MergeTree

ORDER BY (uuid, created_at)
PRIMARY KEY uuid;

Is that a good choice how to store JSON-serializable data? Any Ideas?

Maybe It's better to store a JSON data as a plain String instead of Nested and playing with It using special functions?

Merited answered 30/9, 2020 at 6:19 Comment(2)

Does the structure of json-document is fixed and won't be changed? – Waken 30/9, 2020 at 6:57

@Waken It can be altered with some new properties in future. – Merited 30/9, 2020 at 7:24

Although ClickHouse uses the fast JSON libraries (such as simdjson and rapidjson) to parsing I think the Nesting-fields should be faster.
If the JSON structure is fixed or be changed predictably try to consider the way of denormalizing data:

..
    created_at DateTime,
    updated_at DateTime,
    additional_data_message Nullable(String),
    additional_data_eventValue Nullable(String),
    additional_data_rating Nullable(String),
    additional_data_focalLength Nullable(Float64)
..

On one hand, it can significantly increase the count of rows and disk space, on another side, it should give a significant increase in performance (especially in the right indexing). Moreover, the disk size can be reduced using LowCardinality-type and Codecs.

Some others remarks:

avoid to use Nullable types, prefer to use some replacement such as '', 0, etc (see explanation Clickhouse string field disk usage: null vs empty)
UUID type doesn't give index monotonicity, this one should be much better (More secrets of ClickHouse Query Performance):

..
ORDER BY (created_at, uuid);

consider using Aggregating-engines to significantly increase the speed of calculation aggregated values

In any case before making a final decision need to do manual testing on a data subset (this applies as to choose the schema (json as string/Nested type/denormalized way), as choosing the column codec).

Waken answered 30/9, 2020 at 7:51 Comment(0)

As of 22.3, there is a JSON data type https://clickhouse.com/blog/clickhouse-newsletter-april-2022-json-json-json/

Darya answered 17/4, 2022 at 1:1 Comment(1)

Although this does require the allow_experimental_object_type to be SET. Can be done by SET allow_experimental_object_type = 1; – Minutia 25/7, 2023 at 23:30

Recommended topics

Hot tags