How to delete duplicate rows in SQL ( Clickhouse)?
Asked Answered
K

2

7

so I created a table using clickhouse, but it has duplicates in it.

The following query gives me the duplicates in my table

select *, count() AS cnt from my_table   GROUP BY *
HAVING cnt > 1 

In clickhouse apparently you need do this through altering the table: https://clickhouse.com/docs/en/sql-reference/statements/alter/delete/

so, I tried the following:

ALTER TABLE my_table DELETE WHERE (select *, count() AS cnt from my_table  GROUP BY *
HAVING cnt > 1 ); 

But I am getting the following error:

Exception: The argument of function isZeroOrNull must have simple numeric type, possibly Nullable:

Anyone came across this issue before with clickhouse?

In this video, they explicitly mention clickhouse is not the best with such operations : https://www.youtube.com/watch?v=FsVrFbcyb84&t=1865s

But I am wondering if someone figured a solution

Komsomolsk answered 31/3, 2022 at 3:31 Comment(0)
J
15

First of all, the answer depends on the table engine you used. The most common on ClickHouse is the MergeTree family.

If you use any MergeTree family tables, MaterializedView or Buffer engines, you can use an OPTIMIZE query:

OPTIMIZE TABLE table DEDUPLICATE BY name -- you can put any expression here

https://clickhouse.com/docs/en/sql-reference/statements/optimize/

Before you consider the above query as the answer, you must understand why and why it's not the right way to do it.

In Clickhouse it's normal to have multiple lines for the same primary key, un-like most DB engine, there is no check at all when inserting a line. This allow very fast insertion in tables.

The name "MergeTree" is not here for nothing, in fact the tables are "OPTIMIZED" automatically when Clickhouse thinks its necessary or/and if it have the time for.

What means OPTIMIZE in ClickHouse ? This operation just force the table to merge it's data. Depending on how you build your table. ClickHouse will look for duplicated line, based on your settings and apply the function you asked for.

Two example :

  • ReplacingMergeTree, here the optional parameter is set to datetime, and give the hint to ClickHouse which line is the most recent. Then on duplicates, the most recent is kept over the others.
create table radios
(
    id                UInt64,
    datetime          DateTime,
    name              Nullable(String) default NULL
)
    engine = ReplicatedReplacingMergeTree(datetime)
    ORDER BY id -- it's the primary key
-- example
INSERT INTO radios VALUES (1, now(), 'Some name'), (1, now(), 'New name')
-- after merging:
id,              datetime,       name
 1, '2022-04-04 15:15:00', 'New name'
  • AggregatingMergeTree, here a function is applied the compute the final line. This is what you will find the closest to a UPDATE statement.
create table radio_data
(
    datetime                        DateTime,
    id                              UInt64,
    power                           SimpleAggregateFunction(anyLast, Nullable(Float64)) default NULL,
    access                          SimpleAggregateFunction(sum, Nullable(UInt64))    default NULL
)
    engine = ReplicatedAggregatingMergeTree()
        ORDER BY (id, datetime) -- the primary key

-- example
INSERT INTO radio_data VALUES ('2022-04-04 15:15:00', 1, NULL, 1), ('2022-04-04 15:15:00', 1, 12, 2)
-- will give after merging :
datetime           , id, power, access
2022-04-04 15:15:00,  1,    12,      3

The table you choose, the functions you choose, must be really close to what you finally want to do with you data. Do you replace all the line on update ? Then ReplacingMergeTree is the best, do you update partially a line and apply some function on it ? Then AggregatingMergeTree is the best... ect.

This said, you will have some cases where you need to have your data "fresh" and not duplicated. When your table is well configured, a simple OPTIMIZE TABLE ... is enough. BUT this is expensive, and must be done smartly if you don't want to ruins your server performance. You can also merge the data on the fly, but again, this is expensive and must be done a small subset of data, otherwise it's better to do an OPTIMIZE.

SELECT * FROM radio_data FINAL WHERE id = 1

For instance, we do an OPTIMIZE on all the un-merged partition that are "in the past", for example on the previous day. The goal is to do it the least as possible OPTOIMIZE operation.

My last words will be on the usage of ALTER TABLE statement. It allows DELETE and UPDATE. But they are mutations (https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations) and are not synchronous ! Don't rely on them if you require fresh data.

You can find more material here :

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree https://clickhouse.com/docs/en/sql-reference/statements/optimize/ https://clickhouse.com/docs/en/sql-reference/statements/alter/

Joijoice answered 4/4, 2022 at 13:43 Comment(0)
C
0

This works for me (on MergeTree engine)

OPTIMIZE TABLE <table> [on cluster ...] final DEDUPLICATE BY <partition by cols ...>, <order by cols>
Criminality answered 19/9, 2023 at 18:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.