Clickhouse running diff with grouping
Asked Answered
G

3

7

General Task

A table consists of three columns (time, key, value). The task is to calculate a running difference for each key. So, from input

---------------
| time | key | value |
---------------
| 1    | A   | 4     |
| 2    | B   | 1     |
| 3    | A   | 6     |
| 4    | A   | 7     |
| 5    | B   | 3     |
| 6    | B   | 7     |

it is desired to get

----------------------
| key | value | delta |
----------------------
| A   | 4     |  0    |
| B   | 1     |  0    |
| A   | 6     |  2    |
| A   | 7     |  1    |
| B   | 3     |  2    |
| B   | 7     |  4    |

Approaches

  1. runningDifference function. Works, if the key is fixed. So we can

    select *, runningDifference(value) from
        (SELECT key, value from table where key = 'A' order by time)
    

    Note that subquery is necessary here. This solution suffers when you want to get this for different keys

  2. groupArray.

    select key, groupArray(value) from
        (SELECT key, value from table order by time)
        group by key
    

    So, now we get a key and an array of elements with this key. Good.

    But how to calculate a sliding difference? If we could do that, then ARRAY JOIN would lead us to a result.

    Or we can even zip the array with itself and then apply lambda (we have arrayMap for that) but... we don't have any zip alternative.

Any ideas? Thanks in advance.

Germanophobe answered 15/8, 2018 at 9:43 Comment(0)
R
9

Solution with arrays:

WITH 
   groupArray(value) as time_sorted_vals,
   arrayEnumerate(time_sorted_vals) as indexes,
   arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT 
   key,
   running_diffs
FROM 
    (SELECT key, value from table order by time)
GROUP by key

Other option (doing sort inside each group separately, which is more optimal in a lot of cases)

WITH
   groupArray( tuple(value,time) ) as val_time_tuples,
   arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
   arrayMap( t -> t.1, indexes) as time_sorted_vals,
   arrayEnumerate(time_sorted_vals) as indexes,
   arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT 
   key,
   running_diffs
FROM 
   time
GROUP by key

and you can apply ARRAY JOIN on the result afterward.

Recourse answered 16/8, 2018 at 9:35 Comment(3)
Hi, I was wondering how would the query look like if in the result we also wanted to include time column, meaning: | time | key | value | delta |?Hypochlorite
This answer was posted years ago, for today Sep 29th, 2021, we can use arrayDifference instead of arrayMap. And we can ARRAY JOIN so that we can get a tabulated result instead of a nested array. That is:Tiedeman
see my posted answer.Tiedeman
T
4

This question was posted years ago, for today, Sep 29th, 2021, we can use arrayDifference instead of arrayMap. And we can ARRAY JOIN so that we can get a tabulated result instead of a nested array.

SELECT key, sorted_time, time_sorted_vals, running_diffs
FROM (
WITH
   groupArray( tuple(value,time) ) as val_time_tuples,
   arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
   arrayMap( t -> t.1, val_time_tuples_sorted) as time_sorted_vals,
   arrayMap( t -> t.2, val_time_tuples_sorted) as sorted_time,
   arrayDifference(time_sorted_vals) as running_diffs
SELECT 
   key,
   sorted_time,
   time_sorted_vals,
   running_diffs
FROM 
   table_name
GROUP by key)
ARRAY JOIN sorted_time, time_sorted_vals, running_diffs

The only restriction is that the value column should not be of nullable types.

Tiedeman answered 29/9, 2021 at 1:3 Comment(0)
C
0

Lately I've also encountered the problem and Clickhouse offers function arrayDifference.

WITH 
   groupArray(value) as vals
   arrayDifference(vals) as running_diffs
SELECT 
   key,
   running_diffs
FROM 
    (SELECT key, value from table order by time)
GROUP by key
Cross answered 18/5, 2020 at 9:13 Comment(1)
Hello, yeah it's been a long time since the question was asked. Let me point that your answer doesn't solve the original question: your query prodice an array of diffs per each key, while it is desired to get the overall order persisited, a simple array join would not give you such a result.Germanophobe

© 2022 - 2024 — McMap. All rights reserved.