Sorting within collect_list() in hive
Asked Answered
I

4

8

Let's say I have a hive table that looks like this:

ID    event    order_num
------------------------
A      red         2
A      blue        1
A      yellow      3
B      yellow      2
B      green       1
...

I'm trying to use collect_list to generate a list of events for each ID. So something like the following:

SELECT ID, 
collect_list(event) as events_list,
FROM table
GROUP BY ID;

However, within each of the IDs that I group by, I need to sort by order_num. So that my resulting table would look like this:

ID    events_list
------------------------
A      ["blue","red","yellow"]
B      ["green","red"]

I can't do a global sort by ID and order_num before the collect_list() query because the table is massive. Is there a way to sort by order_num within collect_list?

Thanks!

Ionia answered 8/6, 2018 at 18:47 Comment(2)
sorting by id and order_num and performing a collect_list might be the only way.Abigailabigale
I hope that's not the case! I know in google bigquery, you can do things like this: STRING_AGG(event, "" ORDER BY order_num ASC) AS event_listIonia
I
7

So, I found the answer here. The trick is to use a subquery with a DISTRIBUTE BY and SORT BY statement. See below:

WITH table1 AS (
    SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
    SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
    SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
    SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
    SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)

-- Collect it
SELECT subquery.ID, 
collect_list(subquery.event) as events_list
FROM (
SELECT
        table1.ID,
        table1.event,
        table1.order_num
    FROM table1
    DISTRIBUTE BY
        table1.ID
    SORT BY
        table1.ID,
        table1.order_num
) subquery
GROUP BY subquery.ID;
Ionia answered 9/6, 2018 at 22:4 Comment(1)
This method may not work as you wished in some big data compute engine.Godly
T
5

this my first answer question of stack overflow. but the answer is very very userful.

WITH table1 AS (
    SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
    SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
    SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
    SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
    SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)

select ID 
       ,sort_array(collect_list(struct(order_num, item_score))).col2 as item_list
from (
        select ID
               ,event
               ,order_num
               ,concat(event, ':', order_num) as item_score
        from table1
     ) t0 
group by ID
Truncated answered 1/6, 2022 at 8:0 Comment(2)
It is indeed very useful, and doesn't rely on SORT BY in DISTRIBUTE BY. I think it is the only correct answer here!Ephraimite
I tried this approach, and unfortunately something doesn't work well for me: I get result list where ID is not unique, but it should be because of group by ID. I have no idea why it is not workingEphraimite
H
4

The function sort_array() should sort the collect_list() items

select ID, sort_array(collect_list(event)) as events_list,
from table
group by ID;
Healy answered 8/6, 2018 at 22:9 Comment(1)
But can it sort the array by another column (e.g. the order_num column)? It doesn't seem like it.Ionia
C
1

Try the following:

WITH tmp AS (
  SELECT * FROM data DISTRIBUTE BY ID SORT BY ID, order_num desc
)
SELECT ID, collect_list(event)
FROM tmp
GROUP BY ID
Corroboration answered 7/8, 2019 at 21:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.