Return Top N rows for each group (Vertica/vsql)
Asked Answered
I

2

6

Familiar question, but with Vertica. I'd like to return the top 5 geo_country rows based on sum(imps) for each tag_id. This is the query I started:

SELECT tag_id,
       geo_country,
       SUM(imps) AS imps,
       RANK() OVER (PARTITION BY tag_id ORDER BY SUM(imps) DESC) AS rank
FROM table1
WHERE tag_id IN (2013150,1981153)
AND ymd > CURRENT_DATE - 3
GROUP BY 1,
         2 LIMIT 10;

This actually returns only rows from the first tag in the WHERE clause (2013150). I know that the other tag has sum(imps) values high enough which should include it in the results.

Also, how do I implement the Top N part? I tried adding a LIMIT clause within the OVER function, but it doesn't look like it is an accepted parameter.

Indoors answered 9/12, 2013 at 20:3 Comment(0)
I
10

Solved. The solution is to convert the query to a subquery and then use the WHERE clause to filter by rank:

SELECT * 
FROM (SELECT tag_id, geo_country, sum(imps),
    RANK() OVER (PARTITION BY tag_id ORDER BY SUM(imps) DESC) AS rank 
    FROM table1
    WHERE tag_id IN (2013150,1981153)
    AND ymd > CURRENT_DATE - 3
    GROUP BY 1,2) as t2
WHERE t2.rank <=5;
Indoors answered 10/12, 2013 at 0:10 Comment(3)
This is the pattern I use every time, but it's so unsatisfying for some reason. I guess I wish there was something akin to HAVING for analytic queries?Shoplifter
Consider accepting your answer to complete the question.Giovanna
@Shoplifter that's not how analytic queries work nor would it follow SQL standards.Giovanna
W
0

I think what is happening here is that the group by orders your data on tag_id and then geo_country. Doing a limit then takes the first 10 records. If there are at least 10 geo_countries for tag_id 1 then you will only see tag_id 1 in your result. Wouldn't sorting on the rank ASC solve your problem.

I'm not sure if using rank in a sort is allowed though in Vertica.

SELECT tag_id,
   geo_country,
   SUM(imps) AS imps,
   RANK() OVER (PARTITION BY tag_id ORDER BY SUM(imps) DESC) AS rank
FROM table1
WHERE tag_id IN (2013150,1981153)
AND ymd > CURRENT_DATE - 3
GROUP BY 1,
         2
ORDER BY 4
LIMIT 10;
Wynnie answered 15/7, 2016 at 9:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.