Big query distinct on and group by
Asked Answered
H

2

10

Following on from Select first row in each GROUP BY group? I am trying to do a very similar thing in Google big query.

Dataset: fh-bigquery:reddit_comments.2018_01

Aim: For each link_id (Reddit submission) select the first comment in terms of created_utc

SELECT body,link_id 
FROM [fh-bigquery:reddit_comments.2018_01] 
where subreddit_id == "t5_2zkvo"  
group by  link_id ,body, created_utc  
order by link_id ,body,  created_utc desc 

At the moment it's not working, because it's still not giving me unique/distinct parent_id(s)

Please and thank you!


EDIT: I was incorrect when saying that parent_id is == to submission, it's actually link_id

Hiss answered 15/6, 2018 at 9:49 Comment(0)
G
6

We can use ROW_NUMBER() here:

SELECT body, parent_id, created_utc
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY created_utc) rn
    FROM [fh-bigquery:reddit_comments.2018_01]
    WHERE subreddit_id = 't5_2zkvo'
) t
WHERE rn = 1
ORDER BY parent_id ,body, created_utc DESC;

Note that you could continue with your current approach, but then you would have to phrase the query as a join between your table and a subquery which finds the earliest entry for each comment:

SELECT t1.*
FROM [fh-bigquery:reddit_comments.2018_01] t1
INNER JOIN
(
    SELECT parent_id, MIN(created_utc) AS first_created_utc
    FROM [fh-bigquery:reddit_comments.2018_01]
    GROUP BY parent_id
) t2
    ON t1.parent_id = t2.parent_id AND t1.created_utc = t2.first_created_utc;
Galactose answered 15/6, 2018 at 9:57 Comment(0)
S
10

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  ARRAY_AGG(body ORDER BY created_utc LIMIT 1)[OFFSET(0)] body, 
  link_id
FROM `fh-bigquery.reddit_comments.2018_01`
WHERE subreddit_id = 't5_2zkvo'
GROUP BY link_id
-- ORDER BY link_id
Stalemate answered 15/6, 2018 at 11:12 Comment(0)
G
6

We can use ROW_NUMBER() here:

SELECT body, parent_id, created_utc
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY created_utc) rn
    FROM [fh-bigquery:reddit_comments.2018_01]
    WHERE subreddit_id = 't5_2zkvo'
) t
WHERE rn = 1
ORDER BY parent_id ,body, created_utc DESC;

Note that you could continue with your current approach, but then you would have to phrase the query as a join between your table and a subquery which finds the earliest entry for each comment:

SELECT t1.*
FROM [fh-bigquery:reddit_comments.2018_01] t1
INNER JOIN
(
    SELECT parent_id, MIN(created_utc) AS first_created_utc
    FROM [fh-bigquery:reddit_comments.2018_01]
    GROUP BY parent_id
) t2
    ON t1.parent_id = t2.parent_id AND t1.created_utc = t2.first_created_utc;
Galactose answered 15/6, 2018 at 9:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.