RANK or ROW_NUMBER in BigQuery over a large dataset - McMap

About

RANK or ROW_NUMBER in BigQuery over a large dataset

Asked 20/10, 2015 at 22:11 Answered 21/10, 2015 at 2:29

Solved google-bigquery row-number large-data

H

1

4

I need to add row numbers to a large (ca. billion rows) dataset in BigQuery. When I try:

SELECT 
  *
  ROW_NUMBER() OVER (ORDER BY d_arf DESC) plarf 
FROM [trigram.trigrams8]

I get "Resources exceeded during query execution.", because an analytic/window function needs to fit in one node.

How can I add row numbers to a large dataset in BigQuery?

Heave answered 20/10, 2015 at 22:11 Comment(0)

A

7

You didn't give me a working query, so I had to create my own, so you'll need to translate it to your own problem space. Also I'm not sure why do you want to give a row number to each row in such a huge dataset, but challenge accepted:

SELECT a.enc, plarf, plarf+COALESCE(INTEGER(sumc), (0)) row_num
FROM (
  SELECT STRING(year)+STRING(month)+STRING(mother_age)+state enc, 
         ROW_NUMBER() OVER (PARTITION BY year ORDER BY enc) plarf,
         year
  FROM [publicdata:samples.natality] ) a
LEFT JOIN (
  SELECT COUNT(*) c, year+1 year, SUM(c) OVER(ORDER BY year) sumc
  FROM [publicdata:samples.natality] 
  GROUP BY year
) b
ON a.year=b.year

I want to do a ROW_NUMBER() OVER(), but I can't because there are too many elements.
Having an OVER(PARTITION) fixes this issue, but now each partition starts with 1.
But that's OK. On another subquery I will count how many elements are there in each partition.
And the surrounding query will take the row_number of each partition, and add it to the local-to-the-partition count.
Ta da.

Augusto answered 21/10, 2015 at 2:29 Comment(2)

Wow, neat solution, thanks! I wanted to be able to quickly poke at certain parts of the data when ordered by one column. This seems to solve it. – Heave 24/10, 2015 at 17:51

I like this approach although one problem I have with this query is that over a large dataset partitioned by territory (say) is that the generated row_num is liable to generate duplicates whenever the partitions are of similar sizes. The data I am working with is a daily data set which does not allow for partitioning by any date components. – Cockscomb 3/1, 2018 at 17:15

Recommended topics

#Godot #Unity #Godot 4.X #Mongodb

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

© 2022 - 2024 — McMap. All rights reserved.