How to transpose rows to columns with a large amount of data in BigQuery/SQL?
I have a problem transposing a large data table (1.5 billion rows) in BigQuery from rows to columns. I could figure out how to do it with a small amount of data when hardcoded, but not at this scale. A snapshot of the table looks like this:

+------------+---------+-------+
| CustomerID | Feature | Value |
+------------+---------+-------+
| 1          | A123    | 3     |
| 1          | F213    | 7     |
| 1          | F231    | 8     |
| 1          | B789    | 9.1   |
| 2          | A123    | 4     |
| 2          | U123    | 4     |
| 2          | B789    | 12    |
| ..         | ..      | ..    |
| ..         | ..      | ..    |
| 400000     | A123    | 8     |
| 400000     | U123    | 7     |
| 400000     | R231    | 6     |
+------------+---------+-------+

So basically there are approximately 400,000 distinct CustomerIDs and 3,000 features, and not every CustomerID has the same features: some CustomerIDs may have 2,000 features while others have 3,000. In the result table I would like, each row represents one distinct CustomerID and there are 3,000 columns, one per feature. Like this:

CustomerID Feature1 Feature2 ... Feature3000

So some of the cells may have missing values.
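For a handful of known features, the hardcoded version mentioned above would look something like this in BigQuery legacy SQL (a sketch, assuming the table is named yourTable and using three of the feature codes from the snapshot; MAX() collapses the one-row-per-feature layout into one row per customer):

```sql
SELECT
  CustomerID,
  MAX(IF(Feature = 'A123', Value, NULL)) AS A123,
  MAX(IF(Feature = 'F213', Value, NULL)) AS F213,
  MAX(IF(Feature = 'B789', Value, NULL)) AS B789
FROM yourTable
GROUP BY CustomerID
```

The question is how to do this without hand-writing 3,000 such columns.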

Does anyone have an idea how to do this in BigQuery or SQL?

Thanks in advance.

Symphonic answered 14/1, 2016 at 19:40 Comment(0)
STEP #1

In the query below, replace yourTable with the real name of your table and execute/run it:

-- Legacy SQL: builds the pivot query as a string, generating one
-- MAX(IF(...)) expression per distinct Feature value
SELECT 'SELECT CustomerID, ' + 
   GROUP_CONCAT_UNQUOTED(
      'MAX(IF(Feature = "' + STRING(Feature) + '", Value, NULL))'
   ) 
   + ' FROM yourTable GROUP BY CustomerID'
FROM (SELECT Feature FROM yourTable GROUP BY Feature) 

As a result you will get a string to be used in the next step!
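For the sample data above, the generated string would look something like this (truncated to three features for illustration; the real string will contain one MAX(IF(...)) column per distinct Feature value, so around 3,000 of them):

```sql
SELECT CustomerID,
  MAX(IF(Feature = "A123", Value, NULL)),
  MAX(IF(Feature = "F213", Value, NULL)),
  MAX(IF(Feature = "B789", Value, NULL))
FROM yourTable GROUP BY CustomerID
```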

STEP #2

Take the string you got from Step 1 and just execute it as a query.
The output is the pivot you asked for in the question.

Monostylous answered 14/1, 2016 at 20:8 Comment(4)
Thanks very much! I tried it; however, when I run the query from Step 2, I get an error saying "Resources exceeded during query execution." I guess it could be due to the GROUP BY taking a lot of memory. Is there a workaround for this? – Symphonic
I would recommend starting by limiting/lowering the number of features. You can control this in the subquery in Step 1. – Monostylous
If GROUP BY gives you trouble, try GROUP EACH BY. – Sanskrit
@Symphonic - please see #34846197 for more recommendations on addressing the "Resources exceeded during query execution." error. – Monostylous

Hi @Jade, I posted a very similar question before and got a very helpful (and similar) answer from @MikhailBerlyant. For what it's worth, I had about 4,000 features to dummify in my case and also ran into the "Resources exceeded during query execution" error.

I think this type of large-scale data transformation (rather than query) is better left to tools more suitable for the task, such as Spark.
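In Spark the same reshape is a built-in pivot; a minimal Spark SQL sketch (Spark 2.4+), assuming the data has been loaded into a table named customer_features with the same three columns:

```sql
-- One output column per listed feature; CustomerID becomes the
-- implicit grouping column since it is not referenced in the PIVOT clause.
SELECT * FROM customer_features
PIVOT (
  MAX(Value)
  FOR Feature IN ('A123', 'F213', 'F231', 'B789')
)
```

With a DataFrame API pivot you can also omit the explicit feature list and let Spark discover the distinct values, which avoids hardcoding 3,000 names.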

Gina answered 15/1, 2016 at 18:51 Comment(0)
