How to create dummy variable columns for thousands of categories in Google BigQuery?

About

Asked 30/11, 2015 at 23:34 Answered 1/12, 2015 at 0:55

mysql sql google-bigquery dummy-variable

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements.

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

Thanks.

Stuckey answered 30/11, 2015 at 23:34 Comment(2)

It might not be a good idea to get thousands of columns in single BigQuery table. The hard limit is 10,000 columns, but it's best to to get to thousands. What is the original problem you are solving ? – Echoism 1/12, 2015 at 0:5

@MoshaPasumansky I am fine with 10,000 categories. I'm dealing with demographic categories here (~5000 unique categories). I want/need to do it in BigQuery because if I were to dummify (after downloading the table from BigQuery) in Python/R and I have millions of rows (UserID), then it would either take an intolerably long time or crashes. So would you happen to know how to solve this in BigQuery? – Stuckey 1/12, 2015 at 0:16

You can use below "technic"

First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

Resulted will be like below - Query #2

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!

Result of query #2 will looks as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1

Archimedes answered 1/12, 2015 at 0:55 Comment(15)

This is exactly what I'm trying to avoid - having to write thousands of these lines - either CASE WHEN or SUM IF! – Stuckey 1/12, 2015 at 1:5

read again my answer - you DO NOT need to write it manually - just run QUERY #1 and it will produce it for you! – Archimedes 1/12, 2015 at 1:6

Sorry about that! Let me see if I can use this to make it work. @MikhailBerlyant. – Stuckey 1/12, 2015 at 1:9

Sorry for the misunderstanding earlier. I see what you are doing here. Quite clever. But basically I'd still need to paste in a query (result from your query #1) with thousands of lines. I was hoping for a more "elegant"/efficient way so that it can be done in a single query (with UDFs for example) without having to paste in thousands of lines which can be easier for someone else to read and reproduce. @MikhailBerlyant – Stuckey 1/12, 2015 at 1:38

i doubt there is a single query approach for this particular one and still think that in your case above approach is relatively clean and simple for anyone to use - run-copy-paste-run. it does exactly what you asked for in your question :o) – Archimedes 1/12, 2015 at 1:42

Thanks again for your answer Mikhail. Again, it's quite clever. I hazard a guess that with UDF, it might be possible to do dummify using a single query. But I'm still waiting for someone who knows UDF in BigQuery to either confirm or disabuse me of this notion. @MikhailBerlyant. – Stuckey 1/12, 2015 at 2:34

Sure. I am using and quite familiar with BigQuery UDF so will try also to check it again for this task. tomorrow :o) – Archimedes 1/12, 2015 at 2:37

Thanks alot Mikhail!. @MikhailBerylant. – Stuckey 1/12, 2015 at 2:41

@Stuckey - from your comment in #34798744 I have realised you were not able to implement above suggestion because of "Resources exceeded during query execution" error. Please see #34846197 for my detailed recommendations – Archimedes 18/1, 2016 at 1:29

Hi Mikhail Berlyant, in your example, the value of each entry is either 1 or 0, in my case, I have to convert a list of weighted edges (user-item rating) into a adjacency matrix, can you give me an example of how to do that? – Preece 15/6, 2016 at 1:5

Hi @CharlesChow, this is actualy not my example but rather example that is presented in question. I would recommend you to post your specific question along with your example and I (or someone else) will try to answer it – Archimedes 15/6, 2016 at 1:11

Hi Mikhail, I posted my question here: #37825642 thanks – Preece 15/6, 2016 at 2:32

Is this deprecated in Bigquery now? I can't use + between strings. – Ostensory 2/12, 2019 at 5:57

@ArelLin - this answer was for BigQuery Legacy SQL. For Standard SQL you should use CONCAT() – Archimedes 2/12, 2019 at 14:18

@MikhailBerlyant For StandardSQL, what do you mean by you should use CONCAT(), is it as simple as replacing GROUP_CONCAT_UNQUOTED with CONCAT and replacing all the + with a comma, and removing STRING() functions. Would be great if you could update for StandardSQL or at least clarify this. – Yockey 9/1, 2021 at 9:47

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags