How to create dummy variable columns for thousands of categories in Google BigQuery?
Asked Answered
S

1

13

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements.

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

Thanks.

Stuckey answered 30/11, 2015 at 23:34 Comment(2)
It might not be a good idea to get thousands of columns in single BigQuery table. The hard limit is 10,000 columns, but it's best to to get to thousands. What is the original problem you are solving ?Echoism
@MoshaPasumansky I am fine with 10,000 categories. I'm dealing with demographic categories here (~5000 unique categories). I want/need to do it in BigQuery because if I were to dummify (after downloading the table from BigQuery) in Python/R and I have millions of rows (UserID), then it would either take an intolerably long time or crashes. So would you happen to know how to solve this in BigQuery?Stuckey
A
10

You can use below "technic"

First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

Resulted will be like below - Query #2

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!

Result of query #2 will looks as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1    
Archimedes answered 1/12, 2015 at 0:55 Comment(15)
This is exactly what I'm trying to avoid - having to write thousands of these lines - either CASE WHEN or SUM IF!Stuckey
read again my answer - you DO NOT need to write it manually - just run QUERY #1 and it will produce it for you!Archimedes
Sorry about that! Let me see if I can use this to make it work. @MikhailBerlyant.Stuckey
Sorry for the misunderstanding earlier. I see what you are doing here. Quite clever. But basically I'd still need to paste in a query (result from your query #1) with thousands of lines. I was hoping for a more "elegant"/efficient way so that it can be done in a single query (with UDFs for example) without having to paste in thousands of lines which can be easier for someone else to read and reproduce. @MikhailBerlyantStuckey
i doubt there is a single query approach for this particular one and still think that in your case above approach is relatively clean and simple for anyone to use - run-copy-paste-run. it does exactly what you asked for in your question :o)Archimedes
Thanks again for your answer Mikhail. Again, it's quite clever. I hazard a guess that with UDF, it might be possible to do dummify using a single query. But I'm still waiting for someone who knows UDF in BigQuery to either confirm or disabuse me of this notion. @MikhailBerlyant.Stuckey
Sure. I am using and quite familiar with BigQuery UDF so will try also to check it again for this task. tomorrow :o)Archimedes
Thanks alot Mikhail!. @MikhailBerylant.Stuckey
@Stuckey - from your comment in #34798744 I have realised you were not able to implement above suggestion because of "Resources exceeded during query execution" error. Please see #34846197 for my detailed recommendationsArchimedes
Hi Mikhail Berlyant, in your example, the value of each entry is either 1 or 0, in my case, I have to convert a list of weighted edges (user-item rating) into a adjacency matrix, can you give me an example of how to do that?Preece
Hi @CharlesChow, this is actualy not my example but rather example that is presented in question. I would recommend you to post your specific question along with your example and I (or someone else) will try to answer itArchimedes
Hi Mikhail, I posted my question here: #37825642 thanksPreece
Is this deprecated in Bigquery now? I can't use + between strings.Ostensory
@ArelLin - this answer was for BigQuery Legacy SQL. For Standard SQL you should use CONCAT()Archimedes
@MikhailBerlyant For StandardSQL, what do you mean by you should use CONCAT(), is it as simple as replacing GROUP_CONCAT_UNQUOTED with CONCAT and replacing all the + with a comma, and removing STRING() functions. Would be great if you could update for StandardSQL or at least clarify this.Yockey

© 2022 - 2024 — McMap. All rights reserved.