bigquery group by all columns except a few

Asked 20/2, 2019 at 17:42 Answered 6/6, 2024 at 14:31

I have a table with loads of fields, and I am trying to group by all except two values which I am summing on. I would like to do something like

SELECT my_table.* except(value_1, value_2)
    , sum(value_1)
    , sum(value_2)
FROM my_table
GROUP BY my_table.* except(value_1, value_2)

But unfortunately GROUP BY my_table.* except(value_1, value_2) do not work. Any suggestions please?

Pitchblack answered 20/2, 2019 at 17:42 Comment(0)

Below is for BigQuery Standard SQL

#standardSQL
SELECT DISTINCT * EXCEPT(value_1, value_2, grp),
  SUM(value_1) OVER(PARTITION BY grp) sum_value_1,
  SUM(value_2) OVER(PARTITION BY grp) sum_value_2
FROM (
  SELECT *, REGEXP_REPLACE(TO_JSON_STRING(t), r'"(?:value_1|value_2)":.+?[,}]', '') grp
  FROM `project.dataset.table` t
)

You can test, play with above using dummy data as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 value_1, 2 value_2, 3 value_3, 4 value_4 UNION ALL
  SELECT 11, 12, 3, 14 UNION ALL
  SELECT 21, 22, 3, 14
)
SELECT DISTINCT * EXCEPT(value_1, value_2, grp),
  SUM(value_1) OVER(PARTITION BY grp) sum_value_1,
  SUM(value_2) OVER(PARTITION BY grp) sum_value_2
FROM (
  SELECT *, REGEXP_REPLACE(TO_JSON_STRING(t), r'"(?:value_1|value_2)":.+?[,}]', '') grp
  FROM `project.dataset.table` t
)

with result as

Row value_3 value_4 sum_value_1 sum_value_2  
1   3       14      32          34   
2   3       4       1           2

Above will work with any number of columns and you don't need to reference them all explicitly - only those columns to be excluded to be explicitly referenced - value_1 and value_2 in this example

Pascoe answered 20/2, 2019 at 18:5 Comment(4)

This is fantastic! Thanks a lot! Can you explain a little what this part is doing? REGEXP_REPLACE(TO_JSON_STRING(t), r'"(?:value_1|value_2)":.+?[,}]', '') grp – Dobby 24/9, 2019 at 14:9

@Dobby - it convert whole row into json, then empties all values for value_1 and value_2 attributes so then you can use it as attribute for PARTITION BY – Pascoe 24/9, 2019 at 15:43

Are we sure that the order is always the same when serialising rows into JSONs? To me this looks risky since JSON is unordered by definition. – Nomenclator 6/7, 2022 at 13:29

@Nomenclator TO_JSON_STRING returns a string in JSON format and does not 'create a JSON object' in a formal sense: cloud.google.com/bigquery/docs/reference/standard-sql/… – Ephialtes 20/10, 2022 at 21:36

BigQuery has since released the GROUP BY ALL clause (currently in preview):
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#group_by_all

This allows you to rewrite your original suggestion like so:

SELECT
    my_table.* EXCEPT(value_1, value_2),
    SUM(value_1),
    SUM(value_2),
FROM my_table
GROUP BY ALL

(BigQuery has also supported trailing commas for years now so I rewrote your SELECT slightly)

The clause automatically excludes expressions that include aggregate functions so this works perfectly for your use case.

Suttle answered 6/6, 2024 at 14:31 Comment(0)

Recommended topics

Hot tags