percentile functions with GROUPBY in BigQuery
Asked Answered
C

3

32

In my CENSUS table, I'd like to group by State, and for each State get the median county population and the number of counties.

In psql, redshift, and snowflake, I can do this:

psql=> SELECT state, count(county), PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "population2000") AS median FROM CENSUS GROUP BY state;
        state         | count |  median
----------------------+-------+----------
 Alabama              |    67 |    36583
 Alaska               |    24 |   7296.5
 Arizona              |    15 |   116320
 Arkansas             |    75 |    20229
...

I'm trying to find a nice way to do this in standard BigQuery. I've noticed that there's undocumented percentile_cont analytical function available, but I have to do some major hacks to get it to do what I want.

I'd like to be able to do the same sort thing with what I've gathered are the correct arguments:

SELECT
  state,
  COUNT(county),
  PERCENTILE_CONT(population2000,
    0.5) OVER () AS `medPop`
FROM
  CENSUS
GROUP BY
  state;

but this query yields the error

SELECT list expression references column population2000 which is neither grouped nor aggregated at

I can get the answer I want, but I'd be very disappointed if this is the recommended way to do what I want to do:

SELECT
  MAX(nCounties) AS nCounties,
  state,
  MAX(medPop) AS medPop
FROM (
  SELECT
    nCounties,
    T1.state,
    (PERCENTILE_CONT(population2000,
        0.5) OVER (PARTITION BY T1.state)) AS `medPop`
  FROM
    census T1
  LEFT OUTER JOIN (
    SELECT
      COUNT(county) AS `nCounties`,
      state
    FROM
      census
    GROUP BY
      state) T2
  ON
    T1.state = T2.state) T3
GROUP BY
  state

Is there a better way to do what I want to do? Also, is the PERCENTILE_CONT function ever going to be documented?

Thanks for reading!

Catalepsy answered 9/8, 2017 at 0:2 Comment(0)
C
43

Thanks for your interest. PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later. Between these 2 releases, a simpler workaround would be

SELECT
  state,
  ANY_VALUE(nCounties) AS nCounties,
  ANY_VALUE(medPop) AS medPop
FROM (
  SELECT
    state,
    COUNT(county) OVER (PARTITION BY state) AS nCounties,
    PERCENTILE_CONT(population2000,
      0.5) OVER (PARTITION BY state) AS medPop
  FROM
    CENSUS)
GROUP BY
  state
Cassandry answered 9/8, 2017 at 0:43 Comment(5)
Update: We have published the documentation at cloud.google.com/bigquery/docs/reference/standard-sql/….Cassandry
any update on adding PERCENTILE_CONT as an aggregate function?Nail
plus one to using percentile_cont as aggregate function, similar to quantiles in legacy sqlLaaland
@MingyuZhong is there any progress on this?Adultery
@MingyuZhong will it apply if we don't use the partition by over on the count aggregation and put it outside the nested select. As it will make it clearer if we do the count(nCounties) in the parent select with the group by clause. Just a suggestion. ThanksChu
J
1

It is 2023 and this works rather nicely:

Select p98 of measure grouped by id from thetable:

  select 
    id,
    approx_quantiles(measure, 100)[offset(98)] as p98
  from thetable
  group by id

I didn't notice any signifficant increase in Slot time either.

Johnathanjohnathon answered 31/10, 2023 at 12:55 Comment(0)
A
0

Adding a minor clarification and a subtle difference between 2 flavours of percentile calculation.

Please notice the difference between a continuous percentile and discrete percentile. A continuous percentile can have values that could an interpolation of the real data.

https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#percentile_cont

APPROX_QUANTILE method will not give you continuous values. and PERCENTILE_CONT can give you continuous values but it still does not work with GROUP BY or without a window function as on 2024 - Q2.

++good to know fact: Hive 3's PERCENTILE() function is actually continuous percentile.

BQ example:

WITH cte1 AS (
    SELECT 'vs' col1, 0 AS col2 
    UNION ALL
    SELECT 'vs' col1, 18 AS col2
),
cte2 AS (
    SELECT *, PERCENTILE_CONT(col2, 0.5) OVER(PARTITION BY col1) AS continuous_ntile FROM cte1
)
SELECT 
col1,
ANY_VALUE(continuous_ntile) AS continuous_ntile,
APPROX_QUANTILES(col2, 100)[SAFE_OFFSET(50)] AS discrete_ntile
FROM cte2 
GROUP BY col1

Result:

col1    continuous_ntile    discrete_ntile
vs      9.0                 0
Antisyphilitic answered 29/7, 2024 at 10:30 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.