Google BigQuery APPROX_QUANTILES and getting true quartiles
Asked Answered
F

1

9

According to the docs:

Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.

It sounds like if I want true quartiles, I need to use APPROX_QUANTILES(values, 4) which will return [minvalue, 1st quartile, 2nd quartile, 3rd quartile, maxvalue]

As according to https://en.wikipedia.org/wiki/Quartile, quartile sets contain 3 data points - none of which is the min/max values of the data.

Is my assumption correct? Is APPROX_QUANTILES(values, 4) going to return the true quartiles?

Falciform answered 18/1, 2018 at 17:10 Comment(2)
What is your question? Is it how to return an array with the first element removed?Dinger
Edited to make the question more clear - will I get the quartiles with 4?Falciform
D
11

As a baseline, this is the output without any modification, using an input of numbers between 1 and 100:

SELECT APPROX_QUANTILES(x, 4) AS output
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
+----------------------------+
|           output           |
+----------------------------+
| ["1","25","50","75","100"] |
+----------------------------+

The output includes both the minimum (1) and the maximum (100). If you just want the quartiles, you need to strip them from the array. For the sake of readability/composability, it's best to do this using a temporary SQL UDF. Here I'm using INT64 for the element type, but you could have a different element type, alternatively:

CREATE TEMP FUNCTION StripFirstLast(arr ARRAY<INT64>) AS (
  ARRAY(SELECT x FROM UNNEST(arr) AS x WITH OFFSET
  WHERE OFFSET BETWEEN 1 AND ARRAY_LENGTH(arr) - 2)
);

SELECT
  APPROX_QUANTILES(x, 4) AS output,
  StripFirstLast(APPROX_QUANTILES(x, 4)) AS quartiles
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
+----------------------------+------------------+
|           output           |    quartiles     |
+----------------------------+------------------+
| ["1","25","50","75","100"] | ["25","50","75"] |
+----------------------------+------------------+

You can see that the quartiles array contains only the desired values.

Dinger answered 18/1, 2018 at 19:37 Comment(1)
Thank you for putting that example together, it showcases the logic perfectly.Falciform

© 2022 - 2024 — McMap. All rights reserved.