Quartiles in SQL query
Asked Answered
I

6

5

I have a very simple table like that:

CREATE TABLE IF NOT EXISTS LuxLog (
  Sensor TINYINT,
  Lux INT,
  PRIMARY KEY(Sensor)
)

It contains thousands of logs from different sensors.

I would like to have Q1 and Q3 for all sensors.

I can do one query for every data, but it would be better for me to have one query for all sensors (getting Q1 and Q3 back from one query)

I though it would be a fairly simple operation, as quartiles are broadly used and one of the main statistical variables in frequency calculation. The truth is that I found loads of overcomplicated solutions, while I was hoping to find something neat and simple.

Anyone can give me a hint?

Edit: This is a piece of code that I found online, but it is not working for me:

SELECT  SUBSTRING_INDEX(
        SUBSTRING_INDEX(
            GROUP_CONCAT(                 -- 1) make a sorted list of values
                Lux
                ORDER BY Lux
                SEPARATOR ','
            )
        ,   ','                           -- 2) cut at the comma
        ,   75/100 * COUNT(*)        --    at the position beyond the 90% portion
        )
    ,   ','                               -- 3) cut at the comma
    ,   -1                                --    right after the desired list entry
    )                 AS `75th Percentile`
    FROM    LuxLog
    WHERE   Sensor=12
    AND     Lux<>0

I am getting 1 as return value, while it should be a number that can be divided by 10 (10,20,30.....1000)

Incapacity answered 3/7, 2015 at 14:53 Comment(6)
Didn't you ask this question just a few hours ago? No good answers?Oriel
"I though it would be a fairly simple operation, as quartiles are broadly used and one of the main statistical variables in frequency calculation." That's not a very good basis for predicting the ease of a task. Calculating quartiles (and even just medians) is operationally complicated compared to, say, calculating means.Compurgation
You are right, but I am approaching SQL now and since I am used to other high level programming languages the lack of a statistical package is making me suffer.Incapacity
SQL is not well characterized as a programming language. It is a data definition and data manipulation language, based on a relational data model. Although SQL in general and moreso certain implementations do have some features directed at ordering, the underlying model is based on (unordered) sets. This doesn't match up well with certain kinds of tasks, such as computing quartiles. That doesn't mean you can't perform such computations in SQL, but for some tasks you're better off pairing SQL with another language.Compurgation
MySQL in particular happens to be especially limited for such tasks, as it lacks several common features any one of which would make the job easier (window functions, common table expressions, various other specific functions -- even disregarding NTILE()).Compurgation
Ok, but since my choice is between finding a way to query the database and get 1 single value already calculated or getting about 100k values and then compute them myself, I assumed it would be more efficient to avoid the trasmission of 100k over the internet. So, programming language or not, is SQL able to give me quartiles or I need to do it manually?Incapacity
B
5

See SqlFiddle : http://sqlfiddle.com/#!9/accca6/2/6 Note : for the sqlfiddle I've generated 100 rows, each integer between 1 and 100 has a row, but it is a random order (done in excel).

Here is the code :

SET @number_of_rows := (SELECT COUNT(*) FROM LuxLog);
SET @quartile := (ROUND(@number_of_rows*0.25));
SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));
SET @sql_q3 := (CONCAT('( SELECT "Q3" AS quartile_name , Lux, Sensor FROM LuxLog ORDER BY Lux ASC LIMIT 1 OFFSET ', @quartile,');'));
SET @sql := (CONCAT(@sql_q1,' UNION ',@sql_q3));
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;

EDIT :

SET @current_sensor := 101;
SET @quartile := (ROUND((SELECT COUNT(*) FROM LuxLog WHERE Sensor = @current_sensor)*0.25));
SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog WHERE Sensor=', @current_sensor,' ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));
SET @sql_q3 := (CONCAT('( SELECT "Q3" AS quartile_name , Lux, Sensor FROM LuxLog WHERE Sensor=', @current_sensor,' ORDER BY Lux ASC LIMIT 1 OFFSET ', @quartile,');'));
SET @sql := (CONCAT(@sql_q1,' UNION ',@sql_q3));
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;

Underlying reasoning is as follows : For quartile 1 we want to get 25% from the top so we want to know how much rows there are, that's :

SET @number_of_rows := (SELECT COUNT(*) FROM LuxLog);

Now that we know the number of rows, we want to know what is 25% of that, it is this line :

SET @quartile := (ROUND(@number_of_rows*0.25));

Then to find a quartile we want to order the LuxLog table by Lux, then to get the row number "@quartile", in order to do that we set the OFFSET to @quartile to say that we want to start our select from the row number @quartile and we say limit 1 to say that we want to retrieve only one row. That's :

SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));

We do (almost) the same for the other quartile, but rather than starting from the top (from higher values to lower) we start from the bottom (it explains the ASC).

But for now we just have strings stored in the variables @sql_q1 and @sql_q3, so the concatenate them, we union the results of the queries, we prepare the query and execute it.

Bouncing answered 3/7, 2015 at 18:19 Comment(8)
I built the Fiddle (sqlfiddle.com/#!9/a14a4/3) with a small portion of my data (I have about 50k rows per day). I also added 'WHERE Sensor=x' since I need the data for each sensor, but I can't understand how to use Fiddle. Would you like to give it a look?Incapacity
@Incapacity : sorry I thought there were no 2 rows with same sensor ^^ Here is a code that works for a given Sensor (see edited code), I'm working on one that outputs for all sensors, I tell you when it's done.Bouncing
I will try it as soon as I can. Don't bother for the output for all sensors, as I can run different queries for every sensor. The important for me is to get the quartiles from the query instead of getting thousands of raws back from a query and then calculate them myself.Incapacity
Also that is probably my mistake. If I have multiple rows with the same sensor, I shouldn't se the sensor as primary key right?Incapacity
If we take the definition from w3schools for instance (first on Google) : "The PRIMARY KEY constraint uniquely identifies each record in a database table.". It basically means that each row has a different value in the primary key field. It allows you to uniquely identify a row. That is why it is strange to have PRIMARY KEY which is redundant. It is more likely a foreign key, that means that the foreign key field in your table links to a primary key in another table. See : #1693038Bouncing
My table is just a log table used for statistics, I don't really need a primary key, but now that I think about it I don't know if using Sensor as primary key might create problems with the tableIncapacity
Maybe you don't need a primary key BUT you probably need an index (to see the usefulness of an index : #1608 ). It will make the results of your query faster. In addition, you simply cannot use primary index on your database since you have multiple rows which have the same key. It returns an error "Duplicate entry '101' for key 'PRIMARY'" when I tried.Bouncing
I just had the chance to check, and I actually did it good but I made a mistake when I put it here in a simplified way to make things easier. In the real table I have a date column, and I set a primary key this way: 'PRIMARY KEY(Time,Sensor))'; I have a new reading from the sensor every 30s so I should be fine like that. The code is working great, now I just need to figure out how to implement it through C#.Incapacity
A
4

Well to use NTILE is very simple but it is a Postgres Function. You basically just do something like this:

SELECT value_you_are_NTILING,
    NTILE(4) OVER (ORDER BY value_you_are_NTILING DESC) AS tiles
FROM
(SELECT math_that_gives_you_the_value_you_are_NTILING_here AS value_you_are_NTILING FROM tablename);

Here is a simple example I made for you on SQLFiddle: http://sqlfiddle.com/#!15/7f05a/1

In MySQL you would use RANK... Here is the SQLFiddle for that: http://www.sqlfiddle.com/#!2/d5587/1 (this comes from the Question linked below)

This use of MySQL RANK() comes from the Stackoverflow answered here: Rank function in MySQL

Look for the answer by Salman A.

Archenemy answered 3/7, 2015 at 15:11 Comment(5)
Well, the biggest problem with NTILE() for this job is that MySQL doesn't have it (and the question is tagged mysql).Compurgation
Yes, you are right. I am using MySQL and I just noticed that NTILE() is not a MySQL function. I am sorry for wasting your time.Incapacity
It's not a waste of time. I added a link to the discussion of the RANK function in Mysql. That gives you what you are looking for.Archenemy
I tried the example on my table and I got the results grouped with the rank. I guess that in order to get the quartiles I have to ask for rank number totalranks*0.25 and totalranks*0.75?Incapacity
I guess NTILE has been added to MySQL 8Requisition
A
2

Something like this should do it:

select
    ll.*,
    if (a.position is not null, 1,
        if (b.position is not null, 2, 
        if (c.position is not null, 3, 
        if (d.position is not null, 4, 0)))
    ) as quartile
from
    luxlog ll
    left outer join luxlog a on ll.position = a.position and a.lux > (select count(*)*0.00 from luxlog) and a.lux <= (select count(*)*0.25 from luxlog)
    left outer join luxlog b on ll.position = b.position and b.lux > (select count(*)*0.25 from luxlog) and b.lux <= (select count(*)*0.50 from luxlog)
    left outer join luxlog c on ll.position = c.position and c.lux > (select count(*)*0.50 from luxlog) and c.lux <= (select count(*)*0.75 from luxlog)
    left outer join luxlog d on ll.position = d.position and d.lux > (select count(*)*0.75 from luxlog)
;    

Here's the complete example:

use example;

drop table if exists luxlog;

CREATE TABLE LuxLog (
  Sensor TINYINT,
  Lux INT,
  position int,
  PRIMARY KEY(Position)
);

insert into luxlog values (0, 1, 10);
insert into luxlog values (0, 2, 20);
insert into luxlog values (0, 3, 30);
insert into luxlog values (0, 4, 40);
insert into luxlog values (0, 5, 50);
insert into luxlog values (0, 6, 60);
insert into luxlog values (0, 7, 70);
insert into luxlog values (0, 8, 80);

select count(*)*.25 from luxlog;
select count(*)*.50 from luxlog;

select
    ll.*,
    a.position,
    b.position,
    if(
        a.position is not null, 1,
        if (b.position is not null, 2, 0)
    ) as quartile
from
    luxlog ll
    left outer join luxlog a on ll.position = a.position and a.lux >= (select count(*)*0.00 from luxlog) and a.lux < (select count(*)*0.25 from luxlog)
    left outer join luxlog b on ll.position = b.position and b.lux >= (select count(*)*0.25 from luxlog) and b.lux < (select count(*)*0.50 from luxlog)
    left outer join luxlog c on ll.position = c.position and c.lux >= (select count(*)*0.50 from luxlog) and c.lux < (select count(*)*0.75 from luxlog)
    left outer join luxlog d on ll.position = d.position and d.lux >= (select count(*)*0.75 from luxlog) and d.lux < (select count(*)*1.00 from luxlog)
;    


select
    ll.*,
    if (a.position is not null, 1,
        if (b.position is not null, 2, 
        if (c.position is not null, 3, 
        if (d.position is not null, 4, 0)))
    ) as quartile
from
    luxlog ll
    left outer join luxlog a on ll.position = a.position and a.lux > (select count(*)*0.00 from luxlog) and a.lux <= (select count(*)*0.25 from luxlog)
    left outer join luxlog b on ll.position = b.position and b.lux > (select count(*)*0.25 from luxlog) and b.lux <= (select count(*)*0.50 from luxlog)
    left outer join luxlog c on ll.position = c.position and c.lux > (select count(*)*0.50 from luxlog) and c.lux <= (select count(*)*0.75 from luxlog)
    left outer join luxlog d on ll.position = d.position and d.lux > (select count(*)*0.75 from luxlog)
;    
Allo answered 3/7, 2015 at 15:47 Comment(3)
This query took 80 seconds to execute and it gave me the all 10k rows as result. The main reason I am trying to do this is to avoid getting all that data transferred. I expect only the result of the operation back from the select.Incapacity
Edited the rank answer I gave to return just one row for each ntile.Allo
I still get a table with thousands of rows with your query, with 25s for the query with a fraction of the data I will have. I don't understand how those two select count(*) at the beginning should help.Incapacity
A
0

Or you could use rank like this:

select
    ll.*,
    @curRank := @curRank + 1 as rank,
    if (@curRank <= (select count(*)*0.25 from luxlog), 1,
        if (@curRank <= (select count(*)*0.50 from luxlog), 2, 
        if (@curRank <= (select count(*)*0.75 from luxlog), 3, 4))
    ) as quartile
from
    luxlog ll,
    (SELECT @curRank := 0) r
;    

And this will give just one record for each quartile:

select
    x.quartile, group_concat(position)
from (
    select
        ll.*,
        @curRank := @curRank + 1 as rank,
        if (@curRank > 0 and @curRank <= (select count(*)*0.25 from luxlog), 1,
            if (@curRank > 0 and @curRank <= (select count(*)*0.50 from luxlog), 2, 
            if (@curRank > 0 and @curRank <= (select count(*)*0.75 from luxlog), 3, 4))
        ) as quartile
    from
        luxlog ll,
        (SELECT @curRank := 0) r
) x
group by quartile

+ ------------- + --------------------------- +
| quartile      | group_concat(position)      |
+ ------------- + --------------------------- +
| 1             | 10,20                       |
| 2             | 30,40                       |
| 3             | 50,60                       |
| 4             | 70,80                       |
+ ------------- + --------------------------- +
4 rows

EDIT: The sqlFiddle example (http://sqlfiddle.com/#!9/a14a4/17) looks like this after this is removed

/*SET @number_of_rows := (SELECT COUNT(*) FROM LuxLog);
SET @quartile := (ROUND(@number_of_rows*0.25));
SET @sql_q1 := (CONCAT('(SELECT "Q1" AS quartile_name , Lux, Sensor FROM LuxLog WHERE Sensor=101 ORDER BY Lux DESC LIMIT 1 OFFSET ', @quartile,')'));
SET @sql_q3 := (CONCAT('( SELECT "Q3" AS quartile_name , Lux, Sensor FROM LuxLog WHERE Sensor=101 ORDER BY Lux ASC LIMIT 1 OFFSET ', @quartile,');'));
SET @sql := (CONCAT(@sql_q1,' UNION ',@sql_q3));
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;*/

enter image description here

Allo answered 3/7, 2015 at 15:57 Comment(2)
I am trying to test it but it doesn't run. You can see a table with a small sample of my data here: sqlfiddle.com/#!9/a14a4/6Incapacity
If I remove the commented out code (/* */) at the top of the sqlfiddle example you give it works for me.Allo
W
0

Here's a query I came up with for calculating quartiles; it runs in ~0.04s w/ ~5000 table rows. I included the min/max values as I am ultimately using this data to build the four quartile ranges:

   SELECT percentile_table.percentile, avg(ColumnName) AS percentile_values
    FROM   
        (SELECT @rownum := @rownum + 1 AS `row_number`, 
                   d.ColumnName 
            FROM   PercentileTestTable d, 
                   (SELECT @rownum := 0) r 
            WHERE  ColumnName IS NOT NULL 
            ORDER  BY d.ColumnName
        ) AS t1, 
        (SELECT count(*) AS total_rows 
            FROM   PercentileTestTable d 
            WHERE  ColumnName IS NOT NULL 
        ) AS t2, 
        (SELECT 0 AS percentile 
            UNION ALL 
            SELECT 0.25
            UNION ALL 
            SELECT 0.5
            UNION ALL 
            SELECT 0.75
            UNION ALL 
            SELECT 1
        ) AS percentile_table  
    WHERE  
        (percentile_table.percentile != 0 
            AND percentile_table.percentile != 1 
            AND t1.row_number IN 
            ( 
                floor(( total_rows + 1 ) * percentile_table.percentile), 
                floor(( total_rows + 2 ) * percentile_table.percentile)
            ) 
        ) OR (
            percentile_table.percentile = 0 
            AND t1.row_number = 1
        ) OR (
            percentile_table.percentile = 1 
            AND t1.row_number = total_rows
        )
    GROUP BY percentile_table.percentile; 

Fiddle here: http://sqlfiddle.com/#!9/58c0e2/1

There are certainly performance issues; I'd love if anyone has feedback on how to improve this.

Sample data list:

 3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18

Sample query output:

| percentile | percentile_values |
|------------|-------------------|
|          0 |                 3 |
|       0.25 |                 4 |
|        0.5 |              10.5 |
|       0.75 |                15 |
|          1 |                18 |
Worsen answered 4/5, 2016 at 7:49 Comment(0)
F
0

I use this solution with a MYSQL function :

x is the centile you want

array_values your group_concat values order and separated by ,

DROP FUNCTION IF EXISTS centile;

delimiter $$
CREATE FUNCTION `centile`(x Text, array_values TEXT) RETURNS text
BEGIN

Declare DIFF_RANK TEXT;
Declare RANG_FLOOR INT;
Declare COUNT INT;
Declare VALEUR_SUP TEXT;
Declare VALEUR_INF TEXT;

SET COUNT = LENGTH(array_values) - LENGTH(REPLACE(array_values, ',', '')) + 1;
SET RANG_FLOOR = FLOOR(ROUND((x) * (COUNT-1),2));
SET DIFF_RANK = ((x) * (COUNT-1)) - FLOOR(ROUND((x) * (COUNT-1),2));

SET VALEUR_SUP = CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(array_values,',', RANG_FLOOR+2),',',-1) AS DECIMAL);
SET VALEUR_INF = CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(array_values,',', RANG_FLOOR+1),',',-1) AS DECIMAL);

/****
    https://fr.wikipedia.org/wiki/Quantile
    x_j+1 + g (x_j+2 - x_j+1)       
***/
RETURN  Round((VALEUR_INF + (DIFF_RANK* (VALEUR_SUP-VALEUR_INF) ) ),2);

END$$

Example :

Select centile(3/4,GROUP_CONCAT(lux ORDER BY lux SEPARATOR ',')) as quartile_3
FROM LuxLog
WHERE Sensor=12 AND Lux<>0
Federate answered 26/10, 2016 at 15:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.