Compute 2,3 quartile average in SQL
Asked Answered
S

5

14

I want to write a stored proc in SQL (MySQL) to compute the average of second and third quartiles.

In other words I have records for measurements for how long it takes for an URL to load. Records are (id,url,time) and they are many measurements for each URL. What I'm trying to do is for each URL remove the lowest and top 25% (i.e. lower and upper quartiles) and compute average of the remaining 25%-75% of loading times. And store this into another table.

I saw some examples for this for MS SQL and seemed to be relatively easy. But I have to use MySQL where :

  • LIMIT clause doesn't support percents ( no analogue to select top 25% )
  • LIMIT clause doesn't support its arguments to be variables (only constants)
  • functions don't support dynamic SQL (e.g. PREPARE and EXECUTE )

And I got as far as here :

create procedure G(
  IN val VARCHAR(10)
)
Begin
  select @cnt:=count(*) from test where a=val;
  select  @of:= @cnt /4;
  SELECT @len:= @cnt/2; 
  Prepare stmt from 'select * from test where a="a" LIMIT ?,?';
  execute stmt using @of, @len;
END;

I can write it in PHP but think in SQL it would have much better overall performance. I will appreciate some help very much.

Schinica answered 9/8, 2011 at 7:58 Comment(1)
Joe Celko's 'SQL For Smarties' has a chapter on statistics (mode, median, variance, etc.) It's worth the purchase price.Subterrane
V
3

Look at answer and comment by @Richard aka cyberkiwi in this question:

Select *
from
(
    SELECT tbl.*, @counter := @counter +1 counter
    FROM (select @counter:=0) initvar, tbl
    ORDER BY ordcolumn
) X
where counter >= (25/100 * @counter) and counter <= (75/100 * @counter);
ORDER BY ordcolumn
Vermis answered 9/8, 2011 at 8:37 Comment(0)
C
1

You can create the quartile values by using IF to set them to zero if in the wrong quartile:

Let's assume, the raw data table is created by

DROP TABLE IF EXISTS `rawdata`;
CREATE TABLE `rawdata` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(250) NOT NULL DEFAULT '',
  `time` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `time` (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

(and ofcourse populated).

Let's also assume the quartile table data is created by

DROP TABLE IF EXISTS `quartiles`;
CREATE TABLE `quartiles` (
  `url` varchar(250) NOT NULL,
  `Q1` float DEFAULT '0',
  `Q2` float DEFAULT '0',
  `Q3` float DEFAULT '0',
  `Q4` float DEFAULT '0',
  PRIMARY KEY (`url`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

(and left empty).

Then a procedure to populate quartiles from rawdata would look like

DELIMITER ;;

CREATE PROCEDURE `ComputeQuartiles`()
    READS SQL DATA
BEGIN
    DECLARE numrows int DEFAULT 0;
    DECLARE qrows int DEFAULT 0;
    DECLARE rownum int DEFAULT 0;
    DECLARE done int DEFAULT 0;
    DECLARE currenturl VARCHAR(250) CHARACTER SET utf8;
    DECLARE Q1,Q2,Q3,Q4 float DEFAULT 0.0;
    DECLARE allurls CURSOR FOR SELECT DISTINCT url FROM rawdata;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET currenturl='';

    OPEN allurls;
    FETCH allurls INTO currenturl;
    WHILE currenturl<>'' DO
        SELECT COUNT(*) INTO numrows FROM rawdata WHERE url=currenturl;
        SET qrows=FLOOR(numrows/4);
        if qrows>0 THEN
            -- Only session parameters can be recalculated inside a query,
            -- so @rownum:=@rownum+1 will work, but rownum:=rownum+1 will not.
            SET @rownum=0;
            SELECT
                SUM(IFNULL(QA,0))/qrows, 
                SUM(IFNULL(QB,0))/qrows, 
                SUM(IFNULL(QC,0))/qrows, 
                SUM(IFNULL(QD,0))/qrows
            FROM (
                SELECT 
                    if(@rownum<qrows,time,0) AS QA,
                    if(@rownum>=qrows AND @rownum<2*qrows,time,0) AS QB,
                    -- the middle 0-3 rows are left out 
                    if(@rownum>=(numrows-2*qrows) AND @rownum<(numrows-qrows),time,0) AS QC,
                    if(@rownum>=(numrows-qrows),time,0) AS QD,
                    @rownum:=@rownum+1 AS dummy
                FROM rawdata
                WHERE url=currenturl ORDER BY time
            ) AS baseview
            INTO Q1,Q2,Q3,Q4
            ;
            REPLACE INTO quartiles values (currenturl,Q1,Q2,Q3,Q4);
        END IF;

        FETCH allurls INTO currenturl;
    END WHILE;
    CLOSE allurls;

END ;;

DELIMITER ;

The main points being:

  • Use a cursor to cycle the URLs (or adapt the sample to accept the URL as a parameter)
  • For every URL find the total number of rows
  • Do some trivial math to leave out the middle rows, if (rowcount % 4) != 0
  • select all raw rows for the URL, assigning the value of time to one of QA-QD, depending on the row number, assigning the other Qx the value 0
  • Use this query as a subquery to another one, which sums up and normalizes the values
  • Use the results of this superquery to update quartiles table

I tested this with 18432 raw rows, url=concat('http://.../',floor(rand()*10)), time=round(rand()*10000) on a 8x1.9GHz machine and it finished consistently in 0.50-0.54sec

Civilization answered 15/12, 2011 at 22:54 Comment(0)
L
0

how about this ?

prepare stmt from select concat('select * from test where a="a" LIMIT ',@of,@len);
execute stmt;
Lomasi answered 9/8, 2011 at 9:6 Comment(1)
1. This is still dynamic SQL and is no different that what I wrote 2. This line yields syntax error despite that it looks OK to meSchinica
R
0

Take a look at this excellent example of calculating percentiles with MySQL. I have used this with great success on some pretty large datasets.

http://planet.mysql.com/entry/?id=13588

Take note of the section relating to group_concat_max_len - this is really important. Setting this value to the maximum allowable value - which is your setting for max packet size, will ensure that if the string it builds gets too big, you'll get a proper error rather than just a 'truncated field' warning.

SET @@group_concat_max_len := @@max_allowed_packet;

What I would do is use this function to calculate the 25th and 75th percentiles (which can be done in a single query), and then calculate the averages of your remaining data by running a second query against the data.

<?php
$lowVal = /* result of query getting the 25%ile value */;
$highVal = /* result of query getting the 75%ile value */;

$strSQL = "SELECT AVG(`field`) AS myAvg 
             FROM `table` 
             WHERE { your_existing_criteria_goes_here }
                AND `filter_field` BETWEEN '{$lowVal}' AND '{$highVal}';"
/* Run the query and extract your data */
?>

Hope that all makes sense, and help with your problem :)

Rabid answered 25/10, 2011 at 8:6 Comment(1)
I know it's not a stored procedure, and done within PHP, and I'm sure that someone out there can re-write it as a stored procedure which is able to get the 25%ile and 75%ile values out of a single query and then apply them to the main filtered query, but the above is how I would do it. It then allows you to use the 25%ile and 75%ile values in other areas of your code, e.g. when creating some headings or legend information for the data table or chart you are building ;)Rabid
K
0

Why don't you just use one query this way:

select url, avg(time)
from mytable A
where time >
       (select min(B.time) + ((max(B.time)-min(B.time))/100*25)
          from mytable B where B.url = A.url)
and time <
       (select max(B.time) - ((max(B.time)-min(B.time))/100*25)
          from mytable B where B.url = A.url)
group by url;
Ku answered 14/11, 2011 at 17:24 Comment(1)
That doesn't necessarily get you the 2nd and 3rd quartile. If the times are uniformly distributed then it'll be close but as it is this doesn't do what is asked for.Fenian

© 2022 - 2024 — McMap. All rights reserved.