how to get average that ignores outliers?
Asked Answered
T

5

5

say I have a postgresql table with the following values:

id | value
----------
1  | 4
2  | 8
3  | 100
4  | 5
5  | 7

If I use postgresql to calculate the average, it gives me an average of 24.8 because the high value of 100 has great impact on the calculation. While in fact I would like to find an average somewhere around 6 and eliminate the extreme(s).

I am looking for a way to eliminate extremes and want to do this "statistically correct". The extreme's cannot be fixed. I cannot say; If a value is over X, it has to be eliminated.

I have been bending my head on the postgresql aggregate functions but cannot put my finger on what is right for me to use. Any suggestions?

Therapeutics answered 29/5, 2010 at 8:41 Comment(1)
You are looking for a truncated meanGifferd
P
8

I cannot say; If a value is over X, it has to be eliminated.

Well, you could use having and a subselect to eliminate outliers, something like:

HAVING value < (
 SELECT 2 * avg(value)
 FROM   mytable
 GROUP BY ...
)

(Or, for that matter, use a more complex version to eliminate anything above 2 or 3 standard deviations if you want something that will be better at eliminating only outliers.)

The other option is to look at generating a median value, which is a fairly statistically sound way of accounting for outliers; happily there are three reasonable examples of just that: one from the Postgresql Wiki, one built as an Oracle compatability layer, and another from the PostgreSQL Journal. Note the caveats around how precisely/accurately they implement medians.

Patrick answered 29/5, 2010 at 10:28 Comment(1)
Excelent answer, especially the wiki page on aggregate median! I will however, as Peter Tillemans suggest, combine it with the stddev. But since your answer contains the most hints, I will rate it as the correct answer.Therapeutics
F
13

Postgresql can also calculate the standard deviation.

You could take only the data points which are in the average() +/- 2*stddev() which would roughly correspond to the 90% datapoints closest to the average.

Of course 2 can also be 3 (95%) or 6 (99.995%) but do not get hung up on the numbers because in the presence of a collection outliers you are no longer dealing with a normal distribution.

Be very careful and validate that it works as expected.

Foulup answered 29/5, 2010 at 8:59 Comment(2)
This sounds good! I didn't know stddev would result in percentages of the set although it sounds perfectly legit. I know if I combine your answer with the one by Rodger, I must be on the right track!Therapeutics
It seems you're assuming this is a normal distribution (which is very hard to say from the example in the question, in fact, from 5 datapoints like this, it looks like it isn't). If so, your percentages aren't quite right either.Arne
P
8

I cannot say; If a value is over X, it has to be eliminated.

Well, you could use having and a subselect to eliminate outliers, something like:

HAVING value < (
 SELECT 2 * avg(value)
 FROM   mytable
 GROUP BY ...
)

(Or, for that matter, use a more complex version to eliminate anything above 2 or 3 standard deviations if you want something that will be better at eliminating only outliers.)

The other option is to look at generating a median value, which is a fairly statistically sound way of accounting for outliers; happily there are three reasonable examples of just that: one from the Postgresql Wiki, one built as an Oracle compatability layer, and another from the PostgreSQL Journal. Note the caveats around how precisely/accurately they implement medians.

Patrick answered 29/5, 2010 at 10:28 Comment(1)
Excelent answer, especially the wiki page on aggregate median! I will however, as Peter Tillemans suggest, combine it with the stddev. But since your answer contains the most hints, I will rate it as the correct answer.Therapeutics
D
4

Here's an aggregate function which will calculate the trimmed mean for a set of values, excluding values outside N standard deviations from the mean.

Example:

DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (x FLOAT);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);
INSERT INTO foo VALUES (3);
INSERT INTO foo VALUES (4);
INSERT INTO foo VALUES (100);

SELECT avg(x), tmean(x, 2), tmean(x, 1.5) FROM foo;

--  avg | tmean | tmean 
-- -----+-------+-------
--   22 |    22 |   2.5

Code:

DROP TYPE IF EXISTS tmean_stype CASCADE;

CREATE TYPE tmean_stype AS (
  deviations FLOAT,
    count INT,
    acc FLOAT,
    acc2 FLOAT,
    vals FLOAT[]
);

CREATE OR REPLACE FUNCTION tmean_sfunc(tmean_stype, float, float)
RETURNS tmean_stype AS $$
    SELECT $3, $1.count + 1, $1.acc + $2, $1.acc2 + ($2 * $2), array_append($1.vals, $2);
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION tmean_finalfunc(tmean_stype)
RETURNS float AS $$
DECLARE
    fcount INT;
    facc FLOAT;
    mean FLOAT;
    stddev FLOAT;
    lbound FLOAT;
    ubound FLOAT;
    val FLOAT;
BEGIN
    mean := $1.acc / $1.count;
    stddev := sqrt(($1.acc2 / $1.count) - (mean * mean));
    lbound := mean - stddev * $1.deviations;
    ubound := mean + stddev * $1.deviations;
    -- RAISE NOTICE 'mean: % stddev: % lbound: % ubound: %', mean, stddev, lbound, ubound;

    fcount := 0;
    facc := 0;
    FOR i IN array_lower($1.vals, 1) .. array_upper($1.vals, 1) LOOP
        val := $1.vals[i];
        IF val >= lbound AND val <= ubound THEN
            fcount := fcount + 1;
            facc := facc + val;
        END IF; 
    END LOOP;

    IF fcount = 0 THEN
        return NULL;
    END IF;
    RETURN facc / fcount;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE tmean(float, float)
(
    SFUNC = tmean_sfunc,
    STYPE = tmean_stype,
    FINALFUNC = tmean_finalfunc,
    INITCOND = '(-1, 0, 0, 0, {})'
);

Gist (which should be identical): https://gist.github.com/4458294

Dopester answered 4/1, 2013 at 23:3 Comment(0)
L
2

Mind using the ntile window function. It allows you to easily isolate extreme values from the result set.

Let's say you want to cut 10% from both sides of the result set. Then passing the value of 10 to ntile and looking for values between 2 and 9 would give you the desired result. Keep also in mind that if you have less than 10 records, you might accidentally cut more than 20%, so be sure to check the total amount of records as well.

WITH yyy AS (
  SELECT
    id,
    value,
    NTILE(10) OVER (ORDER BY value) AS ntiled,
    COUNT(*) OVER () AS counted
  FROM
    xxx)
SELECT
  *
FROM
  yyy
WHERE
  counted < 10 OR ntiled BETWEEN 2 AND 9;
Lacroix answered 29/6, 2016 at 14:28 Comment(0)
M
2

You can use IQR to filter outliers. PL/pgSQL code:

select percentile_cont(0.25) WITHIN GROUP (ORDER BY value)
  into q1
  from table;
select percentile_cont(0.75) WITHIN GROUP (ORDER BY value)
  into q3
  from table;

iqr := q3 - q1;
min := q1 - 1.5 * iqr;
max := q3 + 1.5 * iqr;

select value
  into result
  from table
  where value >= min and value <= max;
return result;
Moran answered 28/11, 2019 at 11:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.