Pythonic way of detecting outliers in one dimensional observation data
For the given data, I want to set the outlier values (defined by the 95% confidence level, the 95% quantile function, or whatever is appropriate) to NaN. Below are my data and the code I am using right now. I would be glad if someone could explain this to me further.

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()
Photomural answered 12/3, 2014 at 14:7 Comment(2)
You know your data better but I would think that Winsorising would be better than deletion. Moreover, if set those data as nan, you then have to handle that. Take a look at np.percentile function.Growing
FYI: Detect and exclude outliers in Pandas dataframeGlendaglenden
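As the first comment above suggests, winsorizing clips extreme values to a percentile instead of deleting them. A minimal numpy sketch (the 5th/95th percentile cutoffs and the appended 50.0 are illustrative assumptions, not anything prescribed by the question):

```python
import numpy as np

# Data shaped like the question's, plus one obviously extreme value
data = np.r_[np.random.rand(100) + 5.0, 50.0]

# Winsorize: clip everything outside the 5th-95th percentile range
lo, hi = np.percentile(data, [5, 95])
winsorized = np.clip(data, lo, hi)

print(winsorized.max())  # the 50.0 has been pulled in to the 95th percentile
```

Unlike deletion (or setting values to NaN), this keeps the array length unchanged, which can matter for time-series-style data like the observation sequence plotted above.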
The problem with using percentiles is that the points identified as outliers are a function of your sample size.

There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. "anything above/below this value is unrealistic because...").

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".

Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
    points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
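To tie this back to the original question (setting outliers to NaN rather than removing them), here is a minimal 1-D sketch. For one-dimensional data the L2 distance above reduces to a plain absolute deviation, so the logic is inlined to keep the snippet standalone; the appended 50.0 is an illustrative outlier, not part of the original data:

```python
import numpy as np

data = np.r_[np.random.rand(1000) + 5.0, 50.0]  # question-style data plus one obvious outlier

# Modified z-score from the median absolute deviation (1-D case)
diff = np.abs(data - np.median(data))
mad = np.median(diff)
modified_z = 0.6745 * diff / mad

data[modified_z > 3.5] = np.nan  # set outliers to NaN instead of deleting them
print(int(np.isnan(data).sum()))  # -> 1 (only the injected point is flagged)
```

Because the uniform bulk of the data has deviations of at most ~0.5 while the injected point sits ~44 units from the median, only the injected point exceeds the 3.5 threshold.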

This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

[Four figures: kernel-density plots with rug marks comparing "Percentile-based Outliers" (top panel) against "MAD-based Outliers" (bottom panel) for n = 10, 50, 100, and 1000; flagged points are marked in red along the baseline.]
Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points the larger the sample size is, regardless of whether or not they are actually outliers.
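The same effect can be checked numerically: on data with no true outliers at all, the percentile test flags roughly 5% of the points by construction at every sample size, while the MAD test flags almost none. A small sketch (the sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
for n in [100, 1000, 10000]:
    clean = rng.normal(0, 0.5, n)  # no true outliers at all

    # Percentile test: flags ~5% of n by construction, outliers or not
    lo, hi = np.percentile(clean, [2.5, 97.5])
    pct_flagged = int(np.sum((clean < lo) | (clean > hi)))

    # MAD test: flags only points beyond ~3.5 "robust sigmas"
    diff = np.abs(clean - np.median(clean))
    mad_flagged = int(np.sum(0.6745 * diff / np.median(diff) > 3.5))

    print(n, pct_flagged, mad_flagged)
```

The percentile count grows linearly with n while the MAD count stays near the handful of points you'd expect beyond 3.5 sigma.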

Forrest answered 12/3, 2014 at 16:26 Comment(22)
Joe, +1, this is a great answer. Although I am wondering whether the OP's data is always uniformly distributed (random.rand()), or may follow some other distribution most of the time. If the data is always uniformly distributed, I am not sure about using MAD.Ikon
@CTZhu - Good point, especially if the OP's data is, say, log-normally distributed. For vaguely symmetric distributions, deviations from a normal distribution shouldn't matter too much, but for strongly asymmetric distributions such as the lognormal, MAD isn't a good choice. (Though you could always apply it in log-space to get around that.) All of this just serves to underscore the point that you should put some thought into whatever outlier test you choose.Forrest
@JoeKington Everywhere you are using the median, but diff is calculated as an L2 norm (**2); the median is the value which minimizes the L1 norm, whereas the mean is the center in the L2 norm. I was expecting that if you start with the median you would stay in the L1 norm. Do you have any reason **2 would work better than absolute values in calculating diff?Yeta
@Yeta - Now that you mention it, it does seem very odd to use the L2 norm for this (and it's not in fact the median absolute deviation, then). If I recall correctly, I loosely based it on a couple of implementations in various other languages that did the same thing. I'll need to take another look back at the original reference. If I just change things to the L1 norm, it doesn't work quite correctly, at any rate...Forrest
@Joe Kington Thank you so much in deed. Could you please explain me if I can use the method provided in the following link for the detection of outliers?stats.stackexchange.com/questions/28593/… This is just for comparison purposes.Photomural
In your MAD-based outlier detection, how did you set the values 0.6745 and 3.5, for what purpose, and how does one determine those values? It's very confusing.Photomural
@julie - You'll have to see the reference. They're derived there. Basically the 0.6745 is to make the values roughly equivalent in units to standard deviations and the 3.5 is the recommended threshold (roughly equivalent to 3.5 standard deviations). Sorry I haven't been replying to much lately!Forrest
@julie - I only have a hardcopy of the reference, but there's a good discussion here: habcam.whoi.edu/HabCamData/HAB/processed/…Forrest
So these values should be derived from our data, rather than assigning 0.6745 and 3.5 for any data, right?Photomural
Well, the 0.6745 should be constant, regardless of the dataset. You may want to adjust the 3.5 depending on how close or far you expect the outliers to be to the rest of the data.Forrest
The modified z-value calculated in the function mad_based_outlier looks different from the original one at nist.gov. The med_abs_deviation calculated in the code is different from the definition of the median absolute deviation.Darrickdarrill
The comparative plots are VERY helpful for getting a feel for how at least these two methods differ. Thank you for going through the effort to create them.Boyle
There is something I don't understand in is_outlier. diff is a single value (because of np.sum) - then you use np.median on diff -- is that typo? Why would you calculate the np.median of a single value. Am I missing something here?Hemipode
@Hemipode - Notice the axis kwarg to np.sum - It's not returning a single value, it's returning the sum along the last axis, which is an array. For example: np.sum(np.ones((3, 4)), axis=-1) yields array([4, 4, 4]).Forrest
@JoeKington - My bad... I passed an (L,) numpy array instead of (L, 1). I didn't read the docs. Thanks.Hemipode
This answer (and your code) fails if more than half of the items in the set have the same value, since the median(points - median(points)) will be zero, dividing by zero will result in all np.inf.Fizzle
Alternative mirror for @JoeKington 's PDF paper pdf-archive.com/2016/07/29/outlier-methods-external/…Varletry
@JoeKington I am having trouble getting your MAD implementation to match up with MAD from statsmodel.robust.mad here. Are you aware of any discrepancy?Incertitude
Great Answer but from where did the value : 0.6745 comes?Leuco
@JoeKington Could you please explain why you're using np.sqrt(diff) before computing the median of diff?Dwarf
@KurtBourbaki I figured out why. He is basically trying to do abs(points - median), but he over-complicated things!Schmo
@Schmo Agree - this post has an excellent explanation but the code is, unfortunately, inefficient and over-complicated. It raises all the differences to the power of 2 and then takes a sqrt from each. Why would we do this if we can simply take the absolute values?Mudlark
Detection of outliers in one-dimensional data depends on its distribution.

1- Normal Distribution :

  1. Data values are almost evenly distributed over the expected range: in this case you can easily use all the methods that involve the mean, like a confidence interval of 2 or 3 standard deviations (covering roughly 95% or 99.7% of the data, respectively) for normally distributed data (central limit theorem and sampling distribution of the sample mean). This is a highly effective method. Explained in the Khan Academy Statistics and Probability - Sampling Distribution library.

One other way is prediction interval if you want confidence interval of data points rather than mean.

  2. Data values are randomly distributed over a range: the mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical). The median is another way to measure the center of a numerical data set.

    Median Absolute Deviation - a method which measures the distance of all points from the median in terms of the median distance. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, as covered in Joe Kington's answer above.

2 - Symmetric Distribution: again, Median Absolute Deviation is a good method if the z-score calculation and threshold are changed accordingly.

Explanation : http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

3 - Asymmetric Distribution: Double MAD - Double Median Absolute Deviation. Explanation in the link above.

Attaching my Python code for reference:

import sys
import numpy as np

def is_outlier_doubleMAD(points):
    """
    FOR ASYMMETRIC DISTRIBUTIONS
    Returns : filtered array excluding the outliers

    Parameters : the actual data points array (assumed to be sorted, since the
        median index is used to split it into a left and a right half)

    Calculates the median to divide the data into 2 halves (skew conditions handled).
    Those two halves are then treated as separate data sets, with the same
    calculation as for the symmetric distribution (first answer).
    The only difference is that the thresholds are now the median distances of
    the left and right medians from the overall data median.
    """

    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    medianIndex = points.size // 2  # integer division, so this also works on Python 3

    leftData = np.copy(points[0:medianIndex])
    rightData = np.copy(points[medianIndex:points.size])

    median1 = np.median(leftData, axis=0)
    diff1 = np.sum((leftData - median1)**2, axis=-1)
    diff1 = np.sqrt(diff1)

    median2 = np.median(rightData, axis=0)
    diff2 = np.sum((rightData - median2)**2, axis=-1)
    diff2 = np.sqrt(diff2)

    med_abs_deviation1 = max(np.median(diff1), 0.000001)
    med_abs_deviation2 = max(np.median(diff2), 0.000001)

    threshold1 = ((median - median1) / med_abs_deviation1) * 3
    threshold2 = ((median2 - median) / med_abs_deviation2) * 3

    # if either threshold is 0 -> no outliers on that side
    if threshold1 == 0:
        threshold1 = sys.maxsize  # sys.maxint no longer exists in Python 3
    if threshold2 == 0:
        threshold2 = sys.maxsize
    # multiplied by a factor so that only the outermost points are removed
    modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
    modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2

    filtered1 = []
    for i, score in enumerate(modified_z_score1):
        if score < threshold1:
            filtered1.append(leftData[i])
    filtered2 = []
    for i, score in enumerate(modified_z_score2):
        if score < threshold2:
            filtered2.append(rightData[i])

    filtered = filtered1 + filtered2
    return filtered
Lesleelesley answered 11/1, 2017 at 15:7 Comment(2)
In Python 3, it should be medianIndex = int(points.size/2). Also, if I run the code and set a threshold to zero, it crashes with the message name 'sys' is not defined. Lastly, the self in the function call never gets used.Hydranth
you can also use: medianIndex = points.size//2 to avoid float valueGiliana
I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but makes the comparison apples-to-apples.

def doubleMADsfromMedian(y,thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when 
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
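A quick usage sketch on a right-skewed (lognormal) sample with one extreme point injected into the long tail; the function body is repeated only so the snippet runs on its own, and the sample itself is an illustrative assumption:

```python
import numpy as np

def doubleMADsfromMedian(y, thresh=3.5):
    # Same function as above, repeated so this example is self-contained
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh

rng = np.random.default_rng(0)
y = np.r_[rng.lognormal(0.0, 0.5, 200), 60.0]  # long right tail plus one extreme point

mask = doubleMADsfromMedian(y)
print(y[mask])  # the injected 60.0 is among the flagged points
```

Because the right tail gets its own MAD, ordinary large lognormal values are less likely to be flagged here than they would be under a single symmetric MAD.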
Browse answered 24/3, 2015 at 0:39 Comment(14)
How to use MAD based approach on multivariate data? The article you mentioned is great but works on single dimensional data I guess. I would like to know the simplest way to modify it to make it applicable for multivariate data as well.Roca
There's no easy way to do this for multivariate data. A simple way is to just apply the method to one variable at a time and see if some samples are outliers in any dimensions.Browse
@Browse How do we choose the threshold ? Read through the original post , couldn't dig it there as well .Dualism
@Dualism the usual and unfortunate answer is try a bunch of different thresholds with your actual data and see what gets left out. Since this is 1-D you can do visualizations like the ones in Joe Kington's answer. If it makes it easier you can think of the threshold as sort of like the number of standard deviations. So 3.5 is a lot. But I've used numbers closer to 6 before - just depends on your data.Browse
I think you should replace y_mad[y < m] with y_mad[y <= m],and y_mad[y > m] with y_mad[y >= m], otherwise when y equal to m, y_mad will be zero.Yorktown
Thanks @Yorktown - I've fixed it!Browse
There's a problem with this algorithm. I'm testing with the set 11724, 21513, 20536, 16386, 17586, 11736, 11212, 7861, 8515, 9046, 7016, 90. It won't return True if the outlier is either the first or the last element of the series.Chomp
@7H3IN5ID3R In your example, the median is about 11k, and 20k is as far from 11k as 90 is, so I am not sure what the outlier should actually be here. These are all the sorted actual absolute deviations from median: [256.0, 256.0, 268.0, 2422.0, 2953.0, 3607.0, 4452.0, 4918.0, 6118.0, 9068.0, 10045.0, 11378.0]. Which of these are outliers in your thought?Browse
I don't think there is a outlier there in that series. I'm really not sure about it. I thought outlier is something which deviates from the series. My problem is, I have to find either sudden drop or raise of a data series.Chomp
@Browse shouldn't modified_z_score = 0.6745 * abs_dev / y_mad be 0.6745*(y - m)/y_mad. Because the actual formula is 0.6745(mean(x) - x)/MAD, so i don't think we should multiply abs_dev with 0.6745.Giliana
@TheAG why "0.6745*(y - m)/y_mad"? the numerator is currently "abs_dev = np.abs(y - m)". Your suggestion is to remove the "abs"?Browse
yes buddy.. As per formula, we take the difference of Mi = 0.6745*(Yi - Ymedian) / MAD... One of the article: support.infogix.com/hc/en-us/community/posts/…Giliana
@TheAG Ah I see what you're saying. The reason I made it absolute is that we don't care if the outlier is in the right tail or the left tail. But if you care, then removing the absolute value makes sense!Browse
gotcha..!! Thank you @BrowseGiliana
A simple solution can also be to remove anything outside 2 standard deviations (or 1.96) of the mean:

import random

def outliers(tmp):
    """tmp is a list of numbers"""
    mean = sum(tmp) / (1.0 * len(tmp))
    var = sum((x - mean)**2 for x in tmp) / (1.0 * len(tmp))
    std = var**0.5
    # keep everything more than 1.96 standard deviations from the mean
    return [x for x in tmp if abs(x - mean) > 1.96 * std]


lst = [random.randrange(-10, 55) for _ in range(40)]
print(lst)
print(outliers(lst))
Cindicindie answered 11/3, 2017 at 3:25 Comment(3)
is this for python 2 ?Adsorbate
what should i use instead of xrange in python 3 ?Adsorbate
xrange in python 2 is same as range in python 3. No more xrange in python 3.Cindicindie
Use np.percentile as @Martin suggested:

percentiles = np.percentile(data, [2.5, 97.5])

# keep only the values within the central 95% (use >=, <= to include the boundaries)
data[(percentiles[0] < data) & (percentiles[1] > data)]

# set the outliers to np.nan
data[(percentiles[0] > data) | (percentiles[1] < data)] = np.nan
Ikon answered 12/3, 2014 at 14:40 Comment(6)
Using percentiles of the data as an outlier test is a reasonable first pass, but it's not ideal. The problem is 1) that you'll remove some data, even if it's not an outlier, and 2) the outliers heavily influence the variance, and therefore the percentile values. The most common outlier tests use "median absolute deviation" which is less sensitive to the presence of outliers.Forrest
@Joe Kington I would be grateful if you could implement your way using python code.Photomural
@Joe Kington I saw the link you answered. However, isn't there an easier way to do it primarily using the functions available in numpy?Photomural
@julie - That function extensively uses numpy (it requires a numpy array as input and outputs a numpy array). An outlier test is far beyond the scope of numpy. (numpy itself just contains the core data structure and a few basic operations. It's deliberately small.) You can make an argument that scipy.stats would be a reasonable place for an outlier test, but there are many of them, and there's no single best test. Therefore, there isn't currently a single-function outlier test.Forrest
Statsmodels has a median-absolute deviation function in sm.robust.mad. I'm not sure there are facilities for univariate outlier tests, but there are for influence/outliers in a regression framework. Will see about adding some tools for univariate outlier detection.Sternutatory
@Sternutatory - Didn't know that was there! Thanks!Forrest
