Pythonic way of detecting outliers in one dimensional observation data
For the given data, I want to set the outlier values (defined by the 95% confidence level, the 95% quantile function, or whatever is appropriate) to NaN. Below are my data and the code I am using right now. I would be glad if someone could explain this to me further.

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()
Photomural answered 12/3, 2014 at 14:7 Comment(2)
You know your data better but I would think that Winsorising would be better than deletion. Moreover, if set those data as nan, you then have to handle that. Take a look at np.percentile function.Growing
FYI: Detect and exclude outliers in Pandas dataframeGlendaglenden
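As the first comment above suggests, winsorizing clips extreme values to a percentile instead of deleting them. A minimal numpy sketch (the 5th/95th percentile cutoffs and the appended 50.0 are illustrative assumptions, not anything prescribed by the question):

```python
import numpy as np

# Data shaped like the question's, plus one obviously extreme value
data = np.r_[np.random.rand(100) + 5.0, 50.0]

# Winsorize: clip everything outside the 5th-95th percentile range
lo, hi = np.percentile(data, [5, 95])
winsorized = np.clip(data, lo, hi)

print(winsorized.max())  # the 50.0 has been pulled in to the 95th percentile
```

Unlike deletion (or setting values to NaN), this keeps the array length unchanged, which can matter for time-series-style data like the observation sequence plotted above.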
The problem with using percentiles is that the points identified as outliers are a function of your sample size.

There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. "anything above/below this value is unrealistic because...").

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".

Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False 
    otherwise.

    Parameters:
    -----------
    points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. 
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
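To tie this back to the original question (setting outliers to NaN rather than removing them), here is a minimal 1-D sketch. For one-dimensional data the L2 distance above reduces to a plain absolute deviation, so the logic is inlined to keep the snippet standalone; the appended 50.0 is an illustrative outlier, not part of the original data:

```python
import numpy as np

data = np.r_[np.random.rand(1000) + 5.0, 50.0]  # question-style data plus one obvious outlier

# Modified z-score from the median absolute deviation (1-D case)
diff = np.abs(data - np.median(data))
mad = np.median(diff)
modified_z = 0.6745 * diff / mad

data[modified_z > 3.5] = np.nan  # set outliers to NaN instead of deleting them
print(int(np.isnan(data).sum()))  # -> 1 (only the injected point is flagged)
```

Because the uniform bulk of the data has deviations of at most ~0.5 while the injected point sits ~44 units from the median, only the injected point exceeds the 3.5 threshold.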

This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

[Four figures: kernel-density plots with rug marks comparing "Percentile-based Outliers" (top panel) against "MAD-based Outliers" (bottom panel) for n = 10, 50, 100, and 1000; flagged points are marked in red along the baseline.]
Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points the larger the sample size is, regardless of whether or not they are actually outliers.
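The same effect can be checked numerically: on data with no true outliers at all, the percentile test flags roughly 5% of the points by construction at every sample size, while the MAD test flags almost none. A small sketch (the sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
for n in [100, 1000, 10000]:
    clean = rng.normal(0, 0.5, n)  # no true outliers at all

    # Percentile test: flags ~5% of n by construction, outliers or not
    lo, hi = np.percentile(clean, [2.5, 97.5])
    pct_flagged = int(np.sum((clean < lo) | (clean > hi)))

    # MAD test: flags only points beyond ~3.5 "robust sigmas"
    diff = np.abs(clean - np.median(clean))
    mad_flagged = int(np.sum(0.6745 * diff / np.median(diff) > 3.5))

    print(n, pct_flagged, mad_flagged)
```

The percentile count grows linearly with n while the MAD count stays near the handful of points you'd expect beyond 3.5 sigma.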

Forrest answered 12/3, 2014 at 16:26 Comment(22)
Joe, +1, this is a great answer. Although I am wondering whether the OP's data is always uniformly distributed (random.rand()), or may follow some other distribution most of the time. If the data is always uniformly distributed, I am not sure about using MAD.Ikon
@CTZhu - Good point, especially if the OP's data is, say, log-normally distributed. For vaguely symmetric distributions, deviations from a normal distribution shouldn't matter too much, but for strongly asymmetric distributions such as the lognormal, MAD isn't a good choice. (Though you could always apply it in log-space to get around that.) All of this just serves to underscore the point that you should put some thought into whatever outlier test you choose.Forrest
@JoeKington Everywhere you are using the median, but diff is calculated as an L2 norm (**2); the median is the value which minimizes the L1 norm, whereas the mean is the center in the L2 norm. I was expecting that if you start with the median you would stay in the L1 norm. Do you have any reason **2 would work better than absolute values in calculating diff?Yeta
@Yeta - Now that you mention it, it does seem very odd to use the L2 norm for this (and it's not in fact the median absolute deviation, then). If I recall correctly, I loosely based it on a couple of implementations in various other languages that did the same thing. I'll need to take another look back at the original reference. If I just change things to the L1 norm, it doesn't work quite correctly, at any rate...Forrest
@Joe Kington Thank you so much in deed. Could you please explain me if I can use the method provided in the following link for the detection of outliers?stats.stackexchange.com/questions/28593/… This is just for comparison purposes.Photomural
In your MAD-based outlier detection, how did you set the values 0.6745 and 3.5, for what purpose, and how does one determine those values? It's very confusing.Photomural
@julie - You'll have to see the reference. They're derived there. Basically the 0.6745 is to make the values roughly equivalent in units to standard deviations and the 3.5 is the recommended threshold (roughly equivalent to 3.5 standard deviations). Sorry I haven't been replying to much lately!Forrest
@julie - I only have a hardcopy of the reference, but there's a good discussion here: habcam.whoi.edu/HabCamData/HAB/processed/…Forrest
So these values should be derived from our data, rather than assigning 0.6745 and 3.5 for any data, right?Photomural
Well, the 0.6745 should be constant, regardless of the dataset. You may want to adjust the 3.5 depending on how close or far you expect the outliers to be to the rest of the data.Forrest
The modified z-value calculated in the function mad_based_outlier looks different from the original one at nist.gov. The med_abs_deviation calculated in the code is different from the definition of the median absolute deviation.Darrickdarrill
The comparative plots are VERY helpful for getting a feel for how at least these two methods differ. Thank you for going through the effort to create them.Boyle
There is something I don't understand in is_outlier. diff is a single value (because of np.sum) - then you use np.median on diff -- is that typo? Why would you calculate the np.median of a single value. Am I missing something here?Hemipode
@Hemipode - Notice the axis kwarg to np.sum - It's not returning a single value, it's returning the sum along the last axis, which is an array. For example: np.sum(np.ones((3, 4)), axis=-1) yields array([4, 4, 4]).Forrest
@JoeKington - My bad... I passed an (L,) numpy array instead of (L, 1). I didn't read the docs. Thanks.Hemipode
This answer (and your code) fails if more than half of the items in the set have the same value, since the median(points - median(points)) will be zero, dividing by zero will result in all np.inf.Fizzle
Alternative mirror for @JoeKington 's PDF paper pdf-archive.com/2016/07/29/outlier-methods-external/…Varletry
@JoeKington I am having trouble getting your MAD implementation to match up with MAD from statsmodel.robust.mad here. Are you aware of any discrepancy?Incertitude
Great Answer but from where did the value : 0.6745 comes?Leuco
@JoeKington Could you please explain why you're using np.sqrt(diff) before computing the median of diff?Dwarf
@KurtBourbaki I figured out why. He is basically trying to do abs(points - median), but he over-complicated things!Schmo
@Schmo Agree - this post has an excellent explanation but the code is, unfortunately, inefficient and over-complicated. It raises all the differences to the power of 2 and then takes a sqrt from each. Why would we do this if we can simply take the absolute values?Mudlark
Detection of outliers in one-dimensional data depends on its distribution.

1- Normal Distribution :

  1. Data values are almost evenly distributed over the expected range: in this case you can easily use all the methods that involve the mean, like a confidence interval of 2 or 3 standard deviations (covering roughly 95% or 99.7% of the data, respectively) for normally distributed data (central limit theorem and sampling distribution of the sample mean). This is a highly effective method. Explained in the Khan Academy Statistics and Probability - Sampling Distribution library.

One other way is prediction interval if you want confidence interval of data points rather than mean.

  2. Data values are randomly distributed over a range: the mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical). The median is another way to measure the center of a numerical data set.

    Median Absolute Deviation - a method which measures the distance of all points from the median in terms of the median distance. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, as covered in Joe Kington's answer above.

2 - Symmetric Distribution: again, Median Absolute Deviation is a good method if the z-score calculation and threshold are changed accordingly.

Explanation : http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

3 - Asymmetric Distribution: Double MAD - Double Median Absolute Deviation. Explanation in the link above.

Attaching my Python code for reference:

import sys
import numpy as np

def is_outlier_doubleMAD(points):
    """
    FOR ASYMMETRIC DISTRIBUTIONS
    Returns : filtered array excluding the outliers

    Parameters : the actual data points array (assumed to be sorted, since the
        median index is used to split it into a left and a right half)

    Calculates the median to divide the data into 2 halves (skew conditions handled).
    Those two halves are then treated as separate data sets, with the same
    calculation as for the symmetric distribution (first answer).
    The only difference is that the thresholds are now the median distances of
    the left and right medians from the overall data median.
    """

    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    medianIndex = points.size // 2  # integer division, so this also works on Python 3

    leftData = np.copy(points[0:medianIndex])
    rightData = np.copy(points[medianIndex:points.size])

    median1 = np.median(leftData, axis=0)
    diff1 = np.sum((leftData - median1)**2, axis=-1)
    diff1 = np.sqrt(diff1)

    median2 = np.median(rightData, axis=0)
    diff2 = np.sum((rightData - median2)**2, axis=-1)
    diff2 = np.sqrt(diff2)

    med_abs_deviation1 = max(np.median(diff1), 0.000001)
    med_abs_deviation2 = max(np.median(diff2), 0.000001)

    threshold1 = ((median - median1) / med_abs_deviation1) * 3
    threshold2 = ((median2 - median) / med_abs_deviation2) * 3

    # if either threshold is 0 -> no outliers on that side
    if threshold1 == 0:
        threshold1 = sys.maxsize  # sys.maxint no longer exists in Python 3
    if threshold2 == 0:
        threshold2 = sys.maxsize
    # multiplied by a factor so that only the outermost points are removed
    modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
    modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2

    filtered1 = []
    for i, score in enumerate(modified_z_score1):
        if score < threshold1:
            filtered1.append(leftData[i])
    filtered2 = []
    for i, score in enumerate(modified_z_score2):
        if score < threshold2:
            filtered2.append(rightData[i])

    filtered = filtered1 + filtered2
    return filtered
Lesleelesley answered 11/1, 2017 at 15:7 Comment(2)
In Python 3, it should be medianIndex = int(points.size/2). Also, if I run the code and set a threshold to zero, it crashes with the message name 'sys' is not defined. Lastly, the self in the function call never gets used.Hydranth
you can also use: medianIndex = points.size//2 to avoid float valueGiliana
I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but makes the comparison apples-to-apples.

def doubleMADsfromMedian(y,thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when 
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
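A quick usage sketch on a right-skewed (lognormal) sample with one extreme point injected into the long tail; the function body is repeated only so the snippet runs on its own, and the sample itself is an illustrative assumption:

```python
import numpy as np

def doubleMADsfromMedian(y, thresh=3.5):
    # Same function as above, repeated so this example is self-contained
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh

rng = np.random.default_rng(0)
y = np.r_[rng.lognormal(0.0, 0.5, 200), 60.0]  # long right tail plus one extreme point

mask = doubleMADsfromMedian(y)
print(y[mask])  # the injected 60.0 is among the flagged points
```

Because the right tail gets its own MAD, ordinary large lognormal values are less likely to be flagged here than they would be under a single symmetric MAD.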
Browse answered 24/3, 2015 at 0:39 Comment(14)
How to use MAD based approach on multivariate data? The article you mentioned is great but works on single dimensional data I guess. I would like to know the simplest way to modify it to make it applicable for multivariate data as well.Roca
There's no easy way to do this for multivariate data. A simple way is to just apply the method to one variable at a time and see if some samples are outliers in any dimensions.Browse
@Browse How do we choose the threshold ? Read through the original post , couldn't dig it there as well .Dualism
@Dualism the usual and unfortunate answer is try a bunch of different thresholds with your actual data and see what gets left out. Since this is 1-D you can do visualizations like the ones in Joe Kington's answer. If it makes it easier you can think of the threshold as sort of like the number of standard deviations. So 3.5 is a lot. But I've used numbers closer to 6 before - just depends on your data.Browse
I think you should replace y_mad[y < m] with y_mad[y <= m],and y_mad[y > m] with y_mad[y >= m], otherwise when y equal to m, y_mad will be zero.Yorktown
Thanks @Yorktown - I've fixed it!Browse
There's a problem with this algorithm. I'm testing with the set 11724, 21513, 20536, 16386, 17586, 11736, 11212, 7861, 8515, 9046, 7016, 90. It won't return True if the outlier is either the first or the last element of the series.Chomp
@7H3IN5ID3R In your example, the median is about 11k, and 20k is as far from 11k as 90 is, so I am not sure what the outlier should actually be here. These are all the sorted actual absolute deviations from median: [256.0, 256.0, 268.0, 2422.0, 2953.0, 3607.0, 4452.0, 4918.0, 6118.0, 9068.0, 10045.0, 11378.0]. Which of these are outliers in your thought?Browse
I don't think there is a outlier there in that series. I'm really not sure about it. I thought outlier is something which deviates from the series. My problem is, I have to find either sudden drop or raise of a data series.Chomp
@Browse shouldn't modified_z_score = 0.6745 * abs_dev / y_mad be 0.6745*(y - m)/y_mad. Because the actual formula is 0.6745(mean(x) - x)/MAD, so i don't think we should multiply abs_dev with 0.6745.Giliana
@TheAG why "0.6745*(y - m)/y_mad"? the numerator is currently "abs_dev = np.abs(y - m)". Your suggestion is to remove the "abs"?Browse
yes buddy.. As per formula, we take the difference of Mi = 0.6745*(Yi - Ymedian) / MAD... One of the article: support.infogix.com/hc/en-us/community/posts/…Giliana
@TheAG Ah I see what you're saying. The reason I made it absolute is that we don't care if the outlier is in the right tail or the left tail. But if you care, then removing the absolute value makes sense!Browse
gotcha..!! Thank you @BrowseGiliana
A simple solution can also be to remove anything outside 2 standard deviations (or 1.96) of the mean:

import random

def outliers(tmp):
    """tmp is a list of numbers"""
    mean = sum(tmp) / (1.0 * len(tmp))
    var = sum((x - mean)**2 for x in tmp) / (1.0 * len(tmp))
    std = var**0.5
    # keep everything more than 1.96 standard deviations from the mean
    return [x for x in tmp if abs(x - mean) > 1.96 * std]


lst = [random.randrange(-10, 55) for _ in range(40)]
print(lst)
print(outliers(lst))
Cindicindie answered 11/3, 2017 at 3:25 Comment(3)
is this for python 2 ?Adsorbate
what should i use instead of xrange in python 3 ?Adsorbate
xrange in python 2 is same as range in python 3. No more xrange in python 3.Cindicindie
Use np.percentile as @Martin suggested:

percentiles = np.percentile(data, [2.5, 97.5])

# keep only the values within the central 95% (use >=, <= to include the boundaries)
data[(percentiles[0] < data) & (percentiles[1] > data)]

# set the outliers to np.nan
data[(percentiles[0] > data) | (percentiles[1] < data)] = np.nan
Ikon answered 12/3, 2014 at 14:40 Comment(6)
Using percentiles of the data as an outlier test is a reasonable first pass, but it's not ideal. The problem is 1) that you'll remove some data, even if it's not an outlier, and 2) the outliers heavily influence the variance, and therefore the percentile values. The most common outlier tests use "median absolute deviation" which is less sensitive to the presence of outliers.Forrest
@Joe Kington I would be grateful if you could implement your way using python code.Photomural
@Joe Kington I saw the link you answered. However, isn't there an easier way to do it primarily using the functions available in numpy?Photomural
@julie - That function extensively uses numpy (it requires a numpy array as input and outputs a numpy array). An outlier test is far beyond the scope of numpy. (numpy itself just contains the core data structure and a few basic operations. It's deliberately small.) You can make an argument that scipy.stats would be a reasonable place for an outlier test, but there are many of them, and there's no single best test. Therefore, there isn't currently a single-function outlier test.Forrest
Statsmodels has a median-absolute deviation function in sm.robust.mad. I'm not sure there are facilities for univariate outlier tests, but there are for influence/outliers in a regression framework. Will see about adding some tools for univariate outlier detection.Sternutatory
@Sternutatory - Didn't know that was there! Thanks!Forrest
