Weighted standard deviation in NumPy
Asked Answered
B

7

122

numpy.average() has a weights option, but numpy.std() does not. Does anyone have suggestions for a workaround?

Belgravia answered 9/3, 2010 at 23:53 Comment(2)
Btw, calculation of weighted std dev is actually a rather complex subject -- there's more than one way to do it. See here for a great discussion: stata.com/support/faqs/statistics/…Waechter
ccgalberta.com/pygeostat/statistics.html#weighted-statisticsAutocade
G
189

How about the following short "manual calculation"?

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    They weights are in effect first normalized so that they 
    sum to 1 (and so they must not all be 0).

    values, weights -- NumPy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values-average)**2, weights=weights)
    return (average, math.sqrt(variance))

    
Globoid answered 10/3, 2010 at 8:7 Comment(8)
Why not use numpy.average again for the variance?Strip
Just wanted to point out that this will give the biased variance. For small sample sizes, you may want to re-scale the variance (before sqrt) to get the unbiased variance. See en.wikipedia.org/wiki/…Marinamarinade
Yeah, the unbiased variance estimator would be slightly different. This answer gives the standard deviation, since the question asks for a weighted version of numpy.std().Globoid
thx for this solution... but why do you use math.sqrt instead of np.sqrt in the end?Begird
np.sqrt() would work, but because variance is a simple (Numpy) float (and not a NumPy array), math.sqrt() is more explicit and appropriate (and therefore in general faster, if this matters).Globoid
I wanted to object to @EricO.Lebigot, but the difference is indeed staggering: %timeit math.sqrt(1.5) 61.6 ns ± 0.661 ns per loop, %timeit np.sqrt(1.5) 554 ns ± 14.5 ns per loop. math.sqrt is almost a factor 10 faster (Python 3.10.9, numpy v1.23.3).Sire
The computation for weighted variance is wrong. I wouldn't use np.average for weighted variance because "variance = numpy.average((values-average)**2, weights=sample_weights**2)" assuming sum(sample_weights) == 1. But np.average internally adjusts weights so that sum(weights) == 1. Thus, np.average can't be used for weighted variance.Kea
This is an interesting point. It is quite safe to assume that concrete cases want an average and variance calculated with normalized weights (otherwise they are not very much like an average or an average variance). I did make this explicit in the docstring. Thanks!Globoid
T
55

There is a class in statsmodels that makes it easy to calculate weighted statistics: statsmodels.stats.weightstats.DescrStatsW.

Assuming this dataset and weights:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

array = np.array([1,2,1,2,1,2,1,3])
weights = np.ones_like(array)
weights[3] = 100

You initialize the class (note that you have to pass in the correction factor, the delta degrees of freedom at this point):

weighted_stats = DescrStatsW(array, weights=weights, ddof=0)

Then you can calculate:

  • .mean the weighted mean:

    >>> weighted_stats.mean      
    1.97196261682243
    
  • .std the weighted standard deviation:

    >>> weighted_stats.std       
    0.21434289609681711
    
  • .var the weighted variance:

    >>> weighted_stats.var       
    0.045942877107170932
    
  • .std_mean the standard error of weighted mean:

    >>> weighted_stats.std_mean  
    0.020818822467555047
    

    Just in case you're interested in the relation between the standard error and the standard deviation: The standard error is (for ddof == 0) calculated as the weighted standard deviation divided by the square root of the sum of the weights minus 1 (corresponding source for statsmodels version 0.9 on GitHub):

    standard_error = standard_deviation / sqrt(sum(weights) - 1)
    
Tufthunter answered 7/4, 2016 at 0:57 Comment(1)
To use this approach to easily calculate the weighted coefficient of variation, see this answer.Aminoplast
L
45

Here's one more option:

np.sqrt(np.cov(values, aweights=weights))
Lulualaba answered 4/10, 2018 at 21:15 Comment(2)
This is the shortest and most elegant of the given solutionsBisexual
This is indeed clean. It is worth noting, though, that this solution is restricted to 1D weights and values and is therefore less general than the accepted answer. The two calculations agree, on 1D data, if np.cov is called with ddof=0.Globoid
T
6

There doesn't appear to be such a function in numpy/scipy yet, but there is a ticket proposing this added functionality. Included there you will find Statistics.py which implements weighted standard deviations.

Trainer answered 10/3, 2010 at 0:6 Comment(0)
T
2

There is a very good example proposed by gaborous:

import pandas as pd
import numpy as np
# X is the dataset, as a Pandas' DataFrame
# Compute the weighted sample mean (fast, efficient and precise)
mean = np.ma.average(X, axis=0, weights=weights) 

# Convert to a Pandas' Series (it's just aesthetic and more 
# ergonomic; no difference in computed values)
mean = pd.Series(mean, index=list(X.keys())) 
xm = X-mean # xm = X diff to mean
# fill NaN with 0 
# a variance of 0 is just void, but at least it keeps the other
# covariance's values computed correctly))
xm = xm.fillna(0) 
# Compute the unbiased weighted sample covariance
sigma2 = 1./(w.sum()-1) * xm.mul(w, axis=0).T.dot(xm); 

Correct equation for weighted unbiased sample covariance, URL (version: 2016-06-28)

Theressa answered 9/11, 2017 at 16:20 Comment(1)
The origin of the calculation in the final line could do with some explanation.Trepang
L
1

A follow-up to "sample" or "unbiased" standard deviation in the "frequency weights" sense since "weighted sample standard deviation python" Google search leads to this post:

def frequency_sample_std_dev(X, n):
    """
    Sample standard deviation for X and n,
    where X[i] is the quantity each person in group i has,
    and n[i] is the number of people in group i.
    See Equation 6.4 of:
    Montgomery, Douglas, C. and George C. Runger. Applied Statistics 
     and Probability for Engineers, Enhanced eText. Available from: 
      WileyPLUS, (7th Edition). Wiley Global Education US, 2018.
    """
    n_groups = len(n)
    n_people = sum(n)
    lhs_numerator = sum([ni*Xi**2 for Xi, ni in zip(X, n)])
    rhs_numerator = sum([Xi*ni for Xi, ni in zip(X,n)])**2/n_people
    denominator = n_people-1
    var = (lhs_numerator - rhs_numerator) / denominator
    std = sqrt(var)
    return std

Or modifying the answer by @Eric as follows:

def weighted_sample_avg_std(values, weights):
    """
    Return the weighted average and weighted sample standard deviation.

    values, weights -- Numpy ndarrays with the same shape.
    
    Assumes that weights contains only integers (e.g. how many samples in each group).
    
    See also https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights
    """
    average = np.average(values, weights=weights)
    variance = np.average((values-average)**2, weights=weights)
    variance = variance*sum(weights)/(sum(weights)-1)
    return (average, sqrt(variance))

print(weighted_sample_avg_std(X, n))
Lehman answered 18/9, 2021 at 0:33 Comment(2)
Thanks for this nice answer! However, for your second function weighted_sample_avg_std(), on the third line where you have the second part of the variance equation, the variance is not supposed to be multiplied by a ratio of the sums but by a ratio of the number of non-zeros weights (itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf).Sorilda
Hmm.. that's a good point. Would you mind suggesting an edit? I looked into this previously (but after you made the comment), but the actual change wasn't obvious to me.Lehman
E
0

I was just searching for an API equivalent of the numpy np.std function that also allows the axis parameter to be set:

(I just tested it with two dimensions, so feel free for improvements if something is incorrect.)

def std(values, weights=None, axis=None):
    """
    Return the weighted standard deviation.
    axis -- the axis for std calculation
    values, weights -- Numpy ndarrays with the same shape on the according axis.
    """
    average = np.expand_dims(np.average(values, weights=weights, axis=axis), axis=axis)
    # Fast and numerically precise:
    variance = np.average((values-average)**2, weights=weights, axis=axis)
    return np.sqrt(variance)

Thanks to Eric O Lebigot for the original answer.

Erkan answered 8/11, 2022 at 18:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.