Calculate mean and standard deviation from a vector of samples in C++ using Boost
Asked Answered
R

10

105

Is there a way to calculate mean and standard deviation for a vector containing samples using Boost?

Or do I have to create an accumulator and feed the vector into it?

Repartition answered 30/9, 2011 at 21:59 Comment(0)
M
58

Using accumulators is the way to compute means and standard deviations in Boost.

accumulator_set<double, stats<tag::variance> > acc;
for_each(a_vec.begin(), a_vec.end(), bind<void>(ref(acc), _1));

cout << mean(acc) << endl;
cout << sqrt(variance(acc)) << endl;

 

Megavolt answered 30/9, 2011 at 22:48 Comment(2)
Note, that tag::variance calculates variance by an approximate formula. tag::variance(lazy) calculates by an exact formula, specifically: second moment - squared mean which will produce incorrect result if variance is very small because of rounding errors. It can actually produce negative variance.Halftrack
Use the recursive (online) algorithm if you know you are going to have a lots of numbers. This will take care of both under and overflow problems.Exarchate
H
249

I don't know if Boost has more specific functions, but you can do it with the standard library.

Given std::vector<double> v, this is the naive way:

#include <numeric>

double sum = std::accumulate(v.begin(), v.end(), 0.0);
double mean = sum / v.size();

double sq_sum = std::inner_product(v.begin(), v.end(), v.begin(), 0.0);
double stdev = std::sqrt(sq_sum / v.size() - mean * mean);

This is susceptible to overflow or underflow for huge or tiny values. A slightly better way to calculate the standard deviation is:

double sum = std::accumulate(v.begin(), v.end(), 0.0);
double mean = sum / v.size();

std::vector<double> diff(v.size());
std::transform(v.begin(), v.end(), diff.begin(),
               std::bind2nd(std::minus<double>(), mean));
double sq_sum = std::inner_product(diff.begin(), diff.end(), diff.begin(), 0.0);
double stdev = std::sqrt(sq_sum / v.size());

UPDATE for C++11:

The call to std::transform can be written using a lambda function instead of std::minus and std::bind2nd(now deprecated):

std::transform(v.begin(), v.end(), diff.begin(), [mean](double x) { return x - mean; });
Hertzog answered 30/9, 2011 at 22:42 Comment(13)
Does the bottom section of code depend on top? I.e, is mean in the second section same as mean defined in first section?Glynn
Yes; obviously, the bottom part depends on the value of mean calculated in the top part.Hertzog
The first set of equations does not work. I put int 10 & 2, and got an output of 4. At a glance I think it's b/c it assumes that (a-b)^2 = a^2-b^2Legion
@CharlesL.: It should work, and 4 is the correct answer.Hertzog
@StudentT: No, but you can substitute (v.size() - 1) for v.size() in the last line above: std::sqrt(sq_sum / (v.size() - 1)). (For the first method, it's a little complicated: std::sqrt(sq_sum / (v.size() - 1) - mean * mean * v.size() / (v.size() - 1)).Hertzog
@JasonParham, @HumamHelfawi: Of course; that goes without saying. And the same for <cmath>, <functional>, <vector>, <algorithm>, etc.Hertzog
Using std::inner_product for sum of squares is very neat.Marilee
Is this guaranteed to use simd if -fopenmp is added? or is a for loop better for the accumulations?Nicoline
@bordeo: You may refer to STL algorithms and concurrent programming, and also Parallelized versions of existing algorithms.Hertzog
i can confirm firsthand that the first implementation does overflow/underflow for tiny numbers. i had to change to the second implementation and then i did not get NAN value for the standard deviation. The two extra lines of code are worth it to avoid overflow/underflow!Gassy
Computing the standard deviation from N samples requires an estimator, so a division by (N-1) elements, not N. The statistical reason for this is that one degree of freedom is already taken to compute the mean of N samples. So the correct formula for computing stdev in you last line should be: double stdev = std::sqrt( sq_sum / (v.size()-1) ); which could make quite a difference for small N.Sleek
@CharlesL.: You will probably be confused if you just "take a glance". Grab a paper and do some basic math calculation, you would eventually come up to the same formula stated in the first set.Refugee
I probably should not have speculated on why, but the answer I got was not what I expected. As I think @Hertzog alluded to in reply to @StudentT, turns out there are two types of standard deviations. This code did not return what I expected. For [10, 2], the population standard dev is 4, but the sample standard dev is 5.66. The difference between the two is dividing by n vs n-1. This returns the population standard dev, so if this detail matters, make sure to adjust the code accordinglyLegion
B
84

If performance is important to you, and your compiler supports lambdas, the stdev calculation can be made faster and simpler: In tests with VS 2012 I've found that the following code is over 10 X quicker than the Boost code given in the chosen answer; it's also 5 X quicker than the safer version of the answer using standard libraries given by musiphil.

Note I'm using sample standard deviation, so the below code gives slightly different results (Why there is a Minus One in Standard Deviations)

double sum = std::accumulate(std::begin(v), std::end(v), 0.0);
double m =  sum / v.size();

double accum = 0.0;
std::for_each (std::begin(v), std::end(v), [&](const double d) {
    accum += (d - m) * (d - m);
});

double stdev = sqrt(accum / (v.size()-1));
Biparous answered 13/9, 2012 at 12:1 Comment(8)
Thanks for sharing this answer even a year later. Now I come another year later and made this one generic for both the value type and the container type. See here (Note: I guess that my range-based for loop is as fast as your lambda code.)Cumae
what is the difference between using std::end(v) instead of v.end()?Astromancy
The std::end() function was added by C++11 standard for cases when there is nothing like v.end(). The std::end can be overloaded for the less standard container -- see en.cppreference.com/w/cpp/iterator/endPaphian
Can you explain why is this faster?Parsons
@Parsons Without being facetious, it's because it has a lot less code in it - trace the Boost code as it runs and you'll see. Algorithmically they're all pretty much the same. The reason it's faster than musiphil's answer is the same as for Boost -- an average compiler has no trouble optimising the lambda call away in my Answer, and then removing the invariants and temporaries in the inner loop.Biparous
I was actually trying to understand why it's 5x faster than the safer answer. Your answer is easy to understand, and with my limited knowledge of statistics I know it will work. I know the time complexity is the same. When you say faster than musiphil's safer answer, are you referring to his non-naive (2nd) answer? Or is it faster than the naive(1st) answer?Parsons
Well for one thing, the "safe" answer (which is like my answer) makes 3 passes through the array: Once for the sum, once for the diff-mean, and once for the squaring. In my code there's only 2 passes -- It's conflating the second two passes into one. And (when I last looked, quite a while ago now!) the inner_product calls were not optimized away. In addition the "safe" code copies v into an entirely new array of diffs, which adds more delay. In my opinion my code is more readable too - and is easily ported to JavaScript and other languages :)Biparous
@Cumae Thank you for sharing! Just one correction, the code you shared computes the standard deviation. Not the variance. The function name should reflected that.Parrott
M
58

Using accumulators is the way to compute means and standard deviations in Boost.

accumulator_set<double, stats<tag::variance> > acc;
for_each(a_vec.begin(), a_vec.end(), bind<void>(ref(acc), _1));

cout << mean(acc) << endl;
cout << sqrt(variance(acc)) << endl;

 

Megavolt answered 30/9, 2011 at 22:48 Comment(2)
Note, that tag::variance calculates variance by an approximate formula. tag::variance(lazy) calculates by an exact formula, specifically: second moment - squared mean which will produce incorrect result if variance is very small because of rounding errors. It can actually produce negative variance.Halftrack
Use the recursive (online) algorithm if you know you are going to have a lots of numbers. This will take care of both under and overflow problems.Exarchate
N
8

Improving on the answer by musiphil, you can write a standard deviation function without the temporary vector diff, just using a single inner_product call with the C++11 lambda capabilities:

double stddev(std::vector<double> const & func)
{
    double mean = std::accumulate(func.begin(), func.end(), 0.0) / func.size();
    double sq_sum = std::inner_product(func.begin(), func.end(), func.begin(), 0.0,
        [](double const & x, double const & y) { return x + y; },
        [mean](double const & x, double const & y) { return (x - mean)*(y - mean); });
    return std::sqrt(sq_sum / func.size() - 1);
}

I suspect doing the subtraction multiple times is cheaper than using up additional intermediate storage, and I think it is more readable, but I haven't tested the performance yet.

As for explanation why to use N-1 (as in func.size() - 1), see these questions - note how the question states we have a "vector containing samples".

Newish answered 13/8, 2018 at 13:31 Comment(5)
I think this is computing the variance, not the standard deviation.Succory
The std deviation is calculed dividing by N and not by N-1. Why do you divide the sq_sum by func.size()-1?J
Why N-1? You divide by N when computing a population std dev, and divide by N-1 when computing a sample std dev. stats.stackexchange.com/q/3931 stats.stackexchange.com/q/485326Bluma
@JonathonReinhart you mean to say initial computation using N-1 was correct, right?Newish
I'm saying that they are both correct, depending on what you are calculating.Bluma
W
7

It seems the following elegant recursive solution has not been mentioned, although it has been around for a long time. Referring to Knuth's Art of Computer Programming,

mean_1 = x_1, variance_1 = 0;            //initial conditions; edge case;

//for k >= 2, 
mean_k     = mean_k-1 + (x_k - mean_k-1) / k;
variance_k = variance_k-1 + (x_k - mean_k-1) * (x_k - mean_k);

then for a list of n>=2 values, the estimate of the standard deviation is:

stddev = std::sqrt(variance_n / (n-1)). 

Hope this helps!

Wont answered 14/3, 2019 at 6:4 Comment(1)
This is pretty cool. I implemented it with an index loop ( pastebin.com/aRd1ChjD ), but it runs three times slower than the stl based solution.Rodolforodolph
F
1

My answer is similar as Josh Greifer but generalised to sample covariance. Sample variance is just sample covariance but with the two inputs identical. This includes Bessel's correlation.

    template <class Iter> typename Iter::value_type cov(const Iter &x, const Iter &y)
    {
        double sum_x = std::accumulate(std::begin(x), std::end(x), 0.0);
        double sum_y = std::accumulate(std::begin(y), std::end(y), 0.0);

        double mx =  sum_x / x.size();
        double my =  sum_y / y.size();

        double accum = 0.0;

        for (auto i = 0; i < x.size(); i++)
        {
            accum += (x.at(i) - mx) * (y.at(i) - my);
        }

        return accum / (x.size() - 1);
    }
Fer answered 22/4, 2015 at 12:38 Comment(0)
H
0

2x faster than the versions before mentioned - mostly because transform() and inner_product() loops are joined. Sorry about my shortcut/typedefs/macro: Flo = float. CR const ref. VFlo - vector. Tested in VS2010

#define fe(EL, CONTAINER)   for each (auto EL in CONTAINER)  //VS2010
Flo stdDev(VFlo CR crVec) {
    SZ  n = crVec.size();               if (n < 2) return 0.0f;
    Flo fSqSum = 0.0f, fSum = 0.0f;
    fe(f, crVec) fSqSum += f * f;       // EDIT: was Cit(VFlo, crVec) {
    fe(f, crVec) fSum   += f;
    Flo fSumSq      = fSum * fSum;
    Flo fSumSqDivN  = fSumSq / n;
    Flo fSubSqSum   = fSqSum - fSumSqDivN;
    Flo fPreSqrt    = fSubSqSum / (n - 1);
    return sqrt(fPreSqrt);
}
Humiliating answered 25/10, 2017 at 2:46 Comment(4)
Can the Cit() loop be written as for( float f : crVec ) { fSqSum += f * f; fSum += f; } ?Towe
Yes in C++11. Trying to use macros that make it version independent. Updated the code. PS. For readability I usually prefer 1 action per LOC. Compiler should see that those are constant iterations and join them if it "thinks" it's faster to iterate once. Doing it in small short steps (without using std::inner_product() e.g.), kind of assembly-style, explains to new reader what it means. Binary will be smaller by side-effect (in some cases).Humiliating
"Trying to use macros that make it version independent" - yet you limit yourself to the non-standard Visual C++ "for each" construct (#197875)Newish
@Newish It's just 1 macro for an illustration for 1 version of C++ for that post only. The was the algorithm - not coding std. Back then I used even uglier Cit(CFlo, crVec), which had default const-iter "cit", but re-indicates the container type. List of all compiler/OS-specific macros, is good when portability is in questions. In examples with boost it's also not easy to port it to std C++. I didn't explain the ugly short Flo, VFlo, CR, SZ neither -> float, vector<float>, const&, size - for shorten iteration lines of std C++. Same style Crit(MSZPFlo, crMap) foo(*crit.second); //rev-iterHumiliating
O
0

In order to calculate the sample mean with a better presicion the following r-step recursion can be used:

mean_k=1/k*[(k-r)*mean_(k-r) + sum_over_i_from_(n-r+1)_to_n(x_i)],

where r is chosen to make summation components closer to each other.

Olympias answered 24/6, 2021 at 13:37 Comment(0)
Q
-4

Create your own container:

template <class T>
class statList : public std::list<T>
{
    public:
        statList() : std::list<T>::list() {}
        ~statList() {}
        T mean() {
           return accumulate(begin(),end(),0.0)/size();
        }
        T stddev() {
           T diff_sum = 0;
           T m = mean();
           for(iterator it= begin(); it != end(); ++it)
               diff_sum += ((*it - m)*(*it -m));
           return diff_sum/size();
        }
};

It does have some limitations, but it works beautifully when you know what you are doing.

Quintessence answered 8/8, 2016 at 22:50 Comment(2)
To answer the question: because there’s absolutely no need. Creating your own container has absolutely no benefits compared to writing a free function.Karole
I don't even know where to start with this. You're using a list as the underlying data structure, you don't even cache the values, which would be one of the few reasons I can think of to use a container-like structure. Especially if the values chance infrequently and the mean/stddev are needed often.Alcmene
B
-7

//means deviation in c++

/A deviation that is a difference between an observed value and the true value of a quantity of interest (such as a population mean) is an error and a deviation that is the difference between the observed value and an estimate of the true value (such an estimate may be a sample mean) is a residual. These concepts are applicable for data at the interval and ratio levels of measurement./

#include <iostream>
#include <conio.h>
using namespace std;

/* run this program using the console pauser or add your own getch,     system("pause") or input loop */

int main(int argc, char** argv)
{
int i,cnt;
cout<<"please inter count:\t";
cin>>cnt;
float *num=new float [cnt];
float   *s=new float [cnt];
float sum=0,ave,M,M_D;

for(i=0;i<cnt;i++)
{
    cin>>num[i];
    sum+=num[i];    
}
ave=sum/cnt;
for(i=0;i<cnt;i++)
{
s[i]=ave-num[i];    
if(s[i]<0)
{
s[i]=s[i]*(-1); 
}
cout<<"\n|ave - number| = "<<s[i];  
M+=s[i];    
}
M_D=M/cnt;
cout<<"\n\n Average:             "<<ave;
cout<<"\n M.D(Mean Deviation): "<<M_D;
getch();
return 0;

}

Boldt answered 7/8, 2016 at 8:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.