How to fit the 2D scatter data with a line with C++

B

7

10

I used to work with MATLAB, and for the question I raised I can use p = polyfit(x,y,1) to estimate the best fit line for the scatter data in a plate. I was wondering which resources I can rely on to implement the line fitting algorithm with C++. I understand there are a lot of algorithms for this subject, and for me I expect the algorithm should be fast and meantime it can obtain the comparable accuracy of polyfit function in MATLAB.

Buggy answered 12/7, 2012 at 10:11 Comment(0)

S

7

I would suggest coding it from scratch. It is a very simple implementation in C++. You can code up both the intercept and gradient for least-squares fit (the same method as polyfit) from your data directly from the formulas here

http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line

These are closed form formulas that you can easily evaluate yourself using loops. If you were using higher degree fits then I would suggest a matrix library or more sophisticated algorithms but for simple linear regression as you describe above this is all you need. Matrices and linear algebra routines would be overkill for such a problem (in my opinion).

Seigniory answered 12/7, 2012 at 10:14 Comment(1)

Hmm, how does Simple Linear regression deal with data that's (almost) exactly parallel to the vertical (y) axis. It seems that non-injective functions will give issues in case of simple linear regression. In case of arbitrary scattered 2D linear data, this can, however, be possible. – Mckinnon 4/8, 2015 at 12:19

P

13

This page describes the algorithm easier than Wikipedia, without extra steps to calculate the means etc. : http://faculty.cs.niu.edu/~hutchins/csci230/best-fit.htm . Almost quoted from there, in C++ it's:

#include <vector>
#include <cmath>

struct Point {
  double _x, _y;
};
struct Line {
  double _slope, _yInt;
  double getYforX(double x) {
    return _slope*x + _yInt;
  }
  // Construct line from points
  bool fitPoints(const std::vector<Point> &pts) {
    int nPoints = pts.size();
    if( nPoints < 2 ) {
      // Fail: infinitely many lines passing through this single point
      return false;
    }
    double sumX=0, sumY=0, sumXY=0, sumX2=0;
    for(int i=0; i<nPoints; i++) {
      sumX += pts[i]._x;
      sumY += pts[i]._y;
      sumXY += pts[i]._x * pts[i]._y;
      sumX2 += pts[i]._x * pts[i]._x;
    }
    double xMean = sumX / nPoints;
    double yMean = sumY / nPoints;
    double denominator = sumX2 - sumX * xMean;
    // You can tune the eps (1e-7) below for your specific task
    if( std::fabs(denominator) < 1e-7 ) {
      // Fail: it seems a vertical line
      return false;
    }
    _slope = (sumXY - sumX * yMean) / denominator;
    _yInt = yMean - _slope * xMean;
    return true;
  }
};

Please, be aware that both this algorithm and the algorithm from Wikipedia ( http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line ) fail in case the "best" description of points is a vertical line. They fail because they use

y = k*x + b

line equation which intrinsically is not capable to describe vertical lines. If you want to cover also the cases when data points are "best" described by vertical lines, you need a line fitting algorithm which uses

A*x + B*y + C = 0

line equation. You can still modify the current algorithm to produce that equation:

y = k*x + b <=>
y - k*x - b = 0 <=>
B=1, A=-k, C=-b

In terms of the above code:

B=1, A=-_slope, C=-_yInt

And in "then" block of the if checking for denominator equal to 0, instead of // Fail: it seems a vertical line, produce the following line equation:

x = xMean <=>
x - xMean = 0 <=>
A=1, B=0, C=-xMean

I've just noticed that the original article I was referring to has been deleted. And this web page proposes a little different formula for line fitting: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html

double denominator = sumX2 - 2 * sumX * xMean + nPoints * xMean * xMean;
...
_slope = (sumXY - sumY*xMean - sumX * yMean + nPoints * xMean * yMean) / denominator;

The formulas are identical because nPoints*xMean == sumX and nPoints*xMean*yMean == sumX * yMean == sumY * xMean.

Philoctetes answered 25/1, 2015 at 9:3 Comment(5)

When x=[0..n], sumX is ( 0.5 * n + 0.5 ) * n and sumX2 is ( ( 1/3.0 * n + 1/2.0 ) * n + 1/6.0 ) * n. – Equivocal 16/1, 2016 at 3:49

@KhouriGiordano , could you elaborate on the utility of those observations? – Philoctetes 16/1, 2016 at 10:49

For data that is regularly spaced in the X dimension beginning with X = 0, rather than summing X and X squared, they can be directly calculated. E.g. pixel values have x = location and y = intensity. – Equivocal 20/1, 2016 at 23:38

Do you have a good reference for the linear-regression formula that produces the values for A, B, and C for the line equation "Ax + By + C = 0"? I'm having more trouble than I expected finding it. – Edva 28/4, 2016 at 22:28

@ulatekh, I've edited the answer to explain how to get A*x + B*y + C = 0 equation and how to handle the case of x=Const (vertical) line. – Philoctetes 29/4, 2016 at 6:26

S

7

I would suggest coding it from scratch. It is a very simple implementation in C++. You can code up both the intercept and gradient for least-squares fit (the same method as polyfit) from your data directly from the formulas here

http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line

These are closed form formulas that you can easily evaluate yourself using loops. If you were using higher degree fits then I would suggest a matrix library or more sophisticated algorithms but for simple linear regression as you describe above this is all you need. Matrices and linear algebra routines would be overkill for such a problem (in my opinion).

Seigniory answered 12/7, 2012 at 10:14 Comment(1)

Hmm, how does Simple Linear regression deal with data that's (almost) exactly parallel to the vertical (y) axis. It seems that non-injective functions will give issues in case of simple linear regression. In case of arbitrary scattered 2D linear data, this can, however, be possible. – Mckinnon 4/8, 2015 at 12:19

S

5

Equation of line is Ax + By + C=0.

So it can be easily( when B is not so close to zero ) convert to y = (-A/B)*x + (-C/B)

typedef double scalar_type;
typedef std::array< scalar_type, 2 > point_type;
typedef std::vector< point_type > cloud_type;

bool fit( scalar_type & A, scalar_type & B, scalar_type & C, cloud_type const& cloud )
{
    if( cloud.size() < 2 ){ return false; }

    scalar_type X=0, Y=0, XY=0, X2=0, Y2=0;

    for( auto const& point: cloud )
    { // Do all calculation symmetric regarding X and Y
        X  += point[0];
        Y  += point[1];
        XY += point[0] * point[1];
        X2 += point[0] * point[0];
        Y2 += point[1] * point[1];
    }

    X  /= cloud.size();
    Y  /= cloud.size();
    XY /= cloud.size();
    X2 /= cloud.size();
    Y2 /= cloud.size();

    A = - ( XY - X * Y ); //!< Common for both solution

    scalar_type Bx = X2 - X * X;
    scalar_type By = Y2 - Y * Y;

    if( fabs( Bx ) < fabs( By ) ) //!< Test verticality/horizontality
    { // Line is more Vertical.
        B = By;
        std::swap(A,B);
    }
    else
    {   // Line is more Horizontal.
        // Classical solution, when we expect more horizontal-like line
        B = Bx;
    }
    C = - ( A * X + B * Y );

    //Optional normalization:
    // scalar_type  D = sqrt( A*A + B*B );
    // A /= D;
    // B /= D;
    // C /= D;
    return true;
}

Sordello answered 1/7, 2016 at 8:38 Comment(0)

F

1

You can also use or go over this implementation there is also documentation here.

Falzetta answered 12/7, 2012 at 10:31 Comment(0)

T

1

Fitting a Line can be acomplished in different ways. Least Square means minimizing the sum of the squared distance. But you could take another cost function as example the (not squared) distance. But normaly you use the squred distance (Least Square). There is also a possibility to define the distance in different ways. Normaly you just use the "y"-axis for the distance. But you could also use the total/orthogonal distance. There the distance is calculated in x- and y-direction. This can be a better fit if you have also errors in x direction (let it be the time of measurment) and you didn't start the measurment on the exact time you saved in the data. For Least Square and Total Least Square Line fit exist algorithms in closed form. So if you fitted with one of those you will get the line with the minimal sum of the squared distance to the datapoints. You can't fit a better line in the sence of your defenition. You could just change the definition as examples taking another cost function or defining distance in another way.

There is a lot of stuff about fitting models into data you could think of, but normaly they all use the "Least Square Line Fit" and you should be fine most times. But if you have a special case it can be necessary to think about what your doing. Taking Least Square done in maybe a few minutes. Thinking about what Method fits you best to the problem envolves understanding the math, which can take indefinit time :-).

Trigraph answered 21/1, 2015 at 17:17 Comment(0)

C

1

Note: This answer is NOT AN ANSWER TO THIS QUESTION but to this one "Line closest to a set of points" that has been flagged as "duplicate" of this one (incorrectly in my opinion), no way to add new answers to it.

The question asks for:

Find the line whose distance from all the points is minimum ? By distance I mean the shortest distance between the point and the line.

The most usual interpretation of distance "between the point and the line" is the euclidean distance and the most common interpretation of "from all points" is the sum of distances (in absolute or squared value).

When the target is minimize the sum of squared euclidean distances, the linear regression (LST) is not the algorithm to use. In addition, linear regression can not result in a vertical line. The algorithm to be used is the "total least squares". See by example wikipedia for the problem description and this answer in math stack exchange for details about the formulation.

Chari answered 11/2, 2019 at 21:4 Comment(1)

I concur that this is an entirely different question. – Blockhead 4/10, 2019 at 14:47

U

0

to fit a line y=param[0]x+param[1] simply do this:

// loop over data:
{               
sum_x += x[i];
sum_y += y[i];
sum_xy += x[i] * y[i];
sum_x2 += x[i] * x[i];
}

// means
double mean_x = sum_x / ninliers;
double mean_y = sum_y / ninliers;

float varx = sum_x2 - sum_x * mean_x;
float cov = sum_xy - sum_x * mean_y;

// check for zero varx

param[0] = cov / varx;
param[1] = mean_y - param[0] * mean_x;

More on the topic http://easycalculation.com/statistics/learn-regression.php (formulas are the same, they just multiplied and divided by N, a sample sz.). If you want to fit plane to 3D data use a similar approach - http://www.mymathforum.com/viewtopic.php?f=13&t=8793

Disclaimer: all quadratic fits are linear and optimal in a sense that they reduce the noise in parameters. However, you might interested in the reducing noise in the data instead. You might also want to ignore outliers since they can bia s your solutions greatly. Both problems can be solved with RANSAC. See my post at:

Unending answered 5/4, 2013 at 17:31 Comment(0)

Recommended topics

Hot tags