Sub-quadratic algorithm for fitting a curve with two lines

Asked 19/6, 2020 at 21:30 Answered 21/6, 2020 at 16:33

Solved c++algorithm linear-regression curve-fitting

The problem is to find the best fit of a real-valued 2D curve (given by the set of points) with a polyline consisting of two lines.

A brute-force approach would be to find the "left" and "right" linear fits for each point of the curve and pick the pair with minimum error. I can calculate the two linear fits incrementally while iterating through the points of the curve, but I can't find a way to incrementally calculate the error. Thus this approach yields to a quadratic complexity.

The question is if there is an algorithm that will provide sub-quadratic complexity?

The second question is if there is a handy C++ library for such algorithms?

EDIT For fitting with a single line, there are formulas:

m = (Σxiyi - ΣxiΣyi/N) / (Σxi2 - (Σxi)2/N)
b = Σyi/N - m * Σxi/N

where m is the slope and b is the offset of the line. Having such a formula for the fit error would solve the problem in the best way.

Stolzer answered 19/6, 2020 at 21:30 Comment(10)

There's a formula for coming stdev/rms from the sums of the squares. Show an MCVE please – Quintessence 19/6, 2020 at 21:33

@MadPhysicist You mean calculating rms from the line without having the actual line? – Stolzer 19/6, 2020 at 21:43

Use stdev x = sum(x_i^2)/N - mean(X)^2 – Quintessence 19/6, 2020 at 21:45

Sort of. All the quantities m, b, rms, etc, can be expressed in terms of sum(x), sum(x^2), N – Quintessence 19/6, 2020 at 21:48

@MadPhysicist What is the formula for rms? Couldn't find it. – Stolzer 19/6, 2020 at 21:50

I'll work out the equations for you if you like. Do you mind if I use Python notation? It should be pretty trivial to translate to C++ – Quintessence 19/6, 2020 at 22:52

Show what you have done with " I can calculate the two linear fits incrementally while iterating through the points of the curve," – Sufflate 20/6, 2020 at 1:5

@chux-ReinstateMonica That's already illustrated in the accepted answer. I was aware of m = (Σxiyi - ΣxiΣyi/N) / (Σxi2 - (Σxi)2/N) and b = Σyi/N - m * Σxi/N but I was unaware of e = Σyi2 + m2 * Σxi2 + N * b2 - m * Σxiyi - b * Σyi + m * b * Σxi. That's totally and in the best way answers my question. – Stolzer 20/6, 2020 at 7:42

@Stolzer Formula used makes sense if y depends on x and the best fit reduces the error from (x, y) to (x,curve(x)). If x and y are independent, then a different error calculation is needed. – Sufflate 20/6, 2020 at 10:26

@chux-ReinstateMonica There is just a set of points. No notion of dependency or independence between the coordinates. – Stolzer 20/6, 2020 at 11:23

Disclaimer: I don't feel like figuring out how to do this in C++, so I will use Python (numpy) notation. The concepts are completely transferable, so you should have no trouble translating back to the language of your choice.

Let's say that you have a pair of arrays, x and y, containing the data points, and that x is monotonically increasing. Let's also say that you will always select a partition point that leaves at least two elements in each partition, so the equations are solvable.

Now you can compute some relevant quantities:

N = len(x)

sum_x_left = x[0]
sum_x2_left = x[0] * x[0]
sum_y_left = y[0]
sum_y2_left = y[0] * y[0]
sum_xy_left = x[0] * y[0]

sum_x_right = x[1:].sum()
sum_x2_right = (x[1:] * x[1:]).sum()
sum_y_right = y[1:].sum()
sum_y2_right = (y[1:] * y[1:]).sum()
sum_xy_right = (x[1:] * y[1:]).sum()

The reason that we need these quantities (which are O(N) to initialize) is that you can use them directly to compute some well known formulae for the parameters of a linear regression. For example, the optimal m and b for y = m * x + b is given by

μ_x = Σx_i/N
μ_y = Σy_i/N
m = Σ(x_i - μ_x)(y_i - μ_y) / Σ(x_i - μ_x)²
b = μ_y - m * μ_x

The sum of squared errors is given by

e = Σ(y_i - m * x_i - b)²

These can be expanded using simple algebra into the following:

m = (Σx_iy_i - Σx_iΣy_i/N) / (Σx_i² - (Σx_i)²/N)
b = Σy_i/N - m * Σx_i/N
e = Σy_i² + m² * Σx_i² + N * b² - 2 * m * Σx_iy_i - 2 * b * Σy_i + 2 * m * b * Σx_i

You can therefore loop over all the possibilities and record the minimal e:

for p in range(1, N - 3):
    # shift sums: O(1)
    sum_x_left += x[p]
    sum_x2_left += x[p] * x[p]
    sum_y_left += y[p]
    sum_y2_left += y[p] * y[p]
    sum_xy_left += x[p] * y[p]

    sum_x_right -= x[p]
    sum_x2_right -= x[p] * x[p]
    sum_y_right -= y[p]
    sum_y2_right -= y[p] * y[p]
    sum_xy_right -= x[p] * y[p]

    # compute err: O(1)
    n_left = p + 1
    slope_left = (sum_xy_left - sum_x_left * sum_y_left * n_left) / (sum_x2_left - sum_x_left * sum_x_left / n_left)
    intercept_left = sum_y_left / n_left - slope_left * sum_x_left / n_left
    err_left = sum_y2_left + slope_left * slope_left * sum_x2_left + n_left * intercept_left * intercept_left - 2 * (slope_left * sum_xy_left + intercept_left * sum_y_left - slope_left * intercept_left * sum_x_left)

    n_right = N - n_left
    slope_right = (sum_xy_right - sum_x_right * sum_y_right * n_right) / (sum_x2_right - sum_x_right * sum_x_right / n_right)
    intercept_right = sum_y_right / n_right - slope_right * sum_x_right / n_right
    err_right = sum_y2_right + slope_right * slope_right * sum_x2_right + n_right * intercept_right * intercept_right - 2 * (slope_right * sum_xy_right + intercept_right * sum_y_right - slope_right * intercept_right * sum_x_right)

    err = err_left + err_right
    if p == 1 || err < err_min
        err_min = err
        n_min_left = n_left
        n_min_right = n_right
        slope_min_left = slope_left
        slope_min_right = slope_right
        intercept_min_left = intercept_left
        intercept_min_right = intercept_right

There are probably other simplifications you can make, but this is sufficient to have an O(n) algorithm.

Quintessence answered 20/6, 2020 at 0:38 Comment(1)

The last three terms in the formula of e has to have a factor of 2. Can you please correct the answer? – Stolzer 26/11, 2021 at 7:57

In case it helps here's some C code that I've used for this sort of thing. It adds little to what Mad Physicist said.

First off, a formula. If you fit a line y^ : x->a*x+b through some points, then the error is given by:

E = Sum{ sqr(y[i]-y^(x[i])) }/ N = Vy - Cxy*Cxy/Vx
where 
Vx is the variance of the xs
Vy that of the ys 
Cxy the covariance of the as and the ys

The code below uses a structure that holds the means, the variances, the covariance and the count.

The function moms_acc_pt() updates these when you add a new point. The function moms_line() returns a and b for the line, and the error as above. The fmax(0,) on the return is in case of near perfect fits where rounding error could send the (mathematically non-negative) result negative.

While it would be possible to have a function that removes a point from a momentsT, I find it easier deal with deciding which momentsT to add a point to by taking copies, accumulating the point in the copies, getting the lines and then keeping the copy for the side the point fits best in, and the original for the other

typedef struct
{   int n;      // number points
    double  xbar,ybar;  // means of x,y
    double  Vx, Vy;     // variances of x,y
    double  Cxy;        // covariance of x,y
}   momentsT;

// update the moments to include the point x,y
void    moms_acc_pt( momentsT* M, double x, double y)
{   M->n += 1;
double  f = 1.0/M->n;
double  dx = x-M->xbar;
double  dy = y-M->ybar;
    M->xbar += f*dx;
    M->ybar += f*dy;
double  g = 1.0 - f;
    M->Vx   = g*(M->Vx  + f*dx*dx);
    M->Cxy  = g*(M->Cxy + f*dx*dy);
    M->Vy   = g*(M->Vy  + f*dy*dy);
}

// return the moments for the combination of A and B (assumed disjoint)
momentsT    moms_combine( const momentsT* A, const momentsT* B)
{
momentsT    C;
    C.n = A->n + B->n;
double  alpha = (double)A->n/(double)C.n;
double  beta = (double)B->n/(double)C.n;
    C.xbar = alpha*A->xbar + beta*B->xbar;
    C.ybar = alpha*A->ybar + beta*B->ybar;
double  dx = A->xbar - B->xbar;
double  dy = A->ybar - B->ybar;
    C.Vx = alpha*A->Vx + beta*B->Vx + alpha*beta*dx*dx;
    C.Cxy= alpha*A->Cxy+ beta*B->Cxy+ alpha*beta*dx*dy;
    C.Vy = alpha*A->Vy + beta*B->Vy + alpha*beta*dy*dy;
    return C;
}

// line is y^ : x -> a*x + b; return Sum{ sqr( y[i] - y^(x[i])) }/N
double  moms_line( momentsT* M, double* a, double *b)
{   *a = M->Cxy/M->Vx;
    *b = M->ybar - *a*M->xbar;
    return fmax( 0.0, M->Vy - *a*M->Cxy);
}

Addlebrained answered 21/6, 2020 at 16:33 Comment(0)

Recommended topics

Hot tags