Sub-quadratic algorithm for fitting a curve with two lines

The problem is to find the best fit of a real-valued 2D curve (given as a set of points) by a polyline consisting of two lines.

A brute-force approach would be to compute the "left" and "right" linear fits for each split point of the curve and pick the pair with minimum error. I can update the two linear fits incrementally while iterating through the points of the curve, but I can't find a way to incrementally calculate the error, so this approach yields quadratic complexity.

The first question is whether there is an algorithm with sub-quadratic complexity.

The second question is whether there is a handy C++ library for such algorithms.


EDIT: For fitting with a single line, there are formulas:

m = (Σxiyi - ΣxiΣyi/N) / (Σxi² - (Σxi)²/N)
b = Σyi/N - m * Σxi/N

where m is the slope and b is the offset of the line. Having such a formula for the fit error would solve the problem in the best way.
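
As a quick sanity check of these formulas, here is a small numpy sketch (the helper name `line_fit` is just for illustration):

```python
import numpy as np

def line_fit(x, y):
    # least-squares slope and intercept from the plain sums
    N = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x * x).sum(), (x * y).sum()
    m = (sxy - sx * sy / N) / (sxx - sx * sx / N)
    b = sy / N - m * sx / N
    return m, b

# points lying exactly on y = 2x + 1 are recovered exactly
m, b = line_fit(np.array([0., 1., 2., 3.]), np.array([1., 3., 5., 7.]))
```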

Stolzer answered 19/6, 2020 at 21:30 Comment(10)
There's a formula for computing stdev/rms from the sums of the squares. Show an MCVE pleaseQuintessence
@MadPhysicist You mean calculating rms from the line without having the actual line?Stolzer
Use var(x) = sum(x_i^2)/N - mean(x)^2Quintessence
Sort of. All the quantities m, b, rms, etc, can be expressed in terms of sum(x), sum(x^2), NQuintessence
@MadPhysicist What is the formula for rms? Couldn't find it.Stolzer
I'll work out the equations for you if you like. Do you mind if I use Python notation? It should be pretty trivial to translate to C++Quintessence
Show what you have done with " I can calculate the two linear fits incrementally while iterating through the points of the curve,"Sufflate
@chux-ReinstateMonica That's already illustrated in the accepted answer. I was aware of m = (Σxiyi - ΣxiΣyi/N) / (Σxi² - (Σxi)²/N) and b = Σyi/N - m * Σxi/N, but I was unaware of e = Σyi² + m² * Σxi² + N * b² - m * Σxiyi - b * Σyi + m * b * Σxi. That fully answers my question.Stolzer
@Stolzer Formula used makes sense if y depends on x and the best fit reduces the error from (x, y) to (x,curve(x)). If x and y are independent, then a different error calculation is needed.Sufflate
@chux-ReinstateMonica There is just a set of points. No notion of dependency or independence between the coordinates.Stolzer

Disclaimer: I don't feel like figuring out how to do this in C++, so I will use Python (numpy) notation. The concepts are completely transferable, so you should have no trouble translating back to the language of your choice.

Let's say that you have a pair of arrays, x and y, containing the data points, and that x is monotonically increasing. Let's also say that you will always select a partition point that leaves at least two elements in each partition, so the equations are solvable.

Now you can compute some relevant quantities:

N = len(x)

# running sums for the left partition (initially just the first point)
sum_x_left = x[0]
sum_x2_left = x[0] * x[0]
sum_y_left = y[0]
sum_y2_left = y[0] * y[0]
sum_xy_left = x[0] * y[0]

# running sums for the right partition (points 1 .. N-1)
sum_x_right = x[1:].sum()
sum_x2_right = (x[1:] * x[1:]).sum()
sum_y_right = y[1:].sum()
sum_y2_right = (y[1:] * y[1:]).sum()
sum_xy_right = (x[1:] * y[1:]).sum()

The reason we need these quantities (which are O(N) to initialize) is that you can plug them directly into some well-known formulas for the parameters of a linear regression. For example, the optimal m and b for y = m * x + b are given by

μx = Σxi/N
μy = Σyi/N
m = Σ(xi - μx)(yi - μy) / Σ(xi - μx)²
b = μy - m * μx

The sum of squared errors is given by

e = Σ(yi - m * xi - b)²

These can be expanded using simple algebra into the following:

m = (Σxiyi - ΣxiΣyi/N) / (Σxi² - (Σxi)²/N)
b = Σyi/N - m * Σxi/N
e = Σyi² + m² * Σxi² + N * b² - 2 * m * Σxiyi - 2 * b * Σyi + 2 * m * b * Σxi

You can therefore loop over all the possibilities and record the minimal e:

for p in range(1, N - 2):  # leaves at least two points on each side
    # shift sums: O(1)
    sum_x_left += x[p]
    sum_x2_left += x[p] * x[p]
    sum_y_left += y[p]
    sum_y2_left += y[p] * y[p]
    sum_xy_left += x[p] * y[p]

    sum_x_right -= x[p]
    sum_x2_right -= x[p] * x[p]
    sum_y_right -= y[p]
    sum_y2_right -= y[p] * y[p]
    sum_xy_right -= x[p] * y[p]

    # compute err: O(1)
    n_left = p + 1
    slope_left = (sum_xy_left - sum_x_left * sum_y_left / n_left) / (sum_x2_left - sum_x_left * sum_x_left / n_left)
    intercept_left = sum_y_left / n_left - slope_left * sum_x_left / n_left
    err_left = sum_y2_left + slope_left * slope_left * sum_x2_left + n_left * intercept_left * intercept_left - 2 * (slope_left * sum_xy_left + intercept_left * sum_y_left - slope_left * intercept_left * sum_x_left)

    n_right = N - n_left
    slope_right = (sum_xy_right - sum_x_right * sum_y_right / n_right) / (sum_x2_right - sum_x_right * sum_x_right / n_right)
    intercept_right = sum_y_right / n_right - slope_right * sum_x_right / n_right
    err_right = sum_y2_right + slope_right * slope_right * sum_x2_right + n_right * intercept_right * intercept_right - 2 * (slope_right * sum_xy_right + intercept_right * sum_y_right - slope_right * intercept_right * sum_x_right)

    err = err_left + err_right
    if p == 1 or err < err_min:
        err_min = err
        n_min_left = n_left
        n_min_right = n_right
        slope_min_left = slope_left
        slope_min_right = slope_right
        intercept_min_left = intercept_left
        intercept_min_right = intercept_right

There are probably other simplifications you can make, but this is sufficient for an O(N) algorithm.
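
For reference, the same approach can be condensed into a single self-contained function using prefix sums instead of in-place updates (a sketch only; the name `fit_two_lines` and the return layout are my own):

```python
import numpy as np

def fit_two_lines(x, y):
    """Return (err, split, (m_left, b_left), (m_right, b_right))."""
    N = len(x)
    # Prefix sums give each side's sum(x), sum(y), sum(x^2), sum(y^2),
    # sum(x*y) in O(1) per candidate split.
    cx, cy = np.cumsum(x), np.cumsum(y)
    cxx, cyy, cxy = np.cumsum(x * x), np.cumsum(y * y), np.cumsum(x * y)

    def fit(sx, sy, sxx, syy, sxy, n):
        # slope, intercept, and sum of squared errors from the sums
        m = (sxy - sx * sy / n) / (sxx - sx * sx / n)
        b = sy / n - m * sx / n
        e = syy + m * m * sxx + n * b * b - 2 * (m * sxy + b * sy - m * b * sx)
        return m, b, e

    best = None
    for p in range(2, N - 1):  # left = points [0, p), right = [p, N); both >= 2
        ml, bl, el = fit(cx[p-1], cy[p-1], cxx[p-1], cyy[p-1], cxy[p-1], p)
        mr, br, er = fit(cx[-1] - cx[p-1], cy[-1] - cy[p-1], cxx[-1] - cxx[p-1],
                         cyy[-1] - cyy[p-1], cxy[-1] - cxy[p-1], N - p)
        if best is None or el + er < best[0]:
            best = (el + er, p, (ml, bl), (mr, br))
    return best

# V-shaped data: y = -x + 2 on the left, y = x - 2 on the right.
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([2., 1., 0., 1., 2., 3.])
err, p, (ml, bl), (mr, br) = fit_two_lines(x, y)
```

On this data both segments can be fit exactly, so the combined error comes out (numerically) zero.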

Quintessence answered 20/6, 2020 at 0:38 Comment(1)
The last three terms in the formula for e have to have a factor of 2. Can you please correct the answer?Stolzer

In case it helps here's some C code that I've used for this sort of thing. It adds little to what Mad Physicist said.

First off, a formula. If you fit a line y^ : x->a*x+b through some points, then the error is given by:

E = Sum{ sqr(y[i] - y^(x[i])) } / N = Vy - Cxy*Cxy/Vx
where
    Vx is the variance of the xs,
    Vy that of the ys, and
    Cxy the covariance of the xs and the ys.

The code below uses a structure that holds the means, the variances, the covariance and the count.

The function moms_acc_pt() updates these when you add a new point. The function moms_line() returns a and b for the line, plus the error as above. The fmax(0, ...) on the return guards against near-perfect fits, where rounding error could send the (mathematically non-negative) result negative.

While it would be possible to have a function that removes a point from a momentsT, I find it easier to decide which momentsT a point should be added to by taking copies, accumulating the point into the copies, getting the lines, and then keeping the copy for the side the point fits best and the original for the other.

#include <math.h>   // for fmax

typedef struct
{   int n;      // number of points
    double  xbar,ybar;  // means of x,y
    double  Vx, Vy;     // variances of x,y
    double  Cxy;        // covariance of x,y
}   momentsT;

// update the moments to include the point x,y
void    moms_acc_pt( momentsT* M, double x, double y)
{   M->n += 1;
double  f = 1.0/M->n;
double  dx = x-M->xbar;
double  dy = y-M->ybar;
    M->xbar += f*dx;
    M->ybar += f*dy;
double  g = 1.0 - f;
    M->Vx   = g*(M->Vx  + f*dx*dx);
    M->Cxy  = g*(M->Cxy + f*dx*dy);
    M->Vy   = g*(M->Vy  + f*dy*dy);
}

// return the moments for the combination of A and B (assumed disjoint)
momentsT    moms_combine( const momentsT* A, const momentsT* B)
{
momentsT    C;
    C.n = A->n + B->n;
double  alpha = (double)A->n/(double)C.n;
double  beta = (double)B->n/(double)C.n;
    C.xbar = alpha*A->xbar + beta*B->xbar;
    C.ybar = alpha*A->ybar + beta*B->ybar;
double  dx = A->xbar - B->xbar;
double  dy = A->ybar - B->ybar;
    C.Vx = alpha*A->Vx + beta*B->Vx + alpha*beta*dx*dx;
    C.Cxy= alpha*A->Cxy+ beta*B->Cxy+ alpha*beta*dx*dy;
    C.Vy = alpha*A->Vy + beta*B->Vy + alpha*beta*dy*dy;
    return C;
}

// line is y^ : x -> a*x + b; return Sum{ sqr( y[i] - y^(x[i])) }/N
double  moms_line( momentsT* M, double* a, double *b)
{   *a = M->Cxy/M->Vx;
    *b = M->ybar - *a*M->xbar;
    return fmax( 0.0, M->Vy - *a*M->Cxy);
}
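
As a quick Python sanity check of the identity above (that E = Vy - Cxy*Cxy/Vx equals the mean squared residual of the least-squares line; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.7 * x - 0.3 + rng.normal(scale=0.1, size=50)

# population moments, matching the running-moment code above
Vx = x.var()
Vy = y.var()
Cxy = ((x - x.mean()) * (y - y.mean())).mean()

a = Cxy / Vx                                  # slope of the least-squares line
b = y.mean() - a * x.mean()                   # intercept
E_direct = ((y - (a * x + b)) ** 2).mean()    # mean squared residual
E_moments = Vy - Cxy * Cxy / Vx               # closed form from the moments
```

The two error values agree to within floating-point rounding.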
Addlebrained answered 21/6, 2020 at 16:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.