Is there an easy way in Python to extrapolate data points to the future?

4

11

I have a simple numpy array with one data point per date, something like this:

>>> import numpy as np
>>> from datetime import date
>>> x = np.array([(date(2008, 3, 5), 4800), (date(2008, 3, 15), 4000),
...               (date(2008, 3, 20), 3500), (date(2008, 4, 5), 3000)])

Is there an easy way to extrapolate data points to the future: date(2008,5,1), date(2008, 5, 20) etc.? I understand it can be done with mathematical algorithms, but here I am looking for low-hanging fruit. I like what numpy.linalg.solve does, but it does not look applicable to extrapolation. Maybe I am absolutely wrong.

To be more specific, I am building a burn-down chart (an XP term), with x = date and y = volume of work remaining. I have the data for the sprints already completed, and I want to visualise how the future sprints will go if the current trend persists, and ultimately to predict the release date. By its nature, the volume of work remaining always goes down on a burn-down chart. I also want the extrapolated release date: the date when the volume reaches zero.

This is all for showing the dev team how things are going. Precision is not so important here :) The motivation of the dev team is the main factor, so I am absolutely fine with a very approximate extrapolation technique.

Shadshadberry answered 21/10, 2009 at 9:42 Comment(3)
When you googled for "statistics python" what did you find? Any questions on any of the statistical packages you found? - Bilbo
It is hard to talk about any extrapolation without knowing the nature of the data in question. The above, as far as one can see, could be anything (not excluding random values), so to talk about any practical approach would be just speculating. Refine the question. - Contusion
You are absolutely right! Refined. - Shadshadberry
C
18

It's all too easy for extrapolation to generate garbage; try this. Many different extrapolations are of course possible; some produce obvious garbage, some non-obvious garbage, many are ill-defined.

""" extrapolate y,m,d data with scipy UnivariateSpline """
import numpy as np
from scipy.interpolate import UnivariateSpline
# pydoc scipy.interpolate.UnivariateSpline -- fitpack, unclear
from datetime import date
import matplotlib.pyplot as plt

__version__ = "denis 23oct"


def daynumber( y,m,d ):
    """ 2005,1,1 -> 0  2006,1,1 -> 365 ... """
    return date( y,m,d ).toordinal() - date( 2005,1,1 ).toordinal()

# known data: (day number, value) pairs, transposed into two arrays
days, values = np.array([
    (daynumber(2005,1,1), 1.2 ),
    (daynumber(2005,4,1), 1.8 ),
    (daynumber(2005,9,1), 5.3 ),
    (daynumber(2005,10,1), 5.3 )
    ]).T
# first day of every month in 2005 and 2006, i.e. well past the last data point
dayswanted = np.array([ daynumber( year, month, 1 )
        for year in range( 2005, 2006+1 )
        for month in range( 1, 12+1 )])

np.set_printoptions( precision=1 )  # .1f
print( "days:", days )
print( "values:", values )
print( "dayswanted:", dayswanted )

plt.title( "extrapolation with scipy.interpolate.UnivariateSpline" )
plt.plot( days, values, "o" )
for k in (1,2,3):  # line parabola cubicspline
    extrapolator = UnivariateSpline( days, values, k=k )
    y = extrapolator( dayswanted )
    label = "k=%d" % k
    print( label, y )
    plt.plot( dayswanted, y, label=label )

plt.legend( loc="lower left" )
plt.grid( True )
plt.savefig( "extrapolate-UnivariateSpline.png", dpi=50 )
plt.show()

Added: a Scipy ticket says, "The behavior of the FITPACK classes in scipy.interpolate is much more complex than the docs would lead one to believe" -- imho true of other software doc too.

Cuvette answered 23/10, 2009 at 15:15 Comment(1)
Interpolating is not extrapolating, and the other way around. - Kicksorter
S
4

A simple way of doing extrapolations is to use interpolating polynomials or splines: there are many routines for this in scipy.interpolate, and they are quite easy to use (just give the (x, y) points, and you get a function [a callable, to be precise]).

Now, as has been pointed out in this thread, you cannot expect the extrapolation to always be meaningful (especially when you are far from your data points) if you don't have a model for your data. However, I encourage you to play with the polynomial or spline interpolations from scipy.interpolate to see whether the results you obtain suit you.
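As a rough illustration of the "give the points, get a callable" idea, here is a minimal sketch using scipy.interpolate.InterpolatedUnivariateSpline with k=1 on the data from the question. The choice of this particular class, the conversion of dates to day offsets, and the variable names are assumptions made for the example, not something the answer prescribes. With k=1 the callable simply continues the last line segment beyond the data, so treat the numbers as indicative only.

```python
from datetime import date

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# Data from the question: (date, volume of work remaining).
points = [(date(2008, 3, 5), 4800), (date(2008, 3, 15), 4000),
          (date(2008, 3, 20), 3500), (date(2008, 4, 5), 3000)]

# Splines need numbers, so express each date as days since the first point.
start = points[0][0]
days = np.array([(d - start).days for d, _ in points], dtype=float)
work = np.array([w for _, w in points], dtype=float)

# k=1 is a piecewise-linear spline through every point. Outside the data
# range the default ext=0 extrapolates by continuing the end segment.
line = InterpolatedUnivariateSpline(days, work, k=1)

future = [date(2008, 5, 1), date(2008, 5, 20)]
future_days = np.array([(d - start).days for d in future], dtype=float)
predicted = line(future_days)

for d, w in zip(future, predicted):
    print(d, round(float(w), 1))
```

Raising k to 2 or 3 gives a parabola or cubic through the same points, which tends to diverge much faster outside the data range.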

Side answered 21/10, 2009 at 13:4 Comment(0)
P
3

Mathematical models are the way to go in this case. For instance, if you have only three data points, you have absolutely no indication of how the trend will unfold (it could be either of two parabolas).

Take some statistics courses and try to implement the algorithms. Try Wikibooks.

Parapsychology answered 21/10, 2009 at 9:47 Comment(1)
Absolutely agree, and I do understand that. To clarify: I am just checking whether, by some chance, there is already a numpy.extrapolate function with a "choose extrapolation method" argument :) That's why I call it "low-hanging fruit". - Shadshadberry
S
1

You have to specify which kind of function you want to extrapolate with. Then you can use regression http://en.wikipedia.org/wiki/Regression_analysis to find the parameters of that function, and extrapolate it into the future.

For instance: translate the dates into x values, using the first day as x=0. For your problem the values should be approximately (0,1.2), (400,1.8), (900,5.3).

Now you decide that these points lie on a function of the form a+bx+cx^2.

Use the method of least squares to find a, b and c http://en.wikipedia.org/wiki/Linear_least_squares (I will provide full source later, because I do not have time for this right now).
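The promised source was never posted, so here is a minimal sketch of the idea using numpy.polyfit, which performs a polynomial least-squares fit. It uses the three approximate points given above. The evaluation point x_future = 1200 is a hypothetical value chosen only for illustration. Note that with exactly three points and three parameters the parabola passes through every point; with more data points the same call gives a genuine least-squares fit.

```python
import numpy as np

# Approximate points from the answer: x = days since the first date.
x = np.array([0.0, 400.0, 900.0])
y = np.array([1.2, 1.8, 5.3])

# Fit y = a + b*x + c*x**2 by least squares.
# polyfit returns coefficients from the highest power down: c, b, a.
c, b, a = np.polyfit(x, y, deg=2)
print("a=%.4g  b=%.4g  c=%.4g" % (a, b, c))

# Extrapolate to a hypothetical future day by evaluating the polynomial.
x_future = 1200.0
y_future = np.polyval([c, b, a], x_future)
print("predicted value at x=%g: %.2f" % (x_future, y_future))
```

For the burn-down chart in the question, deg=1 (a straight line) is likely the safer choice, since a parabola can turn upwards again outside the data range.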

Scudo answered 21/10, 2009 at 10:39 Comment(0)
