linear regression for timeseries python (numpy or pandas)
Asked Answered
L

1

8

I am new to python and programming in general, so forgive any simple mistakes/ things that should be obvious.

What I am trying to do is quite simple, I just want to fit a linear trend (1-d polynomial) to a bunch of time-series to see whether the slopes are positive or negative. Right now I am just trying to get it to work for one time series.

The problem: it seems like both pandas and numpy can't do regressions for datetimes. My date times are not regular (generally 1 day per month but not the same day) so can't use the suggestion posed in Linear Regression from Time Series Pandas

My time series csv looks like:

StationName,    year,   month,  day,    depth,  NO3-N,  PO4-P,  TotP,   TotN,
Kvarnbacken (Savaran),  2003,   2,  25, 0.5,    46, 9,  14, 451
Kvarnbacken (Savaran),  2003,   3,  18, 0.5,    64, 15, 17, 310
Kvarnbacken (Savaran),  2003,   3,  31, 0.5,    76, 7,  19, 566

so far what i have is

import datetime as dt
from scipy import stats
import numpy as np

# read in station csv file
data = pd.read_csv('Kvarnbacken (Savaran)_2003.csv')
data.head()
# set up dates to something python can recognize
data['date'] = pd.to_datetime(data.year*10000+data.month *
                          100+data.day, format='%Y%m%d')

I tried

slope, intercept, r_value, p_value, std_err = stats.linregress(data.date,
                                                               data.TotP)

and got the error TypeError: ufunc add cannot use operands with types dtype('

I also tried

coefP = np.polyfit(data.date, data.TotP, 1)
polyP = np.poly1d(coefP)
ys = polyP(data.date)
print 'For P: coef, poly'
print coefP
print polyP

and got the same error.

I am guessing the easiest way around this is to do something where I just count the days since the first measurement I have and then just do a regression with days_since to the total phosphorous concentration (totP) but I am not sure of the easiest way to do that or if there was another trick.

Legroom answered 1/9, 2015 at 8:58 Comment(3)
What you are doing seems wrong to me as the gap between two days is not constant (exple: between the 31st of Dec and the 1st of Jan). You could say day 0 is your first date. Then substract this first date to every date and convert it in days.Davisdavison
Hugo you are definitely correct. I didn't really think through my quick fix.Legroom
Comment #1 is not a great approach. If you want to do something like that you ought to convert it into some common units. More like 365*year + 30*month + days. Although even that is not ideal because years and months don't have a constant number of days. See answer for a better way to do this.Mickimickie
M
12

You could convert the datetime to days in the following way.

data['days_since'] = (data.date - pd.to_datetime('2003-02-25') ).astype('timedelta64[D]')

        date  days_since
0 2003-02-25           0
1 2003-03-18          21
2 2003-03-31          34

Now you should be able to regress as you did above.

slope, intercept, r_value, p_value, std_err = stats.linregress(data.days_since, 
                                                               data.TotP)
slope, intercept
(0.1466591166477916, 13.977916194790488)

You might also want to consider other regression options such as the statsmodels package, especially if you'll be doing this sort of thing very often. (Note that x and y are reversed compared to linregress)

import statsmodels.formula.api as smf

smf.ols( 'TotP ~ days_since', data=data ).fit().params

Intercept     13.977916
days_since     0.146659

That's just a fraction of the statsmodels output btw (use summary() instead of params to get the extra output.

Mickimickie answered 1/9, 2015 at 15:8 Comment(1)
Thanks. I ended up doing something similar but was stuck with how to get the 'day_since' as an integer, that astype is a nice trick.Legroom

© 2022 - 2024 — McMap. All rights reserved.