I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sorts of tips, even if no full solution, would be much appreciated. I have tried a number of things, and have gotten quite close, but run into indexing issues, Exception: Reindexing only valid with uniquely valued Index objects
, in particular. [Pandas 0.10.0, Python 2.7]
I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:
ticker_symbol :: String, monthly_return :: Float, date :: Timestamp
e.g.
AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981
NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.
…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)
The below is a DataFrame (of timeseries? How do you say this?), in the format I need:
> data
:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA 2268 non-null values
AAPL 2268 non-null values
GE 2268 non-null values
IBM 2268 non-null values
JNJ 2268 non-null values
KO 2268 non-null values
MSFT 2268 non-null values
PEP 2268 non-null values
SPX 2268 non-null values
XOM 2268 non-null values
dtypes: float64(10)
The below is a TimeSeries, and is in the format I need.
> data.AAPL
:
Date
1993-01-04 00:00:00+00:00 73.00
1993-01-05 00:00:00+00:00 73.12
...
2001-12-28 00:00:00+00:00 36.15
2001-12-31 00:00:00+00:00 35.55
Name: AAPL, Length: 2268
Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo
—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).
(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)
df4
, the columns are named(monthly returns, TICKER)
; what does this mean? This is different than the names Zipline returns (justTICKER
). Additionally, are the columns TimeSeries? (I haven't tested yet…) Finally… any tips on becoming quickly proficient with pandas? Thanks again. – Evergreen