Storage of timeseries data in python
Asked Answered
U

2

6

I've a amazon price data for around 8.5k products from a period of Feb. 1, 2015 to Oct. 31, 2015. Currently, it is in the form of a dictionary with key as number of days from base date and value as the new price starting that day. Example, here the price is $10 from day 1 and changes to $15 on 45th day, and then changes to $9 on 173rd day and doesn't change after that.

{1:10,
 45:15,
 .
 .
 .
 173:9}

What is the best way to store such a timeseries for easy manipulation using python? I would like to perform a lot of aggregations and also would be querying a price on a particular date. Lastly, I would be performing some fixed effect regressions and am confused what would be best way to store this timeseries, so that my programming job becomes comparatively simpler. I could possibly store as table with 273 columns (each for a day) and rows corresponding to 8.5k products. I've been looking at pandas module which can help me do this, but is there a better way? Thanks!

Unawares answered 9/11, 2015 at 15:40 Comment(5)
pandas combined with numpy would probably be a good option.Campinas
@Campinas I was of the opinion that pandas itself heavily relies on numpy by default, but is there a specific use of numpy that you see here? Thanks!Unawares
Pandas does use numpy a lot and no I do not but that's because I don't really know what you will be doing with your data. I just assumed you would be doing lots of computations and numpy is optimized for that kind of stuff.Campinas
@harshal give some details about products name listKanzu
@Kanzu The products name list is just an id for each product, so 8.5k different ids.Unawares
C
5

You could use a dict of dicts and convert that into a pandas dataframe and also use numpy to do calculations. Your first key would be product and the inner dict would be the one you already have but it won't print in the format you suggested but all you would need to do is transpose it so for a quick example

import pandas as pd

d = {'Product1': {1:10, 45:15, 173:9}, 'Product2': {1:11, 100:50, 173:10}}
df = pd.DataFrame(d).T
print df

          1    45   100  173
Product1   10   15  NaN    9
Product2   11  NaN   50   10
Campinas answered 9/11, 2015 at 15:50 Comment(2)
Yes, this is what I meant by the table, just that this is going to be an extremely sparse table, since it's not the case that price of every product changes on 45th day, all product price changes are independent, and I expect this sort of dataframe to end up with 273 columns (total number of days) anyways. But Thanks nonethless!Unawares
@Unawares I updated my answer to show that it can deal with missing data.Campinas
K
1

8.5k products and 270+ days I would recommend dataframe this way,

price_dic = {1: 10, 2: 11, 3: 12, 5: 15}

df = pd.DataFrame({'days': pd.Series(price_dic.keys(),index=range(len(price_dic))),'price': pd.Series(price_dic.values(),index=range(len(price_dic)))})

df['prod_name'] = "Knote"

df
Out[80]: 
   days  price prod_name
0     1     10     Knote
1     2     11     Knote
2     3     12     Knote
3     5     15     Knote

df['Date'] = pd.to_datetime("Feb. 1, 2015") + pd.to_timedelta(df.days,'D')

df
Out[82]: 
   days  price prod_name       Date
0     1     10     Knote 2015-02-02
1     2     11     Knote 2015-02-03
2     3     12     Knote 2015-02-04
3     5     15     Knote 2015-02-06

Update:

Treversing list and getting final Dataframe with all content,

Lets say you have prod list, price list and start date list like below, we could do,

product_list = [1001,1002,1003]

y_dict = [{1: 10, 2: 11, 3: 12, 5: 15},
            {1: 10, 3: 11, 6: 12, 8: 15},
            {1: 90, 2: 100, 7: 120, 9: 100}]

start_dt_list = ['Feb 05 2015','Feb 01 2015','Feb 06 2015']

fdf = pd.DataFrame(columns =['P_ID','Date','Price','Days'])

Out[73]: 
Empty DataFrame
Columns: [P_ID, Date, Price, Days]
Index: []

for pid,j ,st_dt in zip(product_list, y_dict,start_dt_list):
    df = pd.DataFrame({'P_ID' : pd.Series([pid]*len(j)) ,
                    'Date' : pd.Series([pd.to_datetime(st_dt)]*len(j)),
                    'Price': pd.Series(j.values(),index=range(len(j))),
                    'Days': pd.Series(j.keys(),index=range(len(j)))
                    })
    fdf = fdf.append(df,ignore_index=True)


fdf.head(2)
Out[75]: 
        Date  Days  P_ID  Price
0 2015-02-05     1  1001     10
1 2015-02-05     2  1001     11

fdf['Date'] = fdf['Date'] + pd.to_timedelta(fdf.Days,'D')

fdf
Out[77]: 
         Date  Days  P_ID  Price
0  2015-02-06     1  1001     10
1  2015-02-07     2  1001     11
2  2015-02-08     3  1001     12
3  2015-02-10     5  1001     15
4  2015-02-09     8  1002     15
5  2015-02-02     1  1002     10
6  2015-02-04     3  1002     11
7  2015-02-07     6  1002     12
8  2015-02-07     1  1003     90
9  2015-02-08     2  1003    100
10 2015-02-15     9  1003    100
11 2015-02-13     7  1003    120
Kanzu answered 9/11, 2015 at 19:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.