Pandas read_csv with different date parsers

Asked 30/4, 2018 at 10:36 Answered 2/5, 2018 at 6:48

Solved python pandas dataframe panel-data

I have a csv-file with time series data, the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format '%H:%M:%S'. I would like to import this csv-file into a multiindex dataframe or panel object.

With this code, it already works:

    _file_data = pd.read_csv(_file,
                         sep=",",
                         header=0,
                         index_col=['Date', 'Time'],
                         thousands="'",
                         parse_dates=True,
                         skipinitialspace=True
                         )

It returns the data in the following format:

Date         Time                   Volume
2016-01-04   2018-04-25 09:01:29    53645
             2018-04-25 10:01:29    123
             2018-04-25 10:01:29    1345
             ....
2016-01-05   2018-04-25 10:01:29    123
             2018-04-25 12:01:29    213
             2018-04-25 10:01:29    123

1st question: I would like to show the second index level as a pure time object, not a datetime. To do that, I apparently have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?

2nd question: After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?

Costard answered 30/4, 2018 at 10:36 Comment(1)

For your 1st question see this – Quan 30/4, 2018 at 10:46

1st question:

You can define one converter function per column and pass them to read_csv as a dictionary via the converters argument:

import pandas as pd

temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""

def converter1(x):
    #convert to datetime and then to times
    return pd.to_datetime(x).time()

def converter2(x):
    #define format of datetime
    return pd.to_datetime(x, format='%Y:%m:%d')

#pd.compat.StringIO is gone from newer pandas releases, io.StringIO does the same job
#after testing replace 'io.StringIO(temp)' with 'filename.csv'
import io
df = pd.read_csv(io.StringIO(temp),
                 index_col=['Date','Time'], 
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter1, 'Date': converter2})

print (df)
                     Volume
Date       Time            
2016-01-04 09:00:00   53645
           09:20:00       0
           09:40:00       0
           10:00:00    1468
2016-01-05 10:00:00     246
           10:20:00       0
           10:40:00       0
           11:00:00       0
           11:20:00       0
           11:40:00       0
           12:00:00     213
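As an alternative to per-value converters, the same result can be sketched by reading both columns as plain strings and parsing them in bulk afterwards. This is only a sketch and not part of the original answer: `csv_text` is a made-up three-row sample in the question's format, and the vectorized `pd.to_datetime` calls with an explicit `format` are a swapped-in technique that tends to be faster than calling a converter once per cell.

```python
import io

import pandas as pd

# Hypothetical sample in the question's layout (Date as %Y:%m:%d, Time as %H:%M:%S)
csv_text = """Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:05,10:00:00,246
"""

# Read everything without date parsing; Date and Time arrive as strings
df = pd.read_csv(io.StringIO(csv_text))

# Parse each column with its own explicit format
df["Date"] = pd.to_datetime(df["Date"], format="%Y:%m:%d")
# .dt.time drops the dummy date and keeps datetime.time objects
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S").dt.time

# Build the (Date, Time) MultiIndex after parsing
df = df.set_index(["Date", "Time"])
print(df)
```

The first index level is then a datetime level and the second holds plain `datetime.time` objects, which is the layout the question asks for.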

Sometimes the built-in parser is enough, e.g. if the dates are in the format YYYY-MM-DD. Then only the Time column needs a converter:

import pandas as pd

temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""

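def converter(x):
    #parse the time string and keep only the time-of-day part
    return pd.to_datetime(x).time()

#pd.compat.StringIO is gone from newer pandas releases, io.StringIO does the same job
#after testing replace 'io.StringIO(temp)' with 'filename.csv'
import io
df = pd.read_csv(io.StringIO(temp),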
                 index_col=['Date','Time'], 
                 parse_dates=['Date'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter})

print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
               '2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
               '2016-01-05', '2016-01-05', '2016-01-05'],
              dtype='datetime64[ns]', name='Date', freq=None)

The last possible solution is to convert the datetimes to times in the MultiIndex with set_levels, after the file has been read with both levels parsed as datetimes:

#set_levels expects the unique level values, so use index.levels[1] rather than get_level_values(1)
df.index = df.index.set_levels(df.index.levels[1].time, level=1)
print (df)
                     Volume
Date       Time            
2016-01-04 09:00:00   53645
           09:20:00       0
           09:40:00       0
           10:00:00    1468
2016-01-05 10:00:00     246
           10:20:00       0
           10:40:00       0
           11:00:00       0
           11:20:00       0
           11:40:00       0
           12:00:00     213

2nd question:

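Panel has been deprecated since pandas 0.20.0 and was removed in 0.25.0, so I would not recommend it. The deprecation note suggests using a DataFrame with a MultiIndex (as above) or the xarray package instead.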
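For data shaped like the question's, a DataFrame with a (Date, Time) MultiIndex seems to cover the usual Panel-style operations. The following is a minimal sketch under assumptions: the four rows are hand-picked from the sample data above, and the variable names (`one_day`, `same_time`, `daily_total`, `wide`) are illustrative only.

```python
import datetime

import pandas as pd

# Small hand-built frame with the same index layout as in the answer above
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2016-01-04"), datetime.time(9, 0)),
        (pd.Timestamp("2016-01-04"), datetime.time(10, 0)),
        (pd.Timestamp("2016-01-05"), datetime.time(10, 0)),
        (pd.Timestamp("2016-01-05"), datetime.time(12, 0)),
    ],
    names=["Date", "Time"],
)
df = pd.DataFrame({"Volume": [53645, 1468, 246, 213]}, index=index)

# All intraday rows of one day (one "slice" of the former Panel)
one_day = df.xs(pd.Timestamp("2016-01-05"), level="Date")

# The same time of day across all days
same_time = df.xs(datetime.time(10, 0), level="Time")

# Aggregate per day
daily_total = df.groupby(level="Date")["Volume"].sum()

# Wide layout: one row per time, one column per date
wide = df["Volume"].unstack("Date")

print(one_day)
print(same_time)
print(daily_total)
print(wide)
```

Times that do not occur on a given day show up as NaN in the wide layout.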

Weight answered 2/5, 2018 at 6:48 Comment(4)

Unfortunately, only the last (least possible) solution works. With the first two solutions print (df.index.get_level_values(0)) still prints a dateformat. – Costard 2/5, 2018 at 13:43

@Costard - What is your pandas version? – Weight 2/5, 2018 at 13:46

I have pandas 0.22.0 – Costard 2/5, 2018 at 15:3

Actually, it worked out. Maybe I had a typo the first time. Thanks! – Costard 18/5, 2018 at 18:17

To keep only the time of day, stored as a timedelta column, use pd.to_timedelta.

Ex:

import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print(df["Time"])

Output:

0   09:01:29
1   10:01:29
2   10:01:29
Name: Time, dtype: timedelta64[ns]
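If the Time column already holds strings such as 09:00:00, as in the question's csv-file, the detour through strftime is probably not needed. Below is a rough sketch, assuming a made-up two-row sample (`csv_text`) and a hypothetical extra column named `Timestamp`. It also shows one reason to prefer timedeltas over `datetime.time` objects: they can be added to the parsed date.

```python
import io

import pandas as pd

# Hypothetical sample in the question's layout
csv_text = """Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:05,10:00:00,246
"""

df = pd.read_csv(io.StringIO(csv_text))

# Dates need the explicit format because of the ':' separators
df["Date"] = pd.to_datetime(df["Date"], format="%Y:%m:%d")

# 'HH:MM:SS' strings convert directly to timedeltas
df["Time"] = pd.to_timedelta(df["Time"])

# Date plus time-of-day offset gives the full timestamp
df["Timestamp"] = df["Date"] + df["Time"]
print(df)
```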

Corpus answered 30/4, 2018 at 10:46 Comment(2)

Thanks for your reply. I already did this approach and it worked. But, I was looking for a way to transform the data while reading the csv-file. – Costard 30/4, 2018 at 13:52

@Costard - There is some problem? – Weight 2/5, 2018 at 13:43
