Convert a column of mixed format strings to a datetime Dtype

When converting a pandas DataFrame column from object to datetime with astype, the result differs depending on whether the strings have a time component. What is the correct way to convert the column?

import pandas as pd

df = pd.DataFrame({'Date': ['12/07/2013 21:50:00','13/07/2013 00:30:00','15/07/2013','11/07/2013']})

df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y %H:%M:%S", exact=False, dayfirst=True, errors='ignore')

Output:

                   Date
0   12/07/2013 21:50:00
1   13/07/2013 00:30:00
2   15/07/2013
3   11/07/2013

but the dtype is still object. When doing:

df['Date'] = df['Date'].astype('datetime64')

the column gets a datetime dtype, but the day and month are swapped on rows 0 and 3.

                   Date
0   2013-12-07 21:50:00
1   2013-07-13 00:30:00
2   2013-07-15 00:00:00
3   2013-11-07 00:00:00

The expected result is:

                   Date
0   2013-07-12 21:50:00
1   2013-07-13 00:30:00
2   2013-07-15 00:00:00
3   2013-07-11 00:00:00
Malodorous answered 15/6, 2019 at 22:14

If we look at the source code, when you pass both format= and dayfirst=, dayfirst= is never read, because passing format= goes through C code (np_datetime_strings.c) that doesn't use dayfirst= to make conversions. On the other hand, if you pass only dayfirst=, it is first used to guess the format, and the conversion falls back on dateutil.parser.parse. So, use only one of them.
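A small sketch of that difference on a single string. The deliberately "wrong" month-first format is only there to show that dayfirst= does not change the result once format= is given; it is an illustration, not something to use on the OP's data.

```python
import pandas as pd

# With format= given, the string is parsed strictly by that format;
# dayfirst=True has no effect on the result.
with_format = pd.to_datetime("12/07/2013", format="%m/%d/%Y", dayfirst=True)

# With only dayfirst=True, the first number is read as the day.
with_dayfirst = pd.to_datetime("12/07/2013", dayfirst=True)

print(with_format)    # month 12, day 7
print(with_dayfirst)  # day 12, month 7
```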


In most cases,

df['Date'] = pd.to_datetime(df['Date'])

does the job.

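In the specific example in the OP, passing dayfirst=True does the job (on pandas versions before 2.0; from 2.0 onward the format is inferred from the first value and applied to every row, so a mixed-format column like this one also needs format='mixed' or the two-step conversion described further down).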

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

That said, passing format= makes the conversion run ~25x faster (see this post for more info), so if your frame has more than about 10k rows, it's better to pass format=. Since the format here is mixed, one way is to perform the conversion in two steps (the errors='coerce' argument is useful here):

  • convert the strings that have a time component; the others become NaT
  • fill in the NaT values (the "coerced" rows) from a Series converted with a different format.

# Keep the first pass in its own variable so the second pass still sees the original strings
parsed = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
df['Date'] = parsed.fillna(pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce'))

This method (performing two or more conversions) can be used to convert any column with "weirdly" formatted datetimes.
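For more than two formats, the same idea can be wrapped in a small helper. This is only a sketch: parse_with_formats is a hypothetical name, not a pandas function, and it assumes every pass should parse the original strings, with earlier formats taking priority.

```python
import pandas as pd

def parse_with_formats(strings, formats):
    """Try each format in order; later formats only fill rows that are still NaT."""
    strings = pd.Series(strings)
    # First pass: rows that do not match the first format become NaT.
    result = pd.to_datetime(strings, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Always parse the original strings, never the partly converted result.
        result = result.fillna(pd.to_datetime(strings, format=fmt, errors="coerce"))
    return result

raw = pd.Series(['12/07/2013 21:50:00', '13/07/2013 00:30:00', '15/07/2013', '11/07/2013'])
parsed = parse_with_formats(raw, ['%d/%m/%Y %H:%M:%S', '%d/%m/%Y'])
print(parsed)
```

Rows that match none of the formats stay NaT, so parsed.isna() shows which values still need attention.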


Since pandas 2.0, format= accepts 'mixed', i.e. pd.to_datetime(dates, format='mixed'), but this is pretty error-prone because the format is inferred for each value individually, so it's probably better to use dayfirst=True or the two-step format= conversion (as done above) instead.
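If you do go with 'mixed', a minimal sketch on the OP's data could look like the following. It assumes pandas 2.0 or newer (older versions do not accept format='mixed') and pairs it with dayfirst=True so that ambiguous values such as 12/07/2013 are read day-first.

```python
import pandas as pd

raw = pd.Series(['12/07/2013 21:50:00', '13/07/2013 00:30:00', '15/07/2013', '11/07/2013'])

# format="mixed" (assumes pandas >= 2.0) infers the format of each element separately;
# dayfirst=True tells the parser to read ambiguous dates as day/month/year.
parsed = pd.to_datetime(raw, format="mixed", dayfirst=True)
print(parsed)
```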

Julianjuliana answered 1/2, 2023 at 1:45