Setting the correct format=
is much faster than letting pandas find out1
Long story short, passing the correct format=
from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking like a couple minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
errors='coerce'
is useful
If some rows are not in the correct format or not datetime at all, errors=
parameter is very useful, so that you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(
df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(
pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
To silence SettingWithCopyWarning
On a side note, if you got this warning, then that means your dataframe was probably created by filtering another dataframe. Enable copy-on-write and you're good to go. (see this post for more about it).
pd.set_option('copy_on_write', True)
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
1 Code used to produce the timeit test plot.
import perfplot
from random import choices
from datetime import datetime
mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)
perfplot.show(
kernels=[lambda x: pd.to_datetime(x),
lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
lambda x: pd.to_datetime(x, infer_datetime_format=True),
lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
labels=["pd.to_datetime(df['date'])",
"pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
"pd.to_datetime(df['date'], infer_datetime_format=True)",
"df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
n_range=[2**k for k in range(20)],
setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
equality_check=pd.Series.equals,
xlabel='len(df)'
)
If the column contains multiple formats, see Convert a column of mixed format strings to a datetime Dtype.
format
argument isn't required.to_datetime
is smart. Go ahead and try it without trying to match your data. – Woodrowwoodruff