Pandas internals question: I've been surprised to find a few times that explicitly passing a callable to `date_parser` within `pandas.read_csv()` results in much slower read time than simply using `infer_datetime_format=True`.
Why is this? Is the timing difference between these two options date-format-specific, and what other factors influence their relative timing?
In the case below, `infer_datetime_format=True` takes one-tenth the time of passing a date parser with a specified format. I would have naively assumed the latter would be faster because it's explicit.
The docs do note:

> [If True,] pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
but there's not much detail given and I was unable to work my way fully through the source.
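My rough mental model of that fast path (hedged, since it leans on a private helper whose import path and signature may change between versions): pandas guesses a single strftime format from the first non-null value and, if the guess works, parses the entire column with that one format rather than re-deducing the structure of every element. Roughly:

```
import numpy as np
# Private API (observed around 0.23); not a stable import path
from pandas.core.tools.datetimes import _guess_datetime_format_for_array

arr = np.array(['1980-01-01', '1980-01-02'], dtype=object)
# One guess from the first non-null element, then (presumably)
# applied to the whole column
print(_guess_datetime_format_for_array(arr))  # '%Y-%m-%d'
```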
Setup:
```
from datetime import datetime
from io import StringIO

import numpy as np
import pandas as pd

np.random.seed(444)
dates = pd.date_range('1980', '2018')
df = pd.DataFrame(np.random.randint(0, 100, (len(dates), 2)),
                  index=dates).add_prefix('col').reset_index()

# Something reproducible to be read back in
buf = StringIO()
df.to_string(buf=buf, index=False)

def read_test(**kwargs):
    # Not ideal for .seek() to eat up runtime, but this is alleviated
    # by using more loops than needed in the timing below
    buf.seek(0)
    return pd.read_csv(buf, sep=r'\s+', parse_dates=['index'], **kwargs)
```
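As a quick sanity check (not part of the timings), the round trip parses the `index` column back to datetime64 as expected:

```
>>> read_test().dtypes
index    datetime64[ns]
col0              int64
col1              int64
dtype: object
```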
```
# dateutil.parser.parser called in this case, according to docs
%timeit -r 7 -n 100 read_test()
18.1 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -r 7 -n 100 read_test(infer_datetime_format=True)
19.8 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Doesn't change with native Python datetime.strptime either
# (datetime imported above; pd.datetime is a deprecated alias for it)
%timeit -r 7 -n 100 read_test(date_parser=lambda dt: datetime.strptime(dt, '%Y-%m-%d'))
187 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
I'm interested in knowing a bit about what is going on internally with `infer_datetime_format` to give it this advantage. My old understanding was that there was already some type of inference going on in the first place, because `dateutil.parser.parser` is used if neither option is passed.
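That default path is, as I understand it, a general-purpose element-wise parse, roughly equivalent to the following (illustrative only; the real call goes through pandas' parsing machinery):

```
from dateutil import parser

# No format is known up front, so dateutil re-deduces the structure
# of every single string independently
parser.parse('1980-01-01')  # datetime.datetime(1980, 1, 1, 0, 0)
```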
Update: did some digging on this but haven't been able to answer the question. `read_csv()` calls a helper function which in turn calls `pd.core.tools.datetimes.to_datetime()`. That function (accessible as just `pd.to_datetime()`) has both an `infer_datetime_format` and a `format` argument.
However, in this case, the relative timings are very different and don't reflect the above:
```
s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)

%timeit pd.to_datetime(s, infer_datetime_format=True)
19.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(s, infer_datetime_format=False)
1.01 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This was taking the longest with the i/o functions,
# now it's behaving "as expected"
%timeit pd.to_datetime(s, format='%m/%d/%Y')
19 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
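For completeness, one variant I haven't timed here (so no numbers to report): as far as I can tell from `pandas/io/parsers.py`, `read_csv()` first tries calling `date_parser` on the entire column and only falls back to element-by-element calls if that raises. Since `strptime` raises on an array, the lambda above ends up being called once per string, whereas a vectorized callable should stay on the whole-array path:

```
# Hypothetical variant: a date_parser that accepts the whole column,
# delegating to the vectorized pd.to_datetime with an explicit format.
# I'd expect this to behave like the explicit-format case above.
read_test(date_parser=lambda col: pd.to_datetime(col, format='%Y-%m-%d'))
```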