Pandas read_excel doesn't parse dates correctly - returns a constant date instead
Asked Answered
S

1

7

I've read a .xlsb file and parsed date columns using a code below:

dateparser = lambda x: pd.to_datetime(x)

data = pd.read_excel(r"test.xlsb", engine="pyxlsb",
                 parse_dates=["start_date","end_date"],
                 date_parser=dateparser
                 )

My input columns in the .xlsb file have format DD/MM/YYYY (e.g. 26/01/2008). As an output of the above-mentioned code I get, for example: 1970-01-01 00:00:00.000038840. Only the last 5 digits changes.

If I read the same file without parsing dates, the same columns are of float64 type and containing only the last 5 digits of output before (e.g. 38840.0).

I assume this is a problem associated with date encoding itself. Does anyone know how to fix this issue?

Sadesadella answered 13/4, 2020 at 11:17 Comment(0)
W
9

I am not sure if you were able to figure out the answer to this problem. But, below is how I resolved it:

from pyxlsb import convert_date

self.data: pd.DataFrame = pd.read_excel(self.file, sheet_name=self.sheet, engine='pyxlsb', header=0)
self.data["test"] = self.data.apply(lambda x: convert_date(x.SomeStupidDate), axis=1)

More details can be found here: https://pypi.org/project/pyxlsb/ by doing ctrl+F for "convert_date".

Wretched answered 29/5, 2020 at 1:13 Comment(1)
It's a shame that pyxlsb operates this way. I often face use cases where I wouldn't know in advance that a column in a sheet is supposed to be a date, and pyxlsb reading it as a float means having to depend on the column name actually including "date", or some other signifier. Other Excel engines don't have this issue.Grekin

© 2022 - 2025 — McMap. All rights reserved.