I want to parse the time from a column of a sub second precision timeseries .csv
file but it returns NaT
for some timestamps.
A quirk of the dataset is, that every non-full second will be represented in %Y-%m-%d %H:%M:%S.%f
while every full second will be in %Y-%m-%d %H:%M:%S
I observed that the format which will be present in the first row will be converted and the other will be NaT
Example:
import pandas as pd
# Example data
timestamps_full_first = [
"2023-12-30 00:00:00",
"2023-12-30 00:00:00.1",
"2023-12-30 00:00:00.9",
"2023-12-30 00:00:01"
]
timestamps_sub_first = [
"2023-12-30 00:00:00.1",
"2023-12-30 00:00:00.9",
"2023-12-30 00:00:01",
"2023-12-30 00:00:01.1"
]
# Convert to datetime
datetime_series_full_first = pd.to_datetime(timestamps_full_first, errors='coerce', utc=True)
datetime_series_sub_first = pd.to_datetime(timestamps_sub_first, errors='coerce', utc=True)
print(datetime_series_full_first)
print(datetime_series_sub_first)
Output:
DatetimeIndex(['2023-12-30 00:00:00+00:00', 'NaT', 'NaT',
'2023-12-30 00:00:01+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2023-12-30 00:00:00.100000+00:00',
'2023-12-30 00:00:00.900000+00:00',
'NaT',
'2023-12-30 00:00:01.100000+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
Initial Solution
My initial solution was to write a custom parse_date
function where I can provide multiple formats as a list.
def parse_date(self, date_str, formats = ["none"]):
for fmt in formats:
try:
return pd.to_datetime(date_str, format=fmt, utc=True)
except ValueError:
continue
return pd.NaT # Return 'Not a Time' for unrecognized formats
Usage:
data[self.timestamp_col] = data[self.timestamp_col].apply(lambda x: self.parse_date(x, formats = self.timestamp_formats))
It works, but it is very very slow compared to the internal parsing of pandas.
Optimized Solution (suggested by GPT-4):
GPT suggested to vectorize the dataframe and then use masking to parse one time and then parse the rows containing NaT
a second time with the alternate format, which should improve performance.
import pandas as pd
import numpy as np
def vectorized_parse_date(date_series, formats):
result_series = pd.Series(np.full(date_series.shape, pd.NaT), index=date_series.index)
for fmt in formats:
mask = result_series.isna() & ~date_series.isna() # Only try to parse where we don't have a result and the date is not NaN
try:
result_series[mask] = pd.to_datetime(date_series[mask], format=fmt, errors='raise', utc=True)
except ValueError:
continue
return result_series
# Usage
data[self.timestamp_col] = vectorized_parse_date(data[self.timestamp_col], formats=self.timestamp_formats)
I did not try it yet, since I feel GPT is somewhat biased by my approach and attempts to figure out how to do it inside my artificial constrains.
So maybe some of you see another solution which would use pandas functionality.
format
does not exist. – Capillarity