Parsing timestamp column with variable timestamp format in a column in pandas
Asked Answered
C

2

1

I want to parse the time from a column of a sub second precision timeseries .csv file but it returns NaT for some timestamps.

A quirk of the dataset is, that every non-full second will be represented in %Y-%m-%d %H:%M:%S.%f while every full second will be in %Y-%m-%d %H:%M:%S

I observed that the format which will be present in the first row will be converted and the other will be NaT

Example:

import pandas as pd

# Example data
timestamps_full_first = [
    "2023-12-30 00:00:00",
    "2023-12-30 00:00:00.1",
    "2023-12-30 00:00:00.9",
    "2023-12-30 00:00:01"
]

timestamps_sub_first = [
    "2023-12-30 00:00:00.1",
    "2023-12-30 00:00:00.9",
    "2023-12-30 00:00:01",
    "2023-12-30 00:00:01.1"
]

# Convert to datetime
datetime_series_full_first = pd.to_datetime(timestamps_full_first, errors='coerce', utc=True)
datetime_series_sub_first = pd.to_datetime(timestamps_sub_first, errors='coerce', utc=True)
print(datetime_series_full_first)
print(datetime_series_sub_first)

Output:

DatetimeIndex(['2023-12-30 00:00:00+00:00', 'NaT', 'NaT',
               '2023-12-30 00:00:01+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2023-12-30 00:00:00.100000+00:00',
               '2023-12-30 00:00:00.900000+00:00',
                                            'NaT',
               '2023-12-30 00:00:01.100000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

Initial Solution

My initial solution was to write a custom parse_date function where I can provide multiple formats as a list.

def parse_date(self, date_str, formats = ["none"]):
            for fmt in formats:
                try:
                    return pd.to_datetime(date_str, format=fmt, utc=True)
                except ValueError:
                    continue
            return pd.NaT  # Return 'Not a Time' for unrecognized formats

Usage:

data[self.timestamp_col] = data[self.timestamp_col].apply(lambda x: self.parse_date(x, formats = self.timestamp_formats))

It works, but it is very very slow compared to the internal parsing of pandas.

Optimized Solution (suggested by GPT-4):

GPT suggested to vectorize the dataframe and then use masking to parse one time and then parse the rows containing NaT a second time with the alternate format, which should improve performance.

import pandas as pd
import numpy as np

def vectorized_parse_date(date_series, formats):
    result_series = pd.Series(np.full(date_series.shape, pd.NaT), index=date_series.index)
    for fmt in formats:
        mask = result_series.isna() & ~date_series.isna()  # Only try to parse where we don't have a result and the date is not NaN
        try:
            result_series[mask] = pd.to_datetime(date_series[mask], format=fmt, errors='raise', utc=True)
        except ValueError:
            continue
    return result_series

# Usage
data[self.timestamp_col] = vectorized_parse_date(data[self.timestamp_col], formats=self.timestamp_formats)

I did not try it yet, since I feel GPT is somewhat biased by my approach and attempts to figure out how to do it inside my artificial constrains.

So maybe some of you see another solution which would use pandas functionality.

Capillarity answered 22/2 at 11:20 Comment(0)
I
2

works fine for me if I set format="ISO8601" (pandas v2 feature) - ISO8601 is the most precise description of your input format I'd say.

datetime_series_full_first = pd.to_datetime(timestamps_full_first, format="ISO8601", 
                                            utc=True, errors='coerce')
datetime_series_sub_first = pd.to_datetime(timestamps_sub_first, format="ISO8601",  
                                           utc=True, errors='coerce')
print(datetime_series_full_first)
print(datetime_series_sub_first)

DatetimeIndex([       '2023-12-30 00:00:00+00:00',
               '2023-12-30 00:00:00.100000+00:00',
               '2023-12-30 00:00:00.900000+00:00',
                      '2023-12-30 00:00:01+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2023-12-30 00:00:00.100000+00:00',
               '2023-12-30 00:00:00.900000+00:00',
                      '2023-12-30 00:00:01+00:00',
               '2023-12-30 00:00:01.100000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
Isobaric answered 22/2 at 11:27 Comment(2)
Thanks. Thats the kind of thing i was looking for! It works. Funny enough gpt will complain that your code does not work because the format does not exist.Capillarity
@SebastianScholz so statistically, format is not used ^^ The gpt can be hilarious.Isobaric
C
1

Simply use format='mixed':

out = pd.to_datetime(timestamps_full_first, format='mixed',
                     errors='coerce', utc=True)

Output:

DatetimeIndex([       '2023-12-30 00:00:00+00:00',
               '2023-12-30 00:00:00.100000+00:00',
               '2023-12-30 00:00:00.900000+00:00',
                      '2023-12-30 00:00:01+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

Or to use explicit formats, convert to Series and fillna:

s = pd.Series(timestamps_full_first)
out = (pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S.%f',
                      errors='coerce', utc=True)
         .fillna(pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S',
                                errors='coerce', utc=True))
      )

Output:

0          2023-12-30 00:00:00+00:00
1   2023-12-30 00:00:00.100000+00:00
2   2023-12-30 00:00:00.900000+00:00
3          2023-12-30 00:00:01+00:00
dtype: datetime64[ns, UTC]
Carvajal answered 22/2 at 11:25 Comment(1)
for the more general "Parsing timestamp column with variable timestamp format in a column in pandas", format='mixed' is probably the way to goIsobaric

© 2022 - 2024 — McMap. All rights reserved.