How to handle timestamps from summer and winter time when converting strings in polars
I'm trying to convert string timestamps to polars datetimes. The timestamps come from the metadata my camera writes into its RAW files, but polars throws this error when I have timestamps from both summer time and winter time.

ComputeError: Different timezones found during 'strptime' operation.

How do I persuade it to convert these successfully? (ideally handling different timezones as well as the change from summer to winter time)

And then how do I convert these timestamps back to the proper local clocktime for display?

Note that while the timestamp strings just show the offset, there is an exif field "Time Zone City" in the metadata as well as fields with just the local (naive) timestamp

import polars as plr

testdata=[
    {'name': 'BST 11:06', 'ts': '2022:06:27 11:06:12.16+01:00'},
    {'name': 'GMT 7:06', 'ts': '2022:12:27 12:06:12.16+00:00'},
]

pdf = plr.DataFrame(testdata)
pdfts = pdf.with_column(plr.col('ts').str.strptime(plr.Datetime, fmt = "%Y:%m:%d %H:%M:%S.%f%z"))

print(pdf)
print(pdfts)

It looks like I need to use tz_convert, but I cannot see how to add it to the conversion expression, and what looks like the relevant doc page just returns a 404 (broken link to dt_namespace).

Orling answered 4/1, 2023 at 10:7 Comment(0)

polars 0.16 update

Since PR 6496 was merged, you can parse mixed offsets to UTC, then set the time zone:

import polars as pl

pdf = pl.DataFrame([
    {'name': 'BST 11:06', 'ts': '2022:06:27 11:06:12.16+01:00'},
    {'name': 'GMT 7:06', 'ts': '2022:12:27 12:06:12.16+00:00'},
])

pdfts = pdf.with_columns(
    pl.col('ts').str.to_datetime("%Y:%m:%d %H:%M:%S%.f%z")
    .dt.convert_time_zone("Europe/London")
)

print(pdfts)
shape: (2, 2)
┌───────────┬─────────────────────────────┐
│ name      ┆ ts                          │
│ ---       ┆ ---                         │
│ str       ┆ datetime[μs, Europe/London] │
╞═══════════╪═════════════════════════════╡
│ BST 11:06 ┆ 2022-06-27 11:06:12.160 BST │
│ GMT 7:06  ┆ 2022-12-27 12:06:12.160 GMT │
└───────────┴─────────────────────────────┘

old version:

Here's a work-around you could use: remove the UTC offset and localize to a pre-defined time zone. Note: the result will only be correct if UTC offsets and time zone agree.

timezone = "Europe/London"

pdfts = pdf.with_column(
    plr.col('ts')
    .str.replace(r"[+-][0-9]{2}:[0-9]{2}$", "")  # strip the trailing UTC offset
    .str.strptime(plr.Datetime, fmt="%Y:%m:%d %H:%M:%S%.f")
    .dt.tz_localize(timezone)
)

print(pdf)
┌───────────┬──────────────────────────────┐
│ name      ┆ ts                           │
│ ---       ┆ ---                          │
│ str       ┆ str                          │
╞═══════════╪══════════════════════════════╡
│ BST 11:06 ┆ 2022:06:27 11:06:12.16+01:00 │
│ GMT 7:06  ┆ 2022:12:27 12:06:12.16+00:00 │
└───────────┴──────────────────────────────┘
print(pdfts)
┌───────────┬─────────────────────────────┐
│ name      ┆ ts                          │
│ ---       ┆ ---                         │
│ str       ┆ datetime[ns, Europe/London] │
╞═══════════╪═════════════════════════════╡
│ BST 11:06 ┆ 2022-06-27 11:06:12.160 BST │
│ GMT 7:06  ┆ 2022-12-27 12:06:12.160 GMT │
└───────────┴─────────────────────────────┘
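To check up front whether the UTC offsets and the time zone agree, a small standard-library sketch along these lines could be used. The helper offset_matches_zone is hypothetical (it is not part of polars), and the sketch assumes Python 3.9+ with time zone data available to zoneinfo.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def offset_matches_zone(text: str, zone: str) -> bool:
    """Return True if the offset written in `text` is the one `zone` uses at that wall time."""
    # %z accepts offsets written as +HH:MM.
    aware = datetime.strptime(text, "%Y:%m:%d %H:%M:%S.%f%z")
    # Attach the named zone to the same wall-clock reading, without shifting it.
    wall_in_zone = aware.replace(tzinfo=ZoneInfo(zone))
    return wall_in_zone.utcoffset() == aware.utcoffset()


print(offset_matches_zone("2022:06:27 11:06:12.16+01:00", "Europe/London"))
print(offset_matches_zone("2022:12:27 12:06:12.16+01:00", "Europe/London"))
```

The first string carries the summer offset London actually uses in June; the second claims +01:00 in December, which does not match Europe/London, so dropping the offset there would silently change the instant.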

Side note: to be fair, pandas does not handle mixed UTC offsets either, unless you parse to UTC straight away (keyword utc=True in pd.to_datetime). With mixed UTC offsets, it falls back to a Series of native Python datetime objects, which makes much of the pandas time series functionality, such as the dt accessor, unavailable.
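For comparison, the utc=True route in pandas might look like this. It is a minimal sketch using the same two sample strings; the variable names are arbitrary.

```python
import pandas as pd

raw = pd.Series([
    "2022:06:27 11:06:12.16+01:00",
    "2022:12:27 12:06:12.16+00:00",
])

# utc=True converts every parsed offset to UTC, giving a proper datetime64 Series.
utc = pd.to_datetime(raw, format="%Y:%m:%d %H:%M:%S.%f%z", utc=True)

# The dt accessor is available, so the values can be viewed in a named time zone.
local = utc.dt.tz_convert("Europe/London")

print(local)
```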

Systemic answered 4/1, 2023 at 11:2 Comment(1)
Ah, that is much better. I'll try that when I get 0.16. - Orling

Similar to FObersteiner's solution, but this parses the offset manually rather than assuming that your camera's offset matches a predefined time zone definition.

The first step is to use a regex extract to separate the offset from the rest of the timestamp. The offset is split into hours (inclusive of the sign) and minutes. Then we strptime the datetime component from the first step as a naive time, subtract the offset, localize the result to UTC, and convert it to the desired time zone (in this case Europe/London). (I load polars as pl, not plr, so adjust as necessary.)

(pdf
.with_columns(
    # Raw strings keep Python from treating \d as an (invalid) string escape.
    [pl.col('ts').str.extract(r"(\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\.\d{2})"),
     pl.col('ts').str.extract(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\.\d{2}((\+|-)\d{2}):\d{2}")
                 .cast(pl.Float64()).alias("offset"),
     pl.col('ts').str.extract(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\.\d{2}(\+|-)\d{2}:(\d{2})", group_index=2)
                 .cast(pl.Float64()).alias("offset_minute")])
.select(
    ['name',
     (pl.col('ts').str.strptime(pl.Datetime(), "%Y:%m:%d %H:%M:%S%.f") - pl.duration(hours=pl.col('offset'), minutes=pl.col('offset_minute')))
                  .dt.tz_localize('UTC').dt.with_time_zone('Europe/London')]))




shape: (2, 3)
┌───────────┬────────┬─────────────────────────────┐
│ name      ┆ offset ┆ dt                          │
│ ---       ┆ ---    ┆ ---                         │
│ str       ┆ f64    ┆ datetime[ns, Europe/London] │
╞═══════════╪════════╪═════════════════════════════╡
│ BST 11:06 ┆ 1.0    ┆ 2022-06-27 11:06:12.160 BST │
│ GMT 7:06  ┆ 0.0    ┆ 2022-12-27 12:06:12.160 GMT │
└───────────┴────────┴─────────────────────────────┘
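One detail to watch with manual parsing: in the snippet above only the hours carry the sign, so an offset such as -03:30 would be combined as -3 h plus 30 min. The sign needs to apply to both parts. A minimal standard-library sketch of the arithmetic, with hypothetical helper names parse_offset and to_utc, assuming the offset always sits at the end of the string as +HH:MM or -HH:MM:

```python
import re
from datetime import datetime, timedelta, timezone


def parse_offset(text: str) -> timedelta:
    """Read a trailing +HH:MM / -HH:MM offset, applying the sign to hours and minutes."""
    match = re.search(r"([+-])(\d{2}):(\d{2})$", text)
    if match is None:
        raise ValueError(f"no UTC offset found in {text!r}")
    sign = -1 if match.group(1) == "-" else 1
    return sign * timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))


def to_utc(text: str) -> datetime:
    """Parse the naive part, subtract the offset, and label the result as UTC."""
    naive = datetime.strptime(text[:-6], "%Y:%m:%d %H:%M:%S.%f")  # drop the 6-char offset
    return (naive - parse_offset(text)).replace(tzinfo=timezone.utc)


print(to_utc("2022:06:27 11:06:12.16+01:00"))
print(parse_offset("2022:06:27 11:06:12.16-03:30"))
```

The same idea carries over to the polars expression: extract the sign separately and multiply both the hour and minute components by it before building the duration.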
Prink answered 4/1, 2023 at 17:23 Comment(1)
Those help, and I prefer FObersteiner's answer just because it is a little more obvious what is going on. I do want to get the original clock time back though, so I think I'll keep the offset in an extra field as well. - Orling
