python datetime timestamp strftime python-polars

How to handle timestamps from summer and winter time when converting strings in polars


I'm trying to convert the string timestamps my camera puts in its RAW file metadata to polars datetimes, but polars throws this error when I have timestamps from both summer time and winter time:

ComputeError: Different timezones found during 'strptime' operation.

How do I persuade it to convert these successfully? (ideally handling different timezones as well as the change from summer to winter time)

And then how do I convert these timestamps back to the proper local clock time for display?

Note that while the timestamp strings only show the UTC offset, the metadata also has an EXIF field "Time Zone City", as well as fields with just the local (naive) timestamp.

import polars as plr

testdata=[
    {'name': 'BST 11:06', 'ts': '2022:06:27 11:06:12.16+01:00'},
    {'name': 'GMT 7:06', 'ts': '2022:12:27 12:06:12.16+00:00'},
]

pdf = plr.DataFrame(testdata)
pdfts = pdf.with_column(plr.col('ts').str.strptime(plr.Datetime, fmt = "%Y:%m:%d %H:%M:%S.%f%z"))

print(pdf)
print(pdfts)

It looks like I need to use tz_convert, but I cannot see how to add it to the conversion expression, and what looks like the relevant doc page just returns a 404 (broken link to dt_namespace).


Solution

  • polars 0.16 update

    Since PR 6496 was merged, you can parse mixed offsets to UTC, then convert to the desired time zone:

    import polars as pl
    
    pdf = pl.DataFrame([
        {'name': 'BST 11:06', 'ts': '2022:06:27 11:06:12.16+01:00'},
        {'name': 'GMT 7:06', 'ts': '2022:12:27 12:06:12.16+00:00'},
    ])
    
    pdfts = pdf.with_columns(
        pl.col('ts').str.to_datetime("%Y:%m:%d %H:%M:%S%.f%z")
        .dt.convert_time_zone("Europe/London")
    )
    
    print(pdfts)
    shape: (2, 2)
    ┌───────────┬─────────────────────────────┐
    │ name      ┆ ts                          │
    │ ---       ┆ ---                         │
    │ str       ┆ datetime[μs, Europe/London] │
    ╞═══════════╪═════════════════════════════╡
    │ BST 11:06 ┆ 2022-06-27 11:06:12.160 BST │
    │ GMT 7:06  ┆ 2022-12-27 12:06:12.160 GMT │
    └───────────┴─────────────────────────────┘
    

    old version:

    Here's a work-around you could use: remove the UTC offset and localize to a pre-defined time zone. Note: the result will only be correct if UTC offsets and time zone agree.

    timezone = "Europe/London"
    
    pdfts = pdf.with_column(
        plr.col('ts')
        # strip the trailing UTC offset, e.g. "+01:00"
        .str.replace("[+-][0-9]{2}:[0-9]{2}$", "")
        .str.strptime(plr.Datetime, fmt="%Y:%m:%d %H:%M:%S%.f")
        .dt.tz_localize(timezone)
    )
    
    print(pdf)
    ┌───────────┬──────────────────────────────┐
    │ name      ┆ ts                           │
    │ ---       ┆ ---                          │
    │ str       ┆ str                          │
    ╞═══════════╪══════════════════════════════╡
    │ BST 11:06 ┆ 2022:06:27 11:06:12.16+01:00 │
    │ GMT 7:06  ┆ 2022:12:27 12:06:12.16+00:00 │
    └───────────┴──────────────────────────────┘
    print(pdfts)
    ┌───────────┬─────────────────────────────┐
    │ name      ┆ ts                          │
    │ ---       ┆ ---                         │
    │ str       ┆ datetime[ns, Europe/London] │
    ╞═══════════╪═════════════════════════════╡
    │ BST 11:06 ┆ 2022-06-27 11:06:12.160 BST │
    │ GMT 7:06  ┆ 2022-12-27 12:06:12.160 GMT │
    └───────────┴─────────────────────────────┘
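
The caveat on the work-around (the result is only correct if the UTC offsets and the time zone agree) can be checked before discarding the offsets. This is a rough sketch using only the standard library, assuming Python 3.9+ (for `zoneinfo`) and time zone data available on the system. The helper name `offset_matches_zone` is made up for illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

FMT = "%Y:%m:%d %H:%M:%S.%f%z"  # format of the camera timestamps in the question


def offset_matches_zone(ts: str, zone: str) -> bool:
    """Return True if the UTC offset in `ts` equals the zone's offset at that wall time."""
    aware = datetime.strptime(ts, FMT)
    # keep the wall-clock time, but let the zone's rules decide the offset
    in_zone = aware.replace(tzinfo=ZoneInfo(zone))
    return aware.utcoffset() == in_zone.utcoffset()


# summer timestamp with the BST offset: consistent with Europe/London
print(offset_matches_zone("2022:06:27 11:06:12.16+01:00", "Europe/London"))  # True
# same wall time with a +00:00 offset: not consistent with Europe/London in June
print(offset_matches_zone("2022:06:27 11:06:12.16+00:00", "Europe/London"))  # False
```

If any row fails this check, localizing the stripped timestamps to a single zone would silently shift that row's instant.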
    

    Side note: to be fair, pandas does not handle mixed UTC offsets either, unless you parse to UTC straight away (keyword utc=True in pd.to_datetime). With mixed UTC offsets, it falls back to a Series of native Python datetime objects (object dtype), which makes much of the pandas time series functionality, such as the dt accessor, unavailable.
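
The pandas route mentioned in the side note can be sketched as follows. It assumes pandas is installed and reuses the two sample strings from the question. The variable names are illustrative:

```python
import pandas as pd

s = pd.Series([
    "2022:06:27 11:06:12.16+01:00",
    "2022:12:27 12:06:12.16+00:00",
])

# utc=True normalizes the mixed offsets to one tz-aware datetime dtype (UTC)
utc = pd.to_datetime(s, format="%Y:%m:%d %H:%M:%S.%f%z", utc=True)

# with a proper datetime dtype, the dt accessor is available for conversion
local = utc.dt.tz_convert("Europe/London")
print(local)
```

This mirrors the polars approach: parse everything to UTC first, then convert to the zone you want to display.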