pythonjsonpython-polars

polars (python-api): read_json fails to parse date


I want to read in a polars dataframe from a json string containing dates in the standard iso-format "yyyy-mm-dd". When I try to read the string in and set the dtype of the date column witheither schema or schema_override this results in only NULL values.

MRE

from datetime import datetime, timedelta
from io import StringIO

import polars as pl

# Generate a list of dates
start_date = datetime.today()
dates = [start_date + timedelta(days=i) for i in range(100)]
date_strings = [date.strftime("%Y-%m-%d") for date in dates]

# Create a Polars DataFrame
df = pl.DataFrame({"dates": date_strings})

df_reread = pl.read_json(
    StringIO(df.write_json()),
    schema_overrides={"dates": pl.Date},
)

output of print(df_reread)

Error

shape: (100, 1)
┌───────┐
│ dates │
│ ---   │
│ date  │
╞═══════╡
│ null  │
│ null  │
│ null  │
│ null  │
│ null  │
│ …     │
│ null  │
│ null  │
│ null  │
│ null  │
│ null  │
└───────┘

Question

Is there anyway to correctly read in the Date dtype from a json string?


Solution

  • After having a bit of a play around, it looks like unfortunately dates being read from a JSON file have a bit of a quirk. It seems to me that currently they must be written in days since the unix epoch (which is how Polars internally represents dates) for things to work as you expect.

    I have raised this feature request on their github to hopefully get that improved.

    In the mean time,

    df = (
        pl.DataFrame({"dates": "2024-01-01"})
        # add this line below
        .select(pl.col("dates").cast(pl.Date).dt.epoch("d"))
    )
    
    df_reread = pl.read_json(
        df.write_json().encode(),
        schema_overrides={"dates": pl.Date},
    )
    print(df_reread)
    # shape: (1, 1)
    # ┌────────────┐
    # │ dates      │
    # │ ---        │
    # │ date       │
    # ╞════════════╡
    # │ 2024-01-01 │
    # └────────────┘
    

    or do as you say with df_reread.with_columns(pl.col("dates").cast(pl.Date)