
Polars equivalent for casting strings like Pandas to_numeric


When applying pandas.to_numeric(), the return dtype is float64 or int64 depending on the data supplied.

Is there an equivalent in Polars?

import pandas as pd
import polars as pl

df = pl.from_repr("""
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ 1    ┆ 3.5  │
│ 2    ┆ 4.6  │
└──────┴──────┘
""")

pl.from_pandas(df.to_pandas().apply(pd.to_numeric))
# shape: (2, 2)
# ┌──────┬──────┐
# │ col1 ┆ col2 │
# │ ---  ┆ ---  │
# │ i64  ┆ f64  │
# ╞══════╪══════╡
# │ 1    ┆ 3.5  │
# │ 2    ┆ 4.6  │
# └──────┴──────┘

Solution

  • Unlike Pandas, Polars is quite picky about datatypes and tends to be rather unaccommodating when it comes to automatic casting. (Among the reasons is performance.)

    You can create a feature request for a to_numeric method, but I'm not sure how enthusiastic the response will be.

    That said, here are some easy ways to accomplish this.

    Create a method

    Perhaps the simplest way is to write a function that attempts the cast to Int64 and, if that raises an exception, falls back to a cast to Float64. For convenience, you can even attach this function to the Series class itself as a method.

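    def to_numeric(s: pl.Series) -> pl.Series:
        # Try the stricter integer cast first; strings such as "3.5" make it raise.
        try:
            result = s.cast(pl.Int64)
        except pl.exceptions.InvalidOperationError:
            # Fall back to float; this still raises for non-numeric strings.
            result = s.cast(pl.Float64)
        return result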
    
    
    pl.Series.to_numeric = to_numeric
    

    Then to use it:

    (
        pl.select(
            s.to_numeric()
            for s in df
        )
    )
    
    shape: (2, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ i64  ┆ f64  │
    ╞══════╪══════╡
    │ 1    ┆ 3.5  │
    │ 2    ┆ 4.6  │
    └──────┴──────┘
    

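    Use the automatic type inference of CSV parsing

    Another method is to write your DataFrame out as CSV text (into a string buffer), and then have read_csv infer the types automatically. Because read_csv infers the schema from only the first rows by default, you may have to tweak the infer_schema_length parameter in some situations, for example when a column looks like integers at the top and contains floats further down.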

    from io import StringIO
    pl.read_csv(StringIO(df.write_csv()))
    
    shape: (2, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ i64  ┆ f64  │
    ╞══════╪══════╡
    │ 1    ┆ 3.5  │
    │ 2    ┆ 4.6  │
    └──────┴──────┘