dataframepython-polars

Polars native API instead of the slow "map_elements"


I have a dataframe with fields age, diagnosis, etc that I'm reading from a csv file. I need to add a column code (a disease code) that depends on diagnosis and/or age. I am currently using the map_elements method:

disease_codes = {
"malaria": {"under_5": "a", "over_5": "b"},
"PUD": "c",
"Asthma": "d",
}
# The callable to return the appropriate code
def return_code(row):
    diagnosis = row["diagnosis"]
    age = row["age"]
    dx_return = disease_codes.get(diagnosis, "Undefined")
    if type(dx_return) == dict:
        if age >= 5:
            return dx_return.get("over_5")
        return dx_return.get("under_5")
    return dx_return
df.with_columns(pl.struct(["diagnosis", "age"]).map_elements(return_code).alias("code"))

This is working as expected. But because this method uses a for loop, there is a warning in the docs:

This method is much slower than native expressions API. Only use it if you cannot implement your logic otherwise.

I have been trying to find a way to achieve it with the native API and I thought map_batches came close:

def return_code(diagnosis, age):
    dx_return = disease_codes.get(diagnosis, "Undefined")
    if type(dx_return) == dict:
        if age >= 5:
            return dx_return.get("over_5")
        return dx_return.get("under_5")
    return dx_return

df.with_columns(
    (pl.struct(["diagnosis", "age"]).map_batches(
        lambda x: return_code(x.struct.field("diagnosis"), x.struct.field("age"))
    )).alias("code")
)

Apparently, I can't do that as I get an error:

TypeError: cannot use `__getitem__` on Series of dtype Struct([Field('diagnosis', Utf8), Field('age', Int64)]) with argument 'diagnosis' of type 'str'

Looking at the documentation, this is by far the only method I thought would do the trick. Am I stuck with the slower approach? Is there something I am not doing right? What would you suggest?

The expected output, which I am getting with the working approach, should be: enter image description here


Solution

  • Method 1. You could use a when-then-otherwise expression as follows:

    import polars as pl
        
    df = pl.DataFrame({
        "age": [2, 37, 53, 1, 52],
        "diagnosis": ["Malaria", "PUD", "Malaria", "PUD", "Other"],
    })
    
    (
        df
        .with_columns(
            pl.when(pl.col("diagnosis") == "Malaria").then(
                pl.when(pl.col("age") < 5).then(pl.lit("a")).otherwise(pl.lit("b"))
            ).otherwise(
                pl.when(pl.col("diagnosis") == "PUD").then(pl.lit("c")).otherwise(pl.lit("d"))
            ).alias("disease_code"),
        )
    )
    

    If there are many conditions depending only on a single column (e.g. diagnosis) and few depending on another column (e.g. age), it might be reasonable to first start with pl.map_dict. Subsequently, you could handle the age-dependant special cases with the when-then-otherwise expressions as outlined above.


    Method 2. Alternatively, you could create (using any method) a table of disease codes, add the necessary columns to df, and merge the tables.

    This could look look as follows:

    disease_codes = pl.DataFrame({
        "diagnosis": ["Malaria"] * 2 + ["PUD"] * 2 + ["Other"] * 2,
        "le_5": [True, False] * 3,
        "code": ["a", "b", "c", "c", "d", "d"]
    })
        
    (
        df
        .with_columns(le_5=pl.col("age") < 5)
        .join(disease_codes, on=["diagnosis", "le_5"], how="left")
        .drop("le_5")
    )
    

    The output will be identical the one shown above.