
How to select the longest string from a list of strings in polars?


How do I select the longest string from a list of strings in polars?

Example and expected output:

import polars as pl

df = pl.DataFrame({
    "values": [
        ["the", "quickest", "brown", "fox"],
        ["jumps", "over", "the", "lazy", "dog"],
        []
    ]
})
┌──────────────────────────────┬────────────────┐
│ values                       ┆ longest_string │
│ ---                          ┆ ---            │
│ list[str]                    ┆ str            │
╞══════════════════════════════╪════════════════╡
│ ["the", "quickest", … "fox"] ┆ quickest       │
│ ["jumps", "over", … "dog"]   ┆ jumps          │
│ []                           ┆ null           │
└──────────────────────────────┴────────────────┘

My use case is to select the longest overlapping match.

Edit: to elaborate on the longest overlapping match, this is the output of the example from the polars documentation:

┌────────────┬───────────┬─────────────────────────────────┐
│ values     ┆ matches   ┆ matches_overlapping             │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ list[str] ┆ list[str]                       │
╞════════════╪═══════════╪═════════════════════════════════╡
│ discontent ┆ ["disco"] ┆ ["disco", "onte", "discontent"] │
└────────────┴───────────┴─────────────────────────────────┘

I want a way to select the longest match in matches_overlapping.


Solution

  • You can do something like:

    df.with_columns(
        pl.col('values').list.get(
            pl.col('values')
            .list.eval(pl.element().str.len_chars())
            .list.arg_max()
        )
        .alias('longest_string')
    )
    

    This expression:

    pl.col('values')
    .list.eval(pl.element().str.len_chars())
    .list.arg_max()
    

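    first applies len_chars to every string in each list via .list.eval, producing a list of lengths per row. It then takes arg_max, the index of the largest element, which here is the index of the longest string. For an empty list there is no maximum, so the index is null.

    That index is passed to list.get, which retrieves the string at that position in each list (or null when the index is null).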