How to select the longest string from a list of strings in polars?

How do I select the longest string from a list of strings in polars?

Example and expected output:

import polars as pl

df = pl.DataFrame({
    "values": [
        ["the", "quickest", "brown", "fox"],
        ["jumps", "over", "the", "lazy", "dog"],
        []
    ]
})

┌──────────────────────────────┬────────────────┐
│ values                       ┆ longest_string │
│ ---                          ┆ ---            │
│ list[str]                    ┆ str            │
╞══════════════════════════════╪════════════════╡
│ ["the", "quickest", … "fox"] ┆ quickest       │
│ ["jumps", "over", … "dog"]   ┆ jumps          │
│ []                           ┆ null           │
└──────────────────────────────┴────────────────┘

My use case is to select the longest overlapping match.

Edit: to elaborate on the longest overlapping match, this is the output of the example in the polars documentation for str.extract_many:

┌────────────┬───────────┬─────────────────────────────────┐
│ values     ┆ matches   ┆ matches_overlapping             │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ list[str] ┆ list[str]                       │
╞════════════╪═══════════╪═════════════════════════════════╡
│ discontent ┆ ["disco"] ┆ ["disco", "onte", "discontent"] │
└────────────┴───────────┴─────────────────────────────────┘

I want a way to select the longest match in matches_overlapping.

Solution

You can do something like:

df.with_columns(
    pl.col('values').list.get(
        pl.col('values')
        .list.eval(pl.element().str.len_chars())
        .list.arg_max()
    )
    .alias('longest_string')
)

This expression:

pl.col('values')
.list.eval(pl.element().str.len_chars())
.list.arg_max()

first applies str.len_chars to every string in each list via .list.eval, producing one list of lengths per row. Then .list.arg_max returns the index of the largest element of that list, which here is the index of the longest string.
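As a mental model only, the same two steps can be sketched in plain Python. The helper name longest_index is hypothetical, and returning None for an empty list is an assumption chosen to mirror the null in the expected output; this is not how polars implements it.

```python
from typing import Optional


def longest_index(strings: list[str]) -> Optional[int]:
    """Return the index of the longest string, or None for an empty list."""
    if not strings:
        return None
    # Analogue of .list.eval(pl.element().str.len_chars())
    lengths = [len(s) for s in strings]
    # Analogue of .list.arg_max(): position of the largest length
    return lengths.index(max(lengths))


rows = [
    ["the", "quickest", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
    [],
]

indices = [longest_index(r) for r in rows]
# Analogue of .list.get(index): pick the element at that position
longest = [r[i] if i is not None else None for r, i in zip(rows, indices)]
print(indices)
print(longest)
```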

That index is then passed to list.get, which retrieves the string at that position in each row's list.