How do I select the longest string from a list of strings in polars?
Example and expected output:
import polars as pl

df = pl.DataFrame({
    "values": [
        ["the", "quickest", "brown", "fox"],
        ["jumps", "over", "the", "lazy", "dog"],
        [],
    ]
})
┌──────────────────────────────┬────────────────┐
│ values                       ┆ longest_string │
│ ---                          ┆ ---            │
│ list[str]                    ┆ str            │
╞══════════════════════════════╪════════════════╡
│ ["the", "quickest", … "fox"] ┆ quickest       │
│ ["jumps", "over", … "dog"]   ┆ jumps          │
│ []                           ┆ null           │
└──────────────────────────────┴────────────────┘
My use case is to select the longest overlapping match.
Edit: elaborating on the longest overlapping match, this is the output for the example provided by polars:
┌────────────┬───────────┬─────────────────────────────────┐
│ values     ┆ matches   ┆ matches_overlapping             │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ list[str] ┆ list[str]                       │
╞════════════╪═══════════╪═════════════════════════════════╡
│ discontent ┆ ["disco"] ┆ ["disco", "onte", "discontent"] │
└────────────┴───────────┴─────────────────────────────────┘
I would like a way to select the longest match in matches_overlapping.
You can do something like:
df.with_columns(
    pl.col('values').list.get(
        pl.col('values')
        .list.eval(pl.element().str.len_chars())
        .list.arg_max()
    )
    .alias('longest_string')
)
This expression:
pl.col('values')
    .list.eval(pl.element().str.len_chars())
    .list.arg_max()
first maps len_chars over each string in each list with .list.eval, then finds the arg_max (the index of the maximum element, which here is the index of the longest string).
That index is passed to list.get to retrieve the corresponding string from each list.