Is there a better way to return only the elements of a polars list
that match an item contained in another list?
The following works, but I believe there's probably a more concise/better way:
import polars as pl
terms = ['a', 'z']
(pl.LazyFrame({'a':['x y z']})
   .select(pl.col('a')
             .str.split(' ')
             .list.eval(pl.when(pl.element().is_in(terms))
                          .then(pl.element())
                          .otherwise(None))
             .list.drop_nulls()
             .list.join(' ')
          )
   .collect()
)
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ str │
╞═════╡
│ z   │
└─────┘
For posterity's sake, it replaces my previous attempt using .map_elements():
import polars as pl
import re
terms = ['a', 'z']
(pl.LazyFrame({'a':['x y z']})
   .select(pl.col('a')
             # re.findall matches substrings, so this is not limited to whole words
             .map_elements(lambda x: ' '.join(list(set(re.findall('|'.join(terms), x)))),
                           return_dtype = pl.String)
          )
   .collect()
)
@jqurious and @Dean MacGregor were exactly right; I just wanted to post a solution that explains the differences succinctly:
import polars as pl
terms = ['a', 'z']
(pl.LazyFrame({'a':['x a y zebra']})
   .with_columns(only_whole_terms = pl.col('a')
                                      .str.split(' ')
                                      .list.set_intersection(terms),   # whole tokens only
                 each_term = pl.col('a').str.extract_all('|'.join(terms)),   # every regex match, including substrings
                )
   .collect()
)
shape: (1, 3)
┌─────────────┬──────────────────┬─────────────────┐
│ a           ┆ only_whole_terms ┆ each_term       │
│ ---         ┆ ---              ┆ ---             │
│ str         ┆ list[str]        ┆ list[str]       │
╞═════════════╪══════════════════╪═════════════════╡
│ x a y zebra ┆ ["a"]            ┆ ["a", "z", "a"] │
└─────────────┴──────────────────┴─────────────────┘
Also, this closely related question adds a bit more.