How do I return a column of all matching terms or substrings found within a string? I suspect there's a way to do it with pl.any_horizontal(), as suggested in these comments, but I can't quite piece it together.
import re
import polars as pl

terms = ['a', 'This', 'e']
# Current approach: a Python-level regex search applied row by row.
(pl.DataFrame({'col': 'This is a sentence'})
   .with_columns(matched_terms = pl.col('col').map_elements(lambda x: list(set(re.findall('|'.join(terms), x)))))
)
The column should return ['a', 'This', 'e'] (in any order, since the matches pass through a set).
EDIT:
The winning solution here, .str.extract_all('|'.join(terms)).list.unique(), differs from the winning solution to this closely related question, pl.col('col').str.split(' ').list.set_intersection(terms), because .set_intersection() only matches whole list elements; it doesn't find substrings of them (partial rather than full words).
I've included the accompanying term-matching columns for comparison, but the each_term column, built with pl.col('a').str.extract_all('|'.join(terms)), was the best solution for me.
import polars as pl

# Show up to 4 list items per cell so each_term is not truncated.
pl.Config.set_fmt_table_cell_list_len(4)
terms = ['A', 'u', 'bug', 'g']
(pl.DataFrame({'a': 'A bug in a rug.'})
   .select(has_term = pl.col('a').str.contains_any(terms),
           has_term2 = pl.col('a').str.contains('|'.join(terms)),
           has_term3 = pl.any_horizontal(pl.col("a").str.contains(t) for t in terms),
           each_term = pl.col('a').str.extract_all('|'.join(terms)),
           whole_terms = pl.col('a').str.split(' ').list.set_intersection(terms),
           n_matched_terms = pl.col('a').str.count_matches('|'.join(terms)),
          )
)
shape: (1, 6)
┌──────────┬───────────┬───────────┬────────────────────────┬──────────────┬─────────────────┐
│ has_term ┆ has_term2 ┆ has_term3 ┆ each_term              ┆ whole_terms  ┆ n_matched_terms │
│ ---      ┆ ---       ┆ ---       ┆ ---                    ┆ ---          ┆ ---             │
│ bool     ┆ bool      ┆ bool      ┆ list[str]              ┆ list[str]    ┆ u32             │
╞══════════╪═══════════╪═══════════╪════════════════════════╪══════════════╪═════════════════╡
│ true     ┆ true      ┆ true      ┆ ["A", "bug", "u", "g"] ┆ ["A", "bug"] ┆ 4               │
└──────────┴───────────┴───────────┴────────────────────────┴──────────────┴─────────────────┘