I have a Polars DataFrame `corpus` with one string column and millions of rows. I also have a list of substrings, `substrings`.
I can take a single substring and count how many rows it appears in with:
corpus.select(pl.col('contents').str.contains(substrings[0]).sum()).item()
This works well for one substring, but I have 10,000 substrings to check. What is the most efficient way in Polars to check all of them?
I have considered converting `substrings` into its own Polars DataFrame, joining it against `corpus` on substring presence, grouping by substring, and then counting the size of each group (sketched below). However, this seems very expensive from a RAM-overhead perspective, and I am limited on RAM.
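Concretely, the join idea would look something like this sketch (a cross join plus a filter on presence; the intermediate frame has len(corpus) × len(substrings) rows, which is where my RAM concern comes from):

import polars as pl

corpus = pl.DataFrame({'contents': ['aBMMmcICmY', 'ORqkIJCwjV']})
substrings = pl.DataFrame({'substring': ['a', 'b', 'c']})

# pair every row with every substring, keep the pairs where the
# substring is present, then count the remaining rows per substring
counts = (
    corpus.join(substrings, how='cross')
    .filter(pl.col('contents').str.contains(pl.col('substring'), literal=True))
    .group_by('substring')
    .len()
)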
Is there a better/cleaner way?
Current slow approach:
import polars as pl
substrings = pl.DataFrame({'substring': ['a', 'b', 'c']})
corpus = pl.DataFrame({'contents': ['aBMMmcICmY', 'ORqkIJCwjV', 'JTQHufYApo', 'SNoqiJxpMY', 'SYbEsasrzt', 'XLinDPSRld', 'iInkOGqBDU', 'vBtykwGOqN', 'ZIpOdkkXBd', 'iUokuiefBS']})
def count_occurrences(substring):
    # one full scan of the corpus per substring
    return corpus.select(pl.col('contents').str.contains(substring).sum()).item()
substrings = substrings.with_columns(pl.col('substring').map_elements(count_occurrences).alias('frequency'))
Outputting:
shape: (3, 2)
┌───────────┬───────────┐
│ substring ┆ frequency │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════════╪═══════════╡
│ a ┆ 2 │
│ b ┆ 1 │
│ c ┆ 1 │
└───────────┴───────────┘
substrings = ["ab", "abc", "c"]
df = pl.DataFrame({
"contents": ["abcMMmcICm", "ckIJCwjVab", "JTQHufYcpo", "SNoqabcpMY", "SYbEsasrzt"]
})
`.str.extract_many()` has been the fastest way I've found to do this.
df.with_columns(
pl.col("contents").str.extract_many(substrings).alias("substring")
)
shape: (5, 2)
┌────────────┬──────────────────┐
│ contents ┆ substring │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════════╪══════════════════╡
│ abcMMmcICm ┆ ["ab", "c", "c"] │ # <- did not find 'abc'
│ ckIJCwjVab ┆ ["c", "ab"] │
│ JTQHufYcpo ┆ ["c"] │
│ SNoqabcpMY ┆ ["ab", "c"] │
│ SYbEsasrzt ┆ [] │
└────────────┴──────────────────┘
Matches do not overlap by default, so you need to pass `overlapping=True` to get "all" the matches:
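For example, selecting just the matches (this is the same select used in the complete snippet further down):

df.select(
    pl.col("contents")
    .str.extract_many(substrings, overlapping=True)
    .alias("substring")
)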
shape: (5, 1)
┌─────────────────────────┐
│ substring │
│ --- │
│ list[str] │
╞═════════════════════════╡
│ ["ab", "abc", "c", "c"] │ # <- "c" is also found twice
│ ["c", "ab"] │
│ ["c"] │
│ ["ab", "abc", "c"] │
│ [] │
└─────────────────────────┘
We don't want to count the same row twice, so we add a row index, `.explode()` the lists, and `.unique()` the result. If you then `.group_by()` the substring, the length of each group is the count.
(
df.select(
pl.col("contents").str.extract_many(substrings, overlapping=True)
.alias("substring")
)
.with_row_index()
.explode("substring")
.unique()
.group_by("substring")
.len()
.drop_nulls() # empty lists will be null
)
shape: (3, 2)
┌───────────┬─────┐
│ substring ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════╪═════╡
│ c ┆ 4 │
│ abc ┆ 2 │
│ ab ┆ 3 │
└───────────┴─────┘
Substrings that did not match will not appear in the output; you would need to join them back in if required.
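A minimal sketch of that last step, assuming the final frame above is bound to a `counts` variable and that unmatched substrings should be reported with a count of 0:

# `counts` is the (substring, len) frame produced above
all_subs = pl.DataFrame({"substring": substrings})

# a left join keeps every substring; unmatched ones get a null len,
# which we replace with 0
all_subs.join(counts, on="substring", how="left").with_columns(
    pl.col("len").fill_null(0)
)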