
Polars efficient list-of-substrings counting


I have a Polars dataframe corpus with one string column, and millions of rows.
I also have a list of substrings, substrings.

I can take a substring and query in how many rows that substring appears with:

corpus.select(pl.col('contents').str.contains(substrings[0]).sum()).item()

This works well for one substring, but I have 10,000 substrings to check. What is the most efficient way in Polars to check all of them?

I have considered converting substrings into its own Polars dataframe, performing an inner join on substring presence, grouping by substring, and then counting the size of the groups. However, this seems very expensive from a RAM overhead perspective, and I am limited on RAM.

Is there a better/cleaner way?

Current slow approach:

import polars as pl

substrings = pl.DataFrame({'substring': ['a', 'b', 'c']})
corpus = pl.DataFrame({'contents': ['aBMMmcICmY', 'ORqkIJCwjV', 'JTQHufYApo', 'SNoqiJxpMY', 'SYbEsasrzt', 'XLinDPSRld', 'iInkOGqBDU', 'vBtykwGOqN', 'ZIpOdkkXBd', 'iUokuiefBS']})

def count_occurrences(substring):
    return corpus.select(pl.col('contents').str.contains(substring).sum()).item()

substrings = substrings.with_columns(pl.col('substring').map_elements(count_occurrences).alias('frequency'))

Outputting:

shape: (3, 2)  
┌───────────┬───────────┐  
│ substring ┆ frequency │  
│ ---       ┆ ---       │  
│ str       ┆ i64       │  
╞═══════════╪═══════════╡  
│ a         ┆ 2         │  
│ b         ┆ 1         │  
│ c         ┆ 1         │  
└───────────┴───────────┘  

Solution

  • import polars as pl
    
    substrings = ["ab", "abc", "c"]
    
    df = pl.DataFrame({
        "contents": ["abcMMmcICm", "ckIJCwjVab", "JTQHufYcpo", "SNoqabcpMY", "SYbEsasrzt"]
    })
    

    .str.extract_many() has been the fastest way I've found to do this.

    df.with_columns(
        pl.col("contents").str.extract_many(substrings).alias("substring")
    )
    
    shape: (5, 2)
    ┌────────────┬──────────────────┐
    │ contents   ┆ substring        │
    │ ---        ┆ ---              │
    │ str        ┆ list[str]        │
    ╞════════════╪══════════════════╡
    │ abcMMmcICm ┆ ["ab", "c", "c"] │ # <- did not find 'abc'
    │ ckIJCwjVab ┆ ["c", "ab"]      │
    │ JTQHufYcpo ┆ ["c"]            │
    │ SNoqabcpMY ┆ ["ab", "c"]      │
    │ SYbEsasrzt ┆ []               │
    └────────────┴──────────────────┘
    

    You need to pass overlapping=True to get "all" the matches.

    shape: (5, 1)
    ┌─────────────────────────┐
    │ substring               │
    │ ---                     │
    │ list[str]               │
    ╞═════════════════════════╡
    │ ["ab", "abc", "c", "c"] │ # <- "c" is also found twice
    │ ["c", "ab"]             │
    │ ["c"]                   │
    │ ["ab", "abc", "c"]      │
    │ []                      │
    └─────────────────────────┘
    

    We don't want to count the same row twice, so we add a row index, then .explode() and .unique().

    If you .group_by() the substring, the length of the group is the count.

    (
        df.select(
            pl.col("contents").str.extract_many(substrings, overlapping=True)
              .alias("substring")
        )
        .with_row_index()
        .explode("substring")
        .unique()
        .group_by("substring")
        .len()
        .drop_nulls() # empty lists will be null
    )
    
    shape: (3, 2)
    ┌───────────┬─────┐
    │ substring ┆ len │
    │ ---       ┆ --- │
    │ str       ┆ u32 │
    ╞═══════════╪═════╡
    │ c         ┆ 4   │
    │ abc       ┆ 2   │
    │ ab        ┆ 3   │
    └───────────┴─────┘
    

    Substrings that did not match will not be in the output; you would need to join them back in if required.