
Apply named-entity recognition on specific dataframe columns with Polars


I would like to apply a function to specific columns using Polars, similar to the following question:

The approach in the above question works with pandas, but it takes ages to run on my computer, so I would like to use Polars instead. Taking the example from the above question:

import pandas as pd

df = pd.DataFrame({'source': ['Paul', 'Paul'],
                   'target': ['GOOGLE', 'Ferrari'],
                   'edge': ['works at', 'drive']
                   })

    source  target  edge
0   Paul    GOOGLE  works at
1   Paul    Ferrari drive

Expected outcome with polars:

    source  target  edge      Entity
0   Paul    GOOGLE  works at  Person
1   Paul    Ferrari drive     Person
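
The pandas approach from the above question, applied to the source column:

!python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')
# Collect the entity labels spaCy finds in each value of the column
df['Entities'] = df['source'].apply(lambda sent: [ent.label_ for ent in nlp(sent).ents])
df['Entities'][1]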

How can I add a column with the entity label (e.g. Person) to the current dataframe using Polars? Thank you.


Solution

  • You can run the apply in Polars with the following code:

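    # df_pl is assumed to be the Polars version of df, e.g. pl.from_pandas(df).
    # with_columns returns a new DataFrame; assign the result to keep it.
    df_pl.with_columns(
        entities = pl.col('target').map_elements(
            lambda sent: [ent.label_ for ent in nlp(sent).ents])
    )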
    

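    As @jqurious mentioned, this should not be expected to be faster than pandas: map_elements still calls the Python function once per value, so the spaCy pipeline dominates the runtime. I ran a couple of tests and it took about the same time as pandas.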

    In addition to the comments by @jqurious, if some values are repeated you can reduce the number of times the spaCy pipeline actually runs.

    You can do that by wrapping the call in a function decorated with functools.lru_cache:

    from functools import lru_cache
    import spacy
    import polars as pl
    
    nlp = spacy.load('en_core_web_sm')
    
    # Cache results for up to 1024 distinct strings so repeated values skip the spaCy pipeline
    @lru_cache(1024)
    def cached_nlp(text):
        return nlp(text)
    
    df_pl.with_columns(
        entities = pl.col('target').map_elements(
            lambda sent: [(ent.label_) for ent in cached_nlp(sent).ents])
    )
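
To see what the cache saves without loading spaCy, here is a minimal sketch using only the standard library. The names fake_ner, cached_ner and targets are hypothetical stand-ins (fake_ner plays the role of nlp and returns made-up labels); the point is only that the expensive function runs once per distinct value rather than once per row.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function actually runs


def fake_ner(text):
    """Hypothetical stand-in for nlp(text); returns a made-up label list."""
    global calls
    calls += 1
    return ["ORG"] if text.isupper() else ["PRODUCT"]


@lru_cache(maxsize=1024)
def cached_ner(text):
    # Return a tuple so the cached value cannot be mutated by callers
    return tuple(fake_ner(text))


# Six rows, but only three distinct values
targets = ["GOOGLE", "Ferrari", "GOOGLE", "GOOGLE", "Ferrari", "Tesla"]
labels = [cached_ner(t) for t in targets]

info = cached_ner.cache_info()
print(f"rows={len(targets)} calls={calls} hits={info.hits} misses={info.misses}")
# prints: rows=6 calls=3 hits=3 misses=3
```

The gain depends entirely on how many values repeat: a column of all-distinct strings gets no benefit, and more than 1024 distinct values would start evicting entries from the cache.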