I would like to apply a specific function to specific columns using polars similar to the following question:
The approach in that question uses pandas, and it takes ages to run on my computer, so I would like to use polars instead. Taking the example from that question:
df = pd.DataFrame({'source': ['Paul', 'Paul'],
                   'target': ['GOOGLE', 'Ferrari'],
                   'edge': ['works at', 'drive']
                   })
  source   target      edge
0   Paul   GOOGLE  works at
1   Paul  Ferrari     drive
Expected outcome with polars:
  source   target      edge  Entity
0   Paul   GOOGLE  works at  Person
1   Paul  Ferrari     drive  Person
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
df['Entities'] = df['Text'].apply(lambda sent: [(ent.label_) for ent in nlp(sent).ents])
df['Entities'][1]
How can I add a column with the entity label (Person) to the current dataframe using polars? Thank you.
You can run the apply in Polars with the following code:
df_pl.with_columns(
    entities = pl.col('target').map_elements(
        lambda sent: [(ent.label_) for ent in nlp(sent).ents])
)
As @jqurious mentioned, this should not be expected to be faster than pandas, because map_elements still calls the Python function once per row. I ran a couple of tests and it took about the same time as pandas.
In addition to the comments by @jqurious, you could reduce the number of times the apply function is called if some values are repeated.
You can do that by redefining the function with lru_cache:
from functools import lru_cache
import spacy
import polars as pl

nlp = spacy.load('en_core_web_sm')

@lru_cache(1024)
def cached_nlp(text):
    return nlp(text)

df_pl.with_columns(
    entities = pl.col('target').map_elements(
        lambda sent: [(ent.label_) for ent in cached_nlp(sent).ents])
)
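To see what the cache buys you without loading a spaCy model, here is a minimal sketch that uses only the standard library. Note that `fake_nlp` is a hypothetical stand-in for `nlp` (it is not part of spaCy), the labels it returns are made up, and the `targets` list is illustrative data with deliberate repeats.

```python
from functools import lru_cache

call_count = 0  # how many times the "expensive" function actually runs


def fake_nlp(text):
    """Hypothetical stand-in for nlp(); returns made-up labels."""
    global call_count
    call_count += 1
    return ["ORG"] if text.isupper() else []


@lru_cache(maxsize=1024)
def cached_fake_nlp(text):
    # Return a tuple so callers cannot mutate the cached value.
    return tuple(fake_nlp(text))


# Illustrative column values with repeats, as in a real 'target' column.
targets = ["GOOGLE", "Ferrari", "GOOGLE", "GOOGLE", "Ferrari"]
labels = [list(cached_fake_nlp(t)) for t in targets]

info = cached_fake_nlp.cache_info()
print(f"real calls: {call_count}, hits: {info.hits}, misses: {info.misses}")
```

Five lookups result in only two real calls here, one per distinct value. The cache only helps when values repeat; if almost every row is unique, you pay the lookup overhead and get close to zero hits.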