I want to use spaCy to generate embeddings for text stored in a polars DataFrame and store the results in the same DataFrame. Next, I want to save this DataFrame to disk and load it again as a polars DataFrame. Converting the loaded pandas DataFrame back to polars results in an error.
This is the error message:
ArrowInvalid: Could not convert Hello with type spacy.tokens.doc.Doc: did not recognize Python value type when inferring an Arrow data type
Here is my code:
from io import StringIO
import polars as pl
import pandas as pd
import spacy
nlp = spacy.load("de_core_news_sm")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'
# Initialize and store DataFrame
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: nlp(x)).alias("encoding"))
df.to_pandas().to_pickle('pickled_df.pkl')
# Load DataFrame
df_loaded_pd = pd.read_pickle('pickled_df.pkl')
df_loaded_pl = pl.from_pandas(df_loaded_pd)
These are the package versions I used:
# Name                       Version  Build            Channel
pandas                       2.2.3    py312hf9745cd_1  conda-forge
polars                       1.9.0    py312hfe7c9be_0  conda-forge
spacy                        3.7.2    py312h6db74b5_0
spacy-curated-transformers   0.2.2    pypi_0           pypi
spacy-legacy                 3.0.12   pyhd8ed1ab_0     conda-forge
spacy-loggers                1.0.5    pyhd8ed1ab_0     conda-forge
Thank you for your help!
spaCy objects in a polars DataFrame can be stored by using spaCy's native DocBin class. The following code generates Doc objects, saves them to disk, and loads them again successfully.
from io import StringIO
from spacy.tokens import DocBin
import polars as pl
import spacy
nlp = spacy.load("de_core_news_md")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'
doc = nlp("some text")
# Serialize: store each Doc as DocBin bytes in a binary column
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: DocBin(docs=[nlp(x)]).to_bytes()).alias('binary_embedding'))
df.write_parquet('saved.pq')
# Deserialize: rebuild the Doc objects from the stored bytes
df_loaded = pl.read_parquet('saved.pq')
df_loaded = df_loaded.with_columns(pl.col('binary_embedding').map_elements(lambda x: list(DocBin().from_bytes(x).get_docs(nlp.vocab))[0]).alias("spacy_embedding"))
# Calculate similarity between the reference doc and each restored Doc
df_loaded.with_columns(pl.col("spacy_embedding").map_elements(lambda x: doc.similarity(x), return_dtype=pl.Float64).alias('Score'))
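The per-row round trip used in the two lambdas above can also be written as a pair of small helpers, which makes it easier to try in isolation. This is only a sketch: `doc_to_bytes` and `bytes_to_doc` are hypothetical names, and a blank German pipeline (`spacy.blank("de")`, tokenizer only) is assumed so that it runs without a downloaded model.

```python
import spacy
from spacy.tokens import Doc, DocBin

# Assumption: a blank pipeline stands in for spacy.load("de_core_news_md")
# so the sketch runs without a downloaded model.
nlp = spacy.blank("de")


def doc_to_bytes(doc: Doc) -> bytes:
    """Serialize a single Doc by wrapping it in a DocBin (hypothetical helper)."""
    return DocBin(docs=[doc]).to_bytes()


def bytes_to_doc(data: bytes, vocab) -> Doc:
    """Restore the single Doc written by doc_to_bytes (hypothetical helper)."""
    # get_docs() yields Docs lazily; exactly one was stored, so take the first.
    return next(iter(DocBin().from_bytes(data).get_docs(vocab)))


payload = doc_to_bytes(nlp("Nice to meet you"))
restored = bytes_to_doc(payload, nlp.vocab)
print(restored.text)
```

The helpers could then be passed to `map_elements` in place of the lambdas. The bytes are what ends up in the Parquet file, so the loading side always needs a vocab (here `nlp.vocab`) to rebuild the Doc objects.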
Serializing and deserializing spaCy objects with native polars functions (such as df.write_parquet()) depends heavily on the model used. In the case above, the similarity calculation only works with spaCy language models that contain word vectors.
nlp = spacy.load("de_core_news_sm") # similarity calculation does not work
nlp = spacy.load("de_core_news_md") # similarity calculation works
nlp = spacy.load("de_core_news_lg") # similarity calculation works
nlp = spacy.load("de_dep_news_trf") # similarity calculation does not work
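To find out up front whether a pipeline falls into the "works" group, one option is to check the width of its static vector table. The sketch below is a heuristic, not an official check: `has_word_vectors` is a hypothetical helper, and it assumes that `nlp.vocab.vectors_length` is 0 for pipelines without word vectors. A blank pipeline is used as a stand-in for such a model.

```python
import spacy


def has_word_vectors(nlp) -> bool:
    """Heuristic: True if the vocab carries a non-empty static vector table."""
    return nlp.vocab.vectors_length > 0


# Assumption: a blank pipeline stands in for a model without word vectors.
print(has_word_vectors(spacy.blank("de")))
```

Calling it on the loaded model before running the similarity step would let the code fail early with a clear message instead of producing unusable scores.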