python, dataframe, spacy, python-polars

How to save and load spacy encodings in a Polars DataFrame


I want to use spaCy to generate embeddings of text stored in a polars DataFrame and store the results in the same DataFrame. Next, I want to save this DataFrame to disk and be able to load it again as a polars DataFrame. The conversion back from pandas to polars results in an error.

This is the error message:

ArrowInvalid: Could not convert Hello with type spacy.tokens.doc.Doc: did not recognize Python value type when inferring an Arrow data type

Here is my code:

from io import StringIO
import polars as pl
import pandas as pd
import spacy


nlp = spacy.load("de_core_news_sm")
json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'


# Initialize and store DataFrame
df = pl.read_json(StringIO(json_str))
df = df.with_columns(pl.col("foo").map_elements(lambda x: nlp(x)).alias("encoding"))
df.to_pandas().to_pickle('pickled_df.pkl')

# Load DataFrame
df_loaded_pd = pd.read_pickle('pickled_df.pkl')
df_loaded_pl = pl.from_pandas(df_loaded_pd)

These are the package versions I used:

# Name                    Version                   Build  Channel
pandas                    2.2.3           py312hf9745cd_1    conda-forge
polars                    1.9.0           py312hfe7c9be_0    conda-forge
spacy                     3.7.2           py312h6db74b5_0  
spacy-curated-transformers 0.2.2                    pypi_0    pypi
spacy-legacy              3.0.12             pyhd8ed1ab_0    conda-forge
spacy-loggers             1.0.5              pyhd8ed1ab_0    conda-forge

Thank you for your help!


Solution

  • Serializing and deserializing

    spaCy objects within a polars DataFrame can be stored using spaCy's native DocBin class. The following code generates Doc objects, serializes them into the DataFrame, saves it locally, and successfully loads it afterwards.

    from io import StringIO
    from spacy.tokens import DocBin
    import polars as pl
    import spacy
    
    nlp = spacy.load("de_core_news_md")
    json_str = '[{"foo":"Hello","bar":6},{"foo":"What a lovely day","bar":7},{"foo":"Nice to meet you","bar":8}]'
    doc = nlp("some text")
    
    # Serialize polars DataFrame: store each Doc as DocBin bytes in a binary column
    df = pl.read_json(StringIO(json_str))
    df = df.with_columns(pl.col("foo").map_elements(lambda x: DocBin(docs=[nlp(x)]).to_bytes()).alias('binary_embedding'))
    df.write_parquet('saved.pq')
    
    # Deserialize polars DataFrame: rebuild the Doc objects from the stored bytes
    df_loaded = pl.read_parquet('saved.pq')
    df_loaded = df_loaded.with_columns(pl.col('binary_embedding').map_elements(lambda x: list(DocBin().from_bytes(x).get_docs(nlp.vocab))[0]).alias("spacy_embedding"))
    
    # Calculate similarity against the reference doc
    df_loaded.with_columns(pl.col("spacy_embedding").map_elements(lambda x: doc.similarity(x), return_dtype=pl.Float64).alias('Score'))
    
    
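    A quick round-trip check (not part of the original code; it assumes it runs directly after the snippet above) confirms that the deserialized Doc objects carry the original text:

    # Each deserialized Doc should reproduce the original "foo" text.
    texts = [d.text for d in df_loaded.get_column("spacy_embedding").to_list()]
    assert texts == df_loaded.get_column("foo").to_list()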

  • Applying functions to deserialized spaCy objects

    Whether functions such as doc.similarity() can be applied to spaCy objects that were serialized and deserialized with native polars functions (such as df.write_parquet()) heavily depends on the model used. In the case above, the similarity calculation only works with spaCy language models that contain word vectors.

    nlp = spacy.load("de_core_news_sm") # Line 20 does not works
    nlp = spacy.load("de_core_news_md") # Line 20 works
    nlp = spacy.load("de_core_news_lg") # Line 20 works
    nlp = spacy.load("de_dep_news_trf") # Line 20 does not works