apache-spark, pyspark, pickle, countvectorizer, dill

PySpark: Can't pickle CountVectorizerModel - TypeError: Cannot serialize socket object (but why is the socket library being used?)


I noticed that, unlike the scikit-learn version, the PySpark implementation of CountVectorizer uses the socket library, so I'm unable to pickle it.

Is there any way around this, or another way to persist the vectorizer? I need the fitted model because I take in new input text data that I want to convert into the same kind of word vectors as are used in the test data.

I tried looking at the CountVectorizer source code and couldn't see any obvious use of the socket library.

Any ideas are appreciated, thanks!

Here's me trying to pickle the model:

# pickle here is actually dill, imported as pickle in the snippet below
with open("vectorized_model.pkl", "wb") as output_file:
    pickle.dump(vectorized_model, output_file)

Resulting in: TypeError: Cannot serialize socket object

Here's the original creation of the model:

from pyspark.ml.feature import CountVectorizer
import dill as pickle

# Configure the vectorizer and fit it on the tokenized training text
vectorizer = CountVectorizer()
vectorizer.setInputCol("TokenizedText")
vectorizer.setOutputCol("Tfidf")
vectorized_model = vectorizer.fit(training_data)
vectorized_model.setInputCol("TokenizedText")

Solution

  • So I realized that, instead of pickling, I can use vectorized_model.save() and CountVectorizerModel.load() to persist and retrieve the fitted model; a short sketch follows below.
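
A minimal sketch of that approach, assuming the vectorized_model from above and an active Spark session; the path "count_vectorizer_model" and the new_tokenized_data DataFrame are placeholder names:

from pyspark.ml.feature import CountVectorizerModel

# Persist the fitted model to a directory; the path is just an example
# and can be any location Spark can write to (local or HDFS).
# overwrite() avoids an error if the directory already exists.
vectorized_model.write().overwrite().save("count_vectorizer_model")

# Later (e.g. in a fresh session), load it back and transform new data
# with the same vocabulary learned during fit().
reloaded_model = CountVectorizerModel.load("count_vectorizer_model")

# new_tokenized_data is a placeholder DataFrame with a "TokenizedText" column
vectorized_new_data = reloaded_model.transform(new_tokenized_data)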