
use gensim with a pyarrow iterable


Consider this code:

import pyarrow.parquet as pq
from gensim.models import Word2Vec

parquet_file = pq.ParquetFile('/mybigparquet.pq')
for i in parquet_file.iter_batches(batch_size=100):
    print("training on batch")
    batch = i.to_pandas()

    model = Word2Vec(sentences=batch.tokens, vector_size=100, window=5,
                     workers=40, min_count=10, epochs=10)

As you can see, I am trying to train a word2vec model using a very large parquet file that does not fit entirely into my RAM. I know that gensim can operate on iterables (not generators, since the data must be scanned more than once in word2vec), and I know that pyarrow allows me to read batches (even of a single row) from the file.

Yet, this code is not working correctly. I think I need to write my pyarrow loop as a proper generator but I do not know how to do this.

What do you think? Thanks!


Solution

  • You're creating a new model on every iteration of your loop, so at best you'd end up with a single model trained only on the last batch – probably not what you want.

    Instead, you need to give Word2Vec the kind of re-iterable Python sequence that it wants, where each item is a Python list of string word tokens.

    Unless there's something ready-made in pyarrow/etc, this likely means writing your own class that implements the Python 'iterable' protocol, so that each time an iterator is requested from it, a fresh iteration starts. Here's a reasonable article section about how iterables work. You'd likely make use of the generator pattern to achieve this, as mentioned later in the same article.

    (Essentially: your __iter__() special method will start the iteration and loop over the batches, using yield to return each next item. If your class is specialized for use as a Gensim Word2Vec corpus, each item yielded should be a Python list of string tokens.)

    If you're really only using the one column named 'tokens', there's no need to convert through a Pandas DataFrame – that's just extra conversion/storage overhead. Just ensure the column of interest is processed the right way: if each entry is already a Python list of strings, you're fine, but if it's a single string, you may need to break it into a list of tokens.

    When you're doing it right, you'll pass the entire corpus's iterable into the Word2Vec class just once – it will then read it, over the many passes required for training, by requesting repeated iterations as it needs them. (It has no idea where the data is coming from, or how your iterable is implemented. It just knows it has an object it can request an iterator from, and that iterator yields one item of the right kind on each next(), until reaching the end of one iteration.)
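
    Putting the points above together, here's a minimal sketch of such a re-iterable class (the name `ParquetCorpus` and the assumption that the 'tokens' column already stores lists of strings are mine; to keep the sketch self-contained it first writes a tiny demo file, but in practice you'd point it at your real file, e.g. '/mybigparquet.pq', and raise `batch_size` and `workers`):

    ```python
    import pyarrow as pa
    import pyarrow.parquet as pq
    from gensim.models import Word2Vec

    class ParquetCorpus:
        """Re-iterable corpus: every __iter__() call re-opens the parquet
        file and streams one list of string tokens per row."""
        def __init__(self, path, column='tokens', batch_size=1000):
            self.path = path
            self.column = column
            self.batch_size = batch_size

        def __iter__(self):
            pf = pq.ParquetFile(self.path)
            for batch in pf.iter_batches(batch_size=self.batch_size,
                                         columns=[self.column]):
                # only one column was requested, so it's at index 0;
                # no Pandas detour – each entry is already a list of strings
                yield from batch.column(0).to_pylist()

    # Tiny demo file so the sketch runs end-to-end.
    pq.write_table(
        pa.table({'tokens': [['hello', 'world'],
                             ['big', 'parquet', 'file']] * 50}),
        'demo.pq')

    corpus = ParquetCorpus('demo.pq', batch_size=10)
    # The corpus is handed over once; Word2Vec re-iterates it for the
    # vocabulary scan and again for every training epoch.
    model = Word2Vec(sentences=corpus, vector_size=20, window=5,
                     min_count=1, epochs=5, workers=2)
    ```

    Because `__iter__()` re-opens the file each time it's called, only one batch is ever in RAM, no matter how large the parquet file is.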