Consider this code:
import pyarrow.parquet as pq
from gensim.models import Word2Vec

parquet_file = pq.ParquetFile('/mybigparquet.pq')
for i in parquet_file.iter_batches(batch_size=100):
    print("training on batch")
    batch = i.to_pandas()
    model = Word2Vec(sentences=batch.tokens, vector_size=100, window=5, workers=40, min_count=10, epochs=10)
As you can see, I am trying to train a word2vec model using a very large parquet file that does not fit entirely into my RAM. I know that gensim can operate on iterables (not generators, as the data need to be scanned twice in word2vec), and I know that pyarrow allows me to generate batches (of even one row) from the file.

Yet, this code is not working correctly. I think I need to write my pyarrow loop as a proper generator, but I do not know how to do this.

What do you think? Thanks!
You're creating a new model every iteration of your loop, so at best you'd end up with only a single model from the last iteration – probably not what you want.

Instead, you need to give Word2Vec the kind of re-iterable Python sequence that it wants, where each item is a Python list of string word tokens.
Unless there's something in pyarrow (or a related library) that already provides this, that likely means writing your own class that implements the Python 'iterable' interface, so that each time an iterator is requested from it, it starts a new iteration. Here's a reasonable article section about how iterables work. You'd be likely to make use of the generator pattern to achieve this, as mentioned later in the same article.
(Essentially: your __iter__() special method will start the iteration, looping over the batches and always using yield to return the single next item. If your class is specialized for use as a Gensim Word2Vec corpus, each item yielded should be a Python list of string tokens.)
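For example, a minimal sketch of such a class might look like the following. The class name, the batch size, and the assumption that your parquet column of token lists is named 'tokens' are all illustrative; adjust them to your data.

import pyarrow.parquet as pq

class ParquetCorpus:
    """Re-iterable corpus: each call to __iter__() restarts the scan,
    so Word2Vec can make its multiple passes over the data."""
    def __init__(self, path, batch_size=10000):
        self.path = path
        self.batch_size = batch_size

    def __iter__(self):
        # Re-open the file on every iteration, reading only the needed column.
        parquet_file = pq.ParquetFile(self.path)
        for batch in parquet_file.iter_batches(batch_size=self.batch_size,
                                               columns=['tokens']):
            for tokens in batch.column('tokens').to_pylist():
                yield tokens  # one list of string tokens per row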
If you're really only using one column named 'tokens', there's no need to convert things through a Pandas DataFrame - that's just extra conversion/storage overhead. Just ensure the one column of interest is processed the right way: if each entry is already a Python list-of-strings, you're fine, but if it's a single string, you may have to break it into a list of tokens.
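As a quick way to check what that column actually holds, you could peek at the first batch without any Pandas round-trip. The split() here is just a stand-in for whatever tokenization your strings really need:

import pyarrow.parquet as pq

pf = pq.ParquetFile('/mybigparquet.pq')
first_batch = next(pf.iter_batches(batch_size=100, columns=['tokens']))

rows = first_batch.column('tokens').to_pylist()  # plain Python values, no DataFrame
# Ensure each item is a list of string tokens; split on whitespace if it's one string.
texts = [r if isinstance(r, list) else r.split() for r in rows]
print(texts[:3])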
When you're doing it right, you'll only pass the entire corpus's iterable into the Word2Vec class once - it will then read it, over the many passes required for training, by requesting repeated iterations as it needs them. (It has no idea where the data is coming from, or how your iterable is implemented. It just knows it's got an object it can request an iterator from, and that each next() on that iterator gives it one item of the right kind, until reaching the end of one iteration.)
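Putting it together, under the assumptions above (the ParquetCorpus class sketched earlier, and the parameters from your original snippet), the whole training step would then be just:

from gensim.models import Word2Vec

corpus = ParquetCorpus('/mybigparquet.pq', batch_size=10000)

# Hand the iterable over once; gensim re-iterates it for the vocabulary
# scan and then again for each training epoch.
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 workers=40, min_count=10, epochs=10)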