apache-spark, pyspark, nlp, word2vec, apache-spark-mllib

Is the Word2Vec Spark implementation distributed?


I'm relatively new to Spark and am having some difficulty understanding Spark ML.

The problem I have is that I have 3TB of text that I want to train a Word2Vec model on. The server I'm running on has around 1TB of RAM, so I can't hold the whole file in memory.

The file is saved as Parquet, which I import into Spark. My question is: does the Spark ML library distribute the Word2Vec training? If so, is there anything I need to worry about while processing such a large text file? If not, is there any way to stream this data while training Word2Vec?


Solution

  • From this pull request, https://github.com/apache/spark/pull/1719, dating back to 2014, you can glean that parallel processing is possible, per partition.

    Quote:

    To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

    But you have to have partitioned data; see the sketch below.
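
Here is a minimal PySpark sketch of what that looks like in practice. The Parquet path and the `text` column name are assumptions; `numPartitions` is the Word2Vec parameter that controls the per-partition training the quote describes:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.appName("word2vec-training").getOrCreate()

# Reading Parquet yields a distributed DataFrame, so the full corpus
# never has to fit in memory on a single machine at once.
# (Path and column name below are hypothetical.)
df = spark.read.parquet("/path/to/corpus.parquet")

# Word2Vec expects an array-of-strings column, so tokenize first.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(df)

# numPartitions sets how many partitions are trained in parallel and
# then merged; maxIter adds the extra iterations the quote mentions
# for accuracy. Values here are illustrative, not recommendations.
word2vec = Word2Vec(
    vectorSize=100,
    minCount=5,
    numPartitions=64,
    maxIter=3,
    inputCol="words",
    outputCol="vector",
)
model = word2vec.fit(words)
```

Note the trade-off implied by the quoted PR: raising `numPartitions` increases parallelism but, since each partition is trained separately and the models are merged, it can reduce accuracy, which is why multiple iterations (`maxIter`) may be needed.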