I am trying to find out how I could train a word2vec model in a federated way.
The data would be split into multiple parts, e.g. 4 "institutions", and I would like to train the word2vec model on the data from each institution separately. The key constraint here is that the data from the institutions cannot be moved to another location, so it can never be trained in a centralized way.
I know that it is possible to train the word2vec model iteratively, such that the data from the first institution is read and used to train & update the word2vec model, but I wonder if it's possible to do it simultaneously on all four institutions and then, for example, to merge all four word2vec models into one model.
Any ideas or suggestions are appreciated
There's no official support in Gensim, so any approach would involve a lot of custom research-like innovation.
Neural models like the word2vec algorithm (though not Gensim's implementation) have been trained in a very-distributed/parallel fashion – see for example 'Hogwild' & related followup work on asynchronous SGD. Very roughly, many separate simultaneous processes train separately & asynchronously, but keep updating each other intermittently, even without locking – & it works OK. (See more links in a prior answer: https://stackoverflow.com/a/66283392/130288.)
But Gensim's Word2Vec offers no built-in support for that mode of training. So it's something a project could try to simulate, or test in practice, though the extra lags/overheads of cross-"institution" updates might make it impractical or ineffective. (And the institutions would still have to agree up front on a shared vocabulary, which without due care would leak aspects of each one's data.)
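As a very rough, assumption-heavy sketch of what such "separate training with intermittent updates" might look like using plain Gensim: each institution could train its own model and periodically average the word vectors of a shared vocabulary across the models. Everything below (corpus lists, parameters, round counts) is illustrative, and the models' internal output weights aren't synchronized, so this only loosely approximates the asynchronous-SGD idea:

```python
# Toy sketch only: 'federated averaging' of shared-vocabulary word vectors
# across per-institution Word2Vec models. Not a Gensim-supported recipe;
# internal weights (e.g. syn1neg) aren't synchronized, so this only loosely
# mimics truly shared asynchronous training.
import numpy as np
from gensim.models import Word2Vec

def federated_rounds(corpora, rounds=5, epochs_per_round=1):
    """corpora: list of per-institution tokenized-sentence lists (never pooled)."""
    # Each institution builds & keeps its own model on its own data.
    models = []
    for sentences in corpora:
        m = Word2Vec(vector_size=100, window=5, min_count=2, workers=2)
        m.build_vocab(sentences)
        models.append(m)

    # Words known to every model; only these are exchanged & averaged.
    shared = set.intersection(*(set(m.wv.key_to_index) for m in models))

    for _ in range(rounds):
        # Local training happens independently at each institution...
        for m, sentences in zip(models, corpora):
            m.train(sentences, total_examples=m.corpus_count,
                    epochs=epochs_per_round)
        # ...then only the vectors of shared words are averaged & pushed back.
        for word in shared:
            avg = np.mean([m.wv[word] for m in models], axis=0)
            for m in models:
                m.wv.vectors[m.wv.key_to_index[word]] = avg
    return models
```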
As you note, you could consider an approach where each institution trains one shared model in serial turns, which could very closely simulate a single training run, albeit with the overhead of passing the interim model around, and no parallelism. Roughly: each institution, in turn, would call .train() on the shared model with its own data, manually managing the item counts & alpha-related values so that the consecutive turns simulate one single SGD run. Note that there'd still be some hints of each institution's relative co-occurrences of terms, which would leak some info about their private datasets – perhaps most clearly for rare terms.
Still, if you weren't in a rush, that'd best simulate a single integrated model training.
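A minimal sketch of those serial turns, assuming Gensim 4.x, a merged word-frequency dict the institutions have agreed to share, and a hypothetical load_private_corpus() helper standing in for each site's private data (in reality the saved model file would be shipped between institutions between turns):

```python
# Sketch only: 'serial turns' over one shared Word2Vec model, with the
# learning-rate (alpha) decay split by hand across the turns so the
# concatenated training roughly behaves like one SGD run.
from gensim.models import Word2Vec

def serial_turns(merged_counts, total_sentences, turns=4):
    # merged_counts: word -> total count across all institutions (shared up front)
    # total_sentences: combined sentence count, also agreed up front
    model = Word2Vec(vector_size=100, window=5, min_count=5, workers=4)
    model.build_vocab_from_freq(merged_counts, corpus_count=total_sentences)

    # One global alpha schedule, sliced into consecutive (start, end) ranges.
    start, end = 0.025, 0.0001
    slices = [(start - (start - end) * i / turns,
               start - (start - end) * (i + 1) / turns) for i in range(turns)]

    for turn, (a_start, a_end) in enumerate(slices):
        # In practice the saved model would be shipped to institution `turn`,
        # which loads it, trains on its private sentences, then ships it onward.
        sentences = load_private_corpus(turn)        # hypothetical helper
        model.train(sentences,
                    total_examples=len(sentences),   # this turn's item count
                    epochs=1,
                    start_alpha=a_start,
                    end_alpha=a_end)
    return model
```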
I'd be tempted to try to fix the sharing concerns with some other trust-creating process or intermediary. (Is there a 3rd party that each could trust with their data, temporarily? Could a single shared training system be created which could only stream the individual datasets in for training, with no chance of saving/summarizing the full data? Might 4 cloud hosts, each under a separate institution's sole management but physically in a shared facility, effect the above 'serial turns' approach with hardly any overhead?)
There's also the potential to map one model into another: taking a number of shared words as reference anchor points, learning a projection from one model to the other, which allows other, non-reference-point words to be moved from one coordinate space to the other. This has been mentioned as a tool for either extending a vocabulary with vectors from elsewhere (e.g. section 2.2 of the Kiros et al 'Skip-Thought Vectors' paper) or doing language translation (the Mikolov et al 'Exploiting Similarities among Languages for Machine Translation' paper).
Gensim includes a TranslationMatrix class for learning such projections. Conceivably the institutions could pick one common dataset, or the one institution with the largest dataset, as the creator of some canonical starting model. Then each institution creates its own model based on its private data. Then, based on some set of 'anchor words' (that are assumed to have stable meaning across all models, perhaps because they are very common), each of these followup models is projected into the canonical space, allowing words that are unique to each model to be moved into the shared model, or words that vary a lot across models to be projected to contrasting points in the same space (which it might then make sense to average together).
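A rough sketch of that workflow with Gensim's TranslationMatrix, where the model filenames, the anchor-word list, and the probe word are all illustrative placeholders:

```python
# Sketch only: project one institution's private model into a shared
# 'canonical' space via anchor words present in both models.
from gensim.models import Word2Vec
from gensim.models.translation_matrix import TranslationMatrix

canonical = Word2Vec.load("canonical.model").wv       # agreed starting model
local = Word2Vec.load("institution_a.model").wv       # one institution's model

# Anchor words assumed to be common & stable in meaning across both models.
anchors = ["the", "and", "of", "people", "time", "year"]
word_pairs = [(w, w) for w in anchors if w in local and w in canonical]

# Learn a linear projection from the local space into the canonical space.
trans = TranslationMatrix(local, canonical, word_pairs=word_pairs)

# See which canonical-space words a local-only term lands nearest to.
print(trans.translate(["some_local_only_term"], topn=5))
```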