I am currently stuck in a dead end. I am trying to build an image caption generator using a federated approach. My initial idea was to have a different tokenizer for each client. However, that raises these issues:
Every client will have a differently sized vocabulary, and thus a different shape of y, which will cause issues with the global model configuration.
To counter the above issue, I could make the size of y in each client equal to the largest size across all clients and fill the extra columns in each client with 0. Example: [0,1,1,1] mapped to a size of 6 would become [0,1,1,1,0,0].
This brings me to the last possible flaw: the same word will have different indices in different clients. The word "rock" might have an index of 6 in client 1, while the same word has an index of 9 in another client. While training the global model, this will cause issues, since the model is trying to learn different label indices for the same word, which I expect will hurt accuracy.
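To make the index problem concrete, here is a minimal sketch in plain Python. The captions, the `build_vocab` and `multi_hot` helpers, and the first-appearance indexing are all made up for illustration; a real tokenizer may assign indices differently, but the mismatch is the same.

```python
def build_vocab(captions):
    """Assign indices in order of first appearance, as a per-client tokenizer might."""
    vocab = {}
    for caption in captions:
        for word in caption.lower().split():
            # setdefault keeps the existing index if the word was already seen.
            vocab.setdefault(word, len(vocab))
    return vocab


def multi_hot(caption, vocab, size):
    """Encode a caption as a 0/1 vector of length `size`, zero-padded past len(vocab)."""
    vec = [0] * size
    for word in caption.lower().split():
        vec[vocab[word]] = 1
    return vec


# Hypothetical captions held by two clients.
client_1_captions = ["a dog on a rock"]
client_2_captions = ["a man climbs a big grey rock"]

vocab_1 = build_vocab(client_1_captions)
vocab_2 = build_vocab(client_2_captions)

# Pad both label vectors to the largest vocabulary size.
size = max(len(vocab_1), len(vocab_2))
y_1 = multi_hot("rock", vocab_1, size)
y_2 = multi_hot("rock", vocab_2, size)

print(vocab_1["rock"], y_1)
print(vocab_2["rock"], y_2)
```

Padding makes the shapes agree, but the two vectors still put "rock" in different positions, so the global model would see two different labels for the same word.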
This brings me to the final question: Is it against the idea of Federated Learning to tokenize all the words of all the training clients in a single tokenizer?
It depends. In Federated Learning, if every participant holds the same value, that value can be thought of as public information. A global vocabulary definition could fit this criterion.
For example, consider the tff.federated_broadcast intrinsic, which sends every client the same value. Each participant reveals nothing to the server or to the other participants about its own data. This is how the global model is served to the clients in algorithms in the FedAvg family. All clients start from the same model weights, and sending a mapping of strings to token ids would not reveal additional information about a particular user.
That said, technologies such as Private Information Retrieval protocols could be used to send different data to each client without clients revealing to the server what they are asking for. TFF has initial stubs for such protocols in the tff.federated_secure_select intrinsic. The tutorial "Client-efficient large-model federated learning via federated_select and sparse aggregation" has examples.
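The broadcast idea, applied to a vocabulary, can be sketched in plain Python. This is a simulation, not TFF code: `broadcast` is a stand-in for what a broadcast does conceptually, and `PUBLIC_WORDS`, the `<unk>` token, and the helper names are assumptions made for the example. It also assumes the server builds the vocabulary from a public word list rather than from client data.

```python
# Assumption: the server owns a fixed, public word list (e.g. from a public corpus).
PUBLIC_WORDS = ["a", "dog", "man", "rock", "on", "climbs"]
OOV = "<unk>"  # hypothetical out-of-vocabulary token


def make_global_vocab(words):
    """Server side: index 0 is reserved for OOV, then words in a fixed order."""
    vocab = {OOV: 0}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def broadcast(value, num_clients):
    """Stand-in for a broadcast: every client receives a copy of the same value."""
    return [dict(value) for _ in range(num_clients)]


def encode(caption, vocab):
    """Client side: map words to shared ids, falling back to the OOV id."""
    return [vocab.get(word, vocab[OOV]) for word in caption.lower().split()]


global_vocab = make_global_vocab(PUBLIC_WORDS)
vocab_1, vocab_2 = broadcast(global_vocab, num_clients=2)

ids_1 = encode("a dog on a rock", vocab_1)
ids_2 = encode("a man climbs a big rock", vocab_2)

print(ids_1)
print(ids_2)
```

Because both clients received the same mapping, "rock" gets the same id on each of them, and a word missing from the public list ("big" here) maps to the OOV id instead of getting a client-specific index.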
Where one needs to be careful is in the aggregation step (when clients send their model updates back to the server). As you noticed, a global vocabulary will be necessary; otherwise different clients will learn different parameters for different words, and it will be unclear how to combine them later. However, if I'm the only participant with the word foo, it's possible my model update will reveal the fact that I have that word (or otherwise memorize something about my data: https://xkcd.com/2169/). In this case one can combine FL with Differential Privacy to improve the privacy of the model. The tutorial "Differential Privacy in TFF" has examples of how this can be done in TFF.
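As a rough illustration of the usual idea behind differentially private aggregation, the sketch below clips each client update to a maximum L2 norm and adds Gaussian noise to the sum before averaging. It is plain Python, not the TFF API; `clip_norm`, `noise_multiplier`, and the toy updates are assumptions chosen for the example, and there is no privacy accounting here, so it offers no formal guarantee on its own.

```python
import math
import random


def clip(update, clip_norm):
    """Scale `update` down so its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Sum clipped updates, add Gaussian noise per coordinate, then average."""
    clipped = [clip(u, clip_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    stddev = noise_multiplier * clip_norm  # noise scaled to one client's max contribution
    return [(s + rng.gauss(0.0, stddev)) / n for s in summed]


# Toy updates over three vocabulary rows; only the third client touches the last row
# (think of it as the row for the rare word "foo").
updates = [
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.0, 5.0],
]

rng = random.Random(0)
no_noise = dp_average(updates, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)

print(no_noise)
print(noisy)
```

Clipping bounds how much any single client can move the average, and the noise is meant to mask whether that client's contribution was present at all. Choosing the noise level and tracking the resulting privacy budget is the part the TFF tutorial covers.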