I have a GCMLE experiment with three learning objectives (call them Task A, Task B, and Task C) within a single model_fn(). The inputs for all three objectives are the same (a body of text to read), and I would like to produce three separate predictions. However, for Task C I would like to mask some of the examples in each batch (roughly 20% per batch). Is the proper way to do this simply to weight the samples I want to mask by zero? Consider this loss function:
import tensorflow as tf

lossA = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(
    labels=labelsA, logits=logitsA))
lossB = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(
    labels=labelsB, logits=logitsB))

# weights: 1.0 where x == y (keep the example), 0.0 where x != y (mask it out)
mask_weights = tf.to_float(tf.equal(x, y))
lossC = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(
    labels=labelsC, logits=logitsC, weights=mask_weights))

loss = lossA + lossB + lossC
Essentially, what I am trying to do is mask any samples in the batch where x != y so that those examples produce no gradient updates to the model for Task C. Does this achieve the desired effect? Is there a better way to implement this behavior?
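As a sanity check, my plan is to verify that the masked rows really contribute no gradient, roughly along these lines (a sketch reusing lossC, logitsC, and mask_weights from the snippet above; the variable names below are just for illustration):

# Sketch: the gradient of lossC w.r.t. logitsC should be all zeros for every
# row whose weight is zero, since those rows are multiplied by 0 in the loss.
grad_logitsC = tf.gradients(lossC, logitsC)[0]   # shape [batch_size, num_classes]
masked_rows_grad = tf.boolean_mask(grad_logitsC, tf.equal(mask_weights, 0.0))

# In a session, masked_rows_grad should evaluate to an all-zero tensor:
# with tf.Session() as sess:
#     assert (sess.run(masked_rows_grad) == 0.0).all()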
I realize that I could split these up into separate experiments, but I would like to have a shared embedding and a single graph that I can upload to the GCMLE prediction service.
To summarize the comments: applying a binary mask to the loss weights, as described in the post, seems to be the appropriate way to exclude those examples from Task C. However, there may be unintended consequences from reducing the effective batch size for Task C that would discourage this approach.
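Note that with the default reduction (tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS), lossC is already averaged only over the rows with non-zero weight (and the outer tf.reduce_mean is a no-op, since tf.losses.sparse_softmax_cross_entropy returns a scalar). So the smaller effective batch does not shrink the loss magnitude; it just means fewer examples back each Task C gradient step. One cheap mitigation, sketched below (the summary tag is an arbitrary name, not something from the discussion), is to log how many Task C examples survive the mask in each batch so the effective batch size stays visible during training:

# Sketch: monitor how many examples survive the Task C mask per batch.
num_unmasked = tf.reduce_sum(mask_weights)   # effective Task C batch size
tf.summary.scalar('taskC/effective_batch_size', num_unmasked)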