[SOLVED] Dealing with missing values in tensorflow

I need some guidance on how to approach imputation in TensorFlow/deep learning. I am familiar with how scikit-learn handles imputation, and when I map that to the TensorFlow ecosystem, I would expect to use preprocessing layers in Keras or functions in TensorFlow Transform to do the imputation. However, as far as I know, these functions do not exist. So I have a few questions:

1. Is there a reason tied to how deep learning works that these functions do not exist (for example, dense sampling needs to be as accurate as possible, and you have a large amount of data, hence imputation is never required)?
2. If it is not #1, how should one handle imputation in tensorflow? For example, during serving, your input could be missing data, and there's nothing you can do about that. I would think integrating it into preprocessing_fn would be the thing to do.
3. Is it possible to have the graph do different things during training and serving? For example, train on no missing values data, and if during serving you encounter that situation, do something like ignore that value or set it to a specified default.

Thank you!

Solution

Please refer to "Mean imputation for missing data" to see how to replace missing values in your data with the feature mean.

In the example below, x is a feature, represented as a tf.SparseTensor in the preprocessing_fn. In order to convert it to a dense tensor, we compute its mean, and set the mean to be the default value when it is missing from an instance.

To answer your third question: TensorFlow Transform builds the transformations into the TensorFlow graph for your model, so the same transformations are performed at training and inference time. For your use case, the imputation example below works because the default_value parameter supplies the value for any index that is not specified. If default_value is not set, it defaults to zero.

Example Code:

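import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = inputs['x']  # tf.SparseTensor; instances without a value are the missing ones
  return {
      # Densify x, filling missing entries with the mean computed over the dataset.
      'x_out': tft.sparse_tensor_to_dense_with_shape(
          x, default_value=tft.mean(x), shape=[None, 1])
  }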
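
To make the training/serving point more concrete, here is a rough plain-Python sketch of the same idea. It does not use TensorFlow Transform and is not how the library works internally. It only illustrates the split the answer relies on: the mean is computed once from the training data, stored, and then the identical fill step is applied at training and at serving. The helper names fit_mean and impute, and the use of None to mark a missing value, are assumptions made up for this illustration.

```python
from typing import List, Optional, Sequence


def fit_mean(values: Sequence[Optional[float]]) -> float:
    """Analysis step: mean of the observed (non-missing) training values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot compute a mean with no observed values")
    return sum(present) / len(present)


def impute(values: Sequence[Optional[float]], fill: float) -> List[float]:
    """Transform step: replace each missing entry with the stored fill value."""
    return [fill if v is None else v for v in values]


# Training: the mean is computed once, over the training data only.
train_x = [1.0, None, 3.0, 5.0]
train_mean = fit_mean(train_x)
train_dense = impute(train_x, train_mean)

# Serving: reuse the stored training mean; nothing is recomputed from the request.
serving_x = [None, 10.0]
serving_dense = impute(serving_x, train_mean)
```

In this sketch the serving path never needs to see the training data, only the stored mean. That is the property you want from a preprocessing step that is baked into the serving graph.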