
Perform lookup in tensorflow map


I have a TF Dataset with the following schema:

tf_features = {
 'searched_destination_id': tf.io.FixedLenFeature([], tf.int64, default_value=0),
 'booked_acc_id': tf.io.FixedLenFeature([], dtype=tf.int64, default_value=0),
 'user_id': tf.io.FixedLenFeature([], dtype=tf.int64, default_value=0),
}

I also have a dict like:

candidates = {'111': [123, 444, ...], '222': [555, 888, ...]...}

I'd like to perform a map operation in the following way:


ds.map(lambda x, y: {**x, 'candidates': candidates[x['searched_destination_ufi'].numpy()]})

However, I always get: AttributeError: 'Tensor' object has no attribute 'numpy'

When I remove the .numpy() call, I get: TypeError: Tensor is unhashable. Instead, use tensor.ref() as the key.

Can you suggest a solution?


Solution

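  • Dataset.map traces the mapped function and runs it in graph mode, where tensors are symbolic: calling .numpy() is not possible, and a tensor cannot be used as a dict key. You can wrap the dictionary lookup in tf.py_function, which runs eagerly, to add the candidates to each element of your dataset: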

    import tensorflow as tf
    
    tf_features = {
     'searched_destination_ufi': ['111', '222'],
     'booked_hotel_ufi': [2, 4],
     'user_id': [3, 2]
    }
    
    ds = tf.data.Dataset.from_tensor_slices(tf_features)
    
    candidates = {'111': [123, 444], '222': [555, 888]}
    
    def py_func(x):
      x = x.numpy().decode('utf-8')
      return candidates[x]
    
    
    ds = ds.map(lambda x: {**x, 'candidates': tf.py_function(py_func, [x['searched_destination_ufi']], [tf.int32]*2)})
    for x in ds:
      print(x)
    
    {'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'111'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'candidates': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([123, 444], dtype=int32)>}
    {'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'222'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'candidates': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([555, 888], dtype=int32)>}
    

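    Note that Tout=[tf.int32]*2 matches the length of the lists in candidates: py_func returns a Python list of two integers, and each one is declared as a separate tf.int32 output. This only works when every list has the same, known length.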
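
As a variation on the tf.py_function approach, the sketch below declares a single tf.int32 output instead of one output per list element, so the candidate lists do not need a fixed length. It assumes that returning a tensor (rather than a Python list) from the wrapped function makes tf.py_function treat the result as one output. The helper names lookup_candidates and add_candidates are hypothetical, and the variable-length candidates dict is illustrative data.

```python
import tensorflow as tf

# Illustrative data; the second list is deliberately longer than the first.
candidates = {'111': [123, 444], '222': [555, 888, 323]}

def lookup_candidates(key):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    ids = candidates[key.numpy().decode('utf-8')]
    # Return one tensor so the whole list is a single int32 output.
    return tf.constant(ids, dtype=tf.int32)

def add_candidates(x):
    result = tf.py_function(
        lookup_candidates, [x['searched_destination_ufi']], tf.int32)
    # tf.py_function drops static shape information; declare a 1-D result.
    result.set_shape([None])
    return {**x, 'candidates': result}

ds = tf.data.Dataset.from_tensor_slices(
    {'searched_destination_ufi': ['111', '222'], 'user_id': [3, 2]})
ds = ds.map(add_candidates)

rows = [row['candidates'].numpy().tolist() for row in ds]
print(rows)
```

Because the resulting elements can differ in length, they would need padding (for example with Dataset.padded_batch) before being batched.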

    For an approach that avoids the Python callback, you can use tf.lookup.StaticHashTable and tf.gather, which both work in graph mode. The table maps each key to a row index, and tf.gather selects that row from values:

    import tensorflow as tf
    
    tf_features = {
     'searched_destination_ufi': ['111', '222'],
     'booked_hotel_ufi': [2, 4],
     'user_id': [3, 2]
    }
    
    ds = tf.data.Dataset.from_tensor_slices(tf_features)
    
    candidates = {'111': [123, 444], '222': [555, 888]}
    keys = list(candidates.keys())
    values = tf.constant(list(candidates.values()))
    
    table = tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(tf.constant(keys), tf.range(len(keys))),
        default_value=-1)
    
    ds = ds.map(lambda x: {**x, 'candidates': tf.gather(values, [table.lookup(x['searched_destination_ufi'])])})
    for x in ds:
      print(x)
    
    {'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'111'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'candidates': <tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[123, 444]], dtype=int32)>}
    {'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'222'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'candidates': <tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[555, 888]], dtype=int32)>}
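
In the hash-table example above, wrapping the lookup result in a list gives candidates a leading dimension of 1, and a key that is missing from the table maps to the default index -1, which is not a valid row index for tf.gather. The sketch below is one way to handle both, assuming an all-zero fallback row is acceptable for unknown keys. The key '999' and the helper name add_candidates are hypothetical.

```python
import tensorflow as tf

candidates = {'111': [123, 444], '222': [555, 888]}
keys = list(candidates.keys())

# Assumption: unknown keys should map to an all-zero row, stored at
# index len(keys) right after the real candidate rows.
values = tf.constant(list(candidates.values()) + [[0, 0]])

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(keys), tf.range(len(keys))),
    default_value=len(keys))

def add_candidates(x):
    # Scalar row index; passing it unwrapped keeps the result at shape (2,).
    row = table.lookup(x['searched_destination_ufi'])
    return {**x, 'candidates': tf.gather(values, row)}

# '999' is a made-up key that is not present in candidates.
ds = tf.data.Dataset.from_tensor_slices(
    {'searched_destination_ufi': ['111', '222', '999']})
ds = ds.map(add_candidates)

rows = [row['candidates'].numpy().tolist() for row in ds]
print(rows)
```

Whether a zero row, an empty list, or filtering out unknown keys is the right fallback depends on how the candidates are consumed downstream.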
    

    If the candidate lists have variable length, use a ragged tensor with the second approach; the rest of the code remains the same:

    candidates = {'111': [123, 444], '222': [555, 888, 323]}
    keys = list(candidates.keys())
    values = tf.ragged.constant(list(candidates.values()))
    
    {'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'111'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'candidates': <tf.RaggedTensor [[123, 444]]>}
    {'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'222'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'candidates': <tf.RaggedTensor [[555, 888, 323]]>}
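
Since only the changed lines are shown for the ragged variant, here is a sketch of the full pipeline assembled from the snippets above. It uses the same example data; to_list() is called only to collect the results as plain Python lists.

```python
import tensorflow as tf

tf_features = {
    'searched_destination_ufi': ['111', '222'],
    'booked_hotel_ufi': [2, 4],
    'user_id': [3, 2],
}
ds = tf.data.Dataset.from_tensor_slices(tf_features)

# Candidate lists of different lengths are stored as a ragged tensor.
candidates = {'111': [123, 444], '222': [555, 888, 323]}
keys = list(candidates.keys())
values = tf.ragged.constant(list(candidates.values()))

# Map each key to the index of its row in values.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(keys), tf.range(len(keys))),
    default_value=-1)

ds = ds.map(lambda x: {
    **x,
    'candidates': tf.gather(values, [table.lookup(x['searched_destination_ufi'])]),
})

# Each element holds a RaggedTensor with a single row.
rows = [row['candidates'].to_list() for row in ds]
print(rows)
```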