I'm trying to load my pandas DataFrame (df) into a TensorFlow dataset with the following code:
target = df['label']
features = df['encoded_sentence']
dataset = tf.data.Dataset.from_tensor_slices((features.values, target.values))
Here's an excerpt from my pandas dataframe:
+-------+-----------------------+------------------+
| label | sentence              | encoded_sentence |
+-------+-----------------------+------------------+
| 0     | Hello world           | [5, 7]           |
+-------+-----------------------+------------------+
| 1     | my name is john smith | [1, 9, 10, 2, 6] |
+-------+-----------------------+------------------+
| 1     | Hello! My name is     | [5, 3, 9, 10]    |
+-------+-----------------------+------------------+
| 0     | foo baar              | [8, 4]           |
+-------+-----------------------+------------------+
# df.dtypes gives me:
label                 int8
sentence            object
encoded_sentence    object
But it keeps raising a ValueError:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
Can anyone tell me how to use the encoded sentences in my TensorFlow dataset? Help would be greatly appreciated!
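The encoded_sentence column holds Python lists of different lengths, so its values form a NumPy array of dtype object, which TensorFlow cannot convert to a regular tensor. You can convert the pandas values into a ragged tensor first and then build the dataset from it: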
import tensorflow as tf
import pandas as pd
df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
# The rows have different lengths, so stack them into a ragged tensor
features = tf.ragged.stack(list(df['encoded_sentence']))
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
# Each element is one (variable-length sequence, label) pair
for f, t in dataset:
    print(f.numpy(), t.numpy())
Output:
[5 7] 0
[ 1  9 10  2  6] 1
[ 5  3  9 10] 1
[8 4] 0
Note that you may want to use padded_batch to get batches of examples from the dataset.
EDIT: Since padded-batching does not seem to work with a dataset made from a ragged tensor at the moment, you can also convert the ragged tensor to a regular one first:
import tensorflow as tf
import pandas as pd
df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
features_ragged = tf.ragged.stack(list(df['encoded_sentence']))
# Pad every row with -1 up to the length of the longest row
features = features_ragged.to_tensor(default_value=-1)
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
batches = dataset.batch(2)
for f, t in batches:
    print(f.numpy(), t.numpy())
Output:
[[ 5  7 -1 -1 -1]
 [ 1  9 10  2  6]] [0 1]
[[ 5  3  9 10 -1]
 [ 8  4 -1 -1 -1]] [1 0]
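To make the padding step concrete, here is a minimal pure-Python sketch of what to_tensor(default_value=-1) effectively does for this two-dimensional case: every row is right-padded up to the length of the longest row. The helper name pad_right is hypothetical (it is not a TensorFlow or pandas function), and this only illustrates the idea rather than how TensorFlow implements it.

```python
def pad_right(sequences, pad_value=-1):
    """Right-pad each sequence with pad_value so all rows share one length.

    Hypothetical helper for illustration only; not part of TensorFlow.
    """
    # Length of the longest row (0 if there are no rows at all)
    max_len = max((len(seq) for seq in sequences), default=0)
    # Copy each row and append as many pad values as it is short by
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


encoded = [[5, 7], [1, 9, 10, 2, 6], [5, 3, 9, 10], [8, 4]]
padded = pad_right(encoded)
for row in padded:
    print(row)
```

Whatever padding value you choose, it should not collide with a real token ID; in the example data the IDs run from 1 to 10, so -1 is safe.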