I have simple x_train and y_train data:
x_train = [
    array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4]),
    array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10, 4])
]
y_train = [23, 17]
The arrays are NumPy arrays. I am now trying to use the tf.data.Dataset class to load these as tensors.
I have previously done a similar thing successfully using the following code:
dataset = data.Dataset.from_tensor_slices((x_train, y_train))
As this input is fed into an RNN, I used the expand_dims method in the first layer (expand_dimension is wrapped in a function and passed to a Lambda layer to overcome an apparent bug in TensorFlow: see https://github.com/keras-team/keras/issues/5298#issuecomment-281914537):
def expand_dimension(x):
    from tensorflow import expand_dims
    return expand_dims(x, axis=-1)
model = models.Sequential(
    [
        layers.Lambda(expand_dimension, input_shape=[None]),
        layers.LSTM(units=64, activation='tanh'),
        layers.Dense(units=1)
    ]
)
This worked, though only because I had arrays of equal length. In the example I posted instead, the first array has 13 numbers and the second one 18. In that case the method above doesn't work, and the recommended approach seems to be tf.data.Dataset.from_generator.
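For concreteness, here is a minimal reproduction of the failure (my own snippet, not from my actual pipeline; the exact error message may vary across TensorFlow versions):

import tensorflow as tf
import numpy as np

x_train = [
    np.array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4]),
    np.array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10, 4])
]
y_train = [23, 17]

# from_tensor_slices tries to pack the list into one rectangular tensor,
# which fails because the rows have different lengths (13 vs 18).
try:
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
except Exception as e:
    print(type(e).__name__, e)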
Reading How to use the Tensorflow Dataset Pipeline for Variable Length Inputs?, the accepted answer suggests something like the following would work (where I don't care about y_train here, for simplicity):
dataset = tf.data.Dataset.from_generator(lambda: x_train,
                                         tf.as_dtype(x_train[0].dtype),
                                         tf.TensorShape([None, ]))
However, the TensorFlow syntax has changed since that answer was written, and from_generator now requires the output_signature argument (see https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator). I've tried different ways, but I'm finding it hard to understand from the TensorFlow documentation what exactly output_signature should be in my case.
Any help would be much appreciated.
The short answer is that you can define output_signature as follows.
import tensorflow as tf
import numpy as np
x_train = [
    np.array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4]),
    np.array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10, 4])
]
y_train = np.array([23, 17])  # a NumPy array, so that y_train.dtype works below
dataset = tf.data.Dataset.from_generator(
    lambda: x_train,
    output_signature=tf.TensorSpec(
        [None, ],
        dtype=tf.as_dtype(x_train[0].dtype)
    )
)
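Iterating this dataset confirms that each element keeps its own length:

for x in dataset:
    print(x.shape)

which prints (13,) and then (18,).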
I'll also expand on some of what you're doing here to improve your pipeline, this time including y_train.
dataset = tf.data.Dataset.from_generator(
    lambda: zip(x_train, y_train),
    output_signature=(
        tf.TensorSpec([None, ], dtype=tf.as_dtype(x_train[0].dtype)),
        tf.TensorSpec([], dtype=tf.as_dtype(y_train.dtype))
    )
)
for x in dataset:
    print(x)
Which would output:
(<tf.Tensor: shape=(13,), dtype=int64, numpy=array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4])>, <tf.Tensor: shape=(), dtype=int64, numpy=23>)
(<tf.Tensor: shape=(18,), dtype=int64, numpy=
array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10,
4])>, <tf.Tensor: shape=(), dtype=int64, numpy=17>)
Caveat: This can get slightly more complicated if you try to tf.data.Dataset.batch() items, because the elements have different lengths. Then you need to use RaggedTensorSpec instead of TensorSpec (or pad your batches; see the sketch below). Also, I haven't experimented much with feeding ragged tensors into an RNN. But I think those are out of scope for the question you've asked.
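To make the padding route concrete, here is a minimal sketch (my own addition, using the standard tf.data.Dataset.padded_batch API rather than RaggedTensorSpec): each sequence in a batch is padded with zeros up to the length of the longest sequence in that batch, so downstream layers see rectangular tensors.

# Pad x to the longest sequence in each batch; y is a scalar and needs no padding.
padded = dataset.padded_batch(2, padded_shapes=([None], []))
for x_batch, y_batch in padded:
    print(x_batch.shape, y_batch.shape)  # e.g. (2, 18) (2,)

Keep in mind that the padded zeros are real values as far as the LSTM is concerned, unless you add masking (e.g. a Keras Masking layer).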