numpy, tensorflow, dataset, data-pipeline

Feeding .npy (numpy files) into tensorflow data pipeline


TensorFlow seems to lack a reader for ".npy" files. How can I read my data files into the new tf.data.Dataset pipeline? My data doesn't fit in memory.

Each object is saved in a separate ".npy" file. Each file contains two different ndarrays as features and a scalar as their label.


Solution

  • Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:

    Consuming NumPy arrays

    If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().

    # Load the training data into two NumPy arrays, for example using `np.load()`.
    # Note: the context-manager form works for `.npz` archives, which `np.load`
    # opens lazily; loading a plain `.npy` file returns a bare ndarray instead.
    with np.load("/var/data/training_data.npz") as data:
      features = data["features"]
      labels = data["labels"]
    
    # Assume that each row of `features` corresponds to the same row as `labels`.
    assert features.shape[0] == labels.shape[0]
    
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    

    If the data doesn't fit into memory, the only recommended approach seems to be to first convert the .npy data into the TFRecord format, and then use a TFRecordDataset, which can be streamed from disk without fully loading into memory.

    Here is a post with some instructions.
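    As a rough sketch of that conversion: the code below serializes each per-object file into one `tf.train.Example` and parses it back when streaming. The key names `features_1`, `features_2`, and `label` are assumptions matching the question's two-ndarrays-plus-scalar layout, and the objects are assumed to be stored as `.npz` archives; adjust to your actual save format.

    ```python
    import numpy as np
    import tensorflow as tf

    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    def _float_feature(value):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

    def npz_to_tfrecord(npz_paths, tfrecord_path):
        # Serialize each per-object .npz file into one tf.train.Example record.
        with tf.io.TFRecordWriter(tfrecord_path) as writer:
            for path in npz_paths:
                with np.load(path) as data:  # key names are assumptions
                    example = tf.train.Example(features=tf.train.Features(feature={
                        "features_1": _bytes_feature(data["features_1"].astype(np.float32).tobytes()),
                        "features_2": _bytes_feature(data["features_2"].astype(np.float32).tobytes()),
                        "label": _float_feature(float(data["label"])),
                    }))
                writer.write(example.SerializeToString())

    # Spec for parsing one serialized Example back into ((features_1, features_2), label).
    _FEATURE_SPEC = {
        "features_1": tf.io.FixedLenFeature([], tf.string),
        "features_2": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.float32),
    }

    def parse_example(serialized):
        ex = tf.io.parse_single_example(serialized, _FEATURE_SPEC)
        f1 = tf.io.decode_raw(ex["features_1"], tf.float32)
        f2 = tf.io.decode_raw(ex["features_2"], tf.float32)
        return (f1, f2), ex["label"]
    ```

    After conversion, `tf.data.TFRecordDataset(tfrecord_path).map(parse_example)` streams records one at a time, so memory use stays bounded regardless of dataset size.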

    FWIW, it seems crazy to me that a TFRecordDataset cannot be instantiated directly from a directory name or a list of .npy files, but that appears to be a limitation of plain TensorFlow.

    If you can split the single large .npy file into smaller files that each roughly represent one training batch, then you could write a custom data generator in Keras that yields only the data needed for the current batch.
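    A minimal sketch of such a generator, subclassing `tf.keras.utils.Sequence` (the file layout and the `features`/`label` key names are assumptions; here each object is stored as a `.npz` archive):

    ```python
    import math
    import numpy as np
    import tensorflow as tf

    class NpzBatchSequence(tf.keras.utils.Sequence):
        """Keras data generator that loads only the per-object `.npz`
        files needed for the current batch."""

        def __init__(self, file_paths, batch_size):
            super().__init__()
            self.file_paths = list(file_paths)
            self.batch_size = batch_size

        def __len__(self):
            # Number of batches per epoch.
            return math.ceil(len(self.file_paths) / self.batch_size)

        def __getitem__(self, idx):
            batch = self.file_paths[idx * self.batch_size:(idx + 1) * self.batch_size]
            feats, labels = [], []
            for path in batch:
                with np.load(path) as data:  # `.npz` archives support the context manager
                    feats.append(data["features"])
                    labels.append(data["label"])
            return np.stack(feats), np.asarray(labels)
    ```

    You would then pass an instance straight to training, e.g. `model.fit(NpzBatchSequence(paths, batch_size=32))`, and Keras requests one batch at a time via `__getitem__`.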

    In general, if your dataset cannot fit in memory, storing it as one single large .npy file makes it very hard to work with; preferably you should reformat the data first, either as TFRecord or as multiple smaller .npy files, and then use other methods.
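    Once the data is split into multiple per-object files, the streaming can also be done directly with `tf.data.Dataset.from_generator`, which pulls one file from disk at a time. A sketch under the same assumptions as above (per-object `.npz` archives with hypothetical keys `features_1`, `features_2`, `label`; `output_signature` requires TF 2.4+):

    ```python
    import numpy as np
    import tensorflow as tf

    def make_dataset(file_paths):
        # Python generator: opens one `.npz` file at a time, so only the
        # current object is ever held in memory.
        def gen():
            for path in file_paths:
                with np.load(path) as data:
                    yield (data["features_1"], data["features_2"]), data["label"]

        return tf.data.Dataset.from_generator(
            gen,
            output_signature=(
                (tf.TensorSpec(shape=(None,), dtype=tf.float32),
                 tf.TensorSpec(shape=(None,), dtype=tf.float32)),
                tf.TensorSpec(shape=(), dtype=tf.float32),
            ),
        )
    ```

    The resulting dataset composes with the usual pipeline stages (`shuffle`, `batch`, `prefetch`), though note that a pure-Python generator runs single-threaded, so TFRecord remains the better option when input throughput matters.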