tensorflow machine-learning keras deep-learning tf.data.dataset

Zip two discrete datasets of different lengths using TensorFlow

I have two dummy image datasets, with three elements in the first and six elements in the second.

The image names in the 1st dataset are [1.png, 2.png, 3.png]

The image names in the 2nd dataset are [1_1.png, 1_2.png, 2_1.png, 2_2.png, 3_1.png, 3_2.png]

I'm trying to figure out how to zip these two datasets so that each image in the first dataset is paired with its matching images in the second: 1.png has to map to 1_1.png and 1_2.png, 2.png has to map to 2_1.png and 2_2.png, and so on. Is this possible? Here is the code I tried, but I really don't know how to do this.

Code

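import tensorflow as tf

# shuffle=False keeps the file order deterministic
X = tf.data.Dataset.list_files('D:/test/clear/*.png', shuffle=False)
Y = tf.data.Dataset.list_files('D:/test/haze/*.png', shuffle=False)

# zip pairs elements by position and stops at the end of the shorter dataset
paired = tf.data.Dataset.zip((X, Y))
for x in paired:
    print(x)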

Results

(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\1.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\1_1.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\2.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\1_2.png'>)

Results I want

(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\1.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\1_1.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\1.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\1_2.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\2.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\2_1.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\2.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\2_2.png'>)

Solution

(This is my first ever answer written on StackOverflow, so I hope that it will be clear (enough) and without too many formatting errors.)

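The easiest way I can think of right now is to duplicate each file name of X, so that it appears once for every matching file in Y (twice in this example). After that, a plain positional zip lines the two datasets up.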

These are the dummy filepath lists I used:

files_x = ["D:\\test\\clear\\1.png", "D:\\test\\clear\\2.png", "D:\\test\\clear\\3.png"]
files_y = ["D:\\test\\haze\\1_1.png", "D:\\test\\haze\\1_2.png", "D:\\test\\haze\\2_1.png", "D:\\test\\haze\\2_2.png", "D:\\test\\haze\\3_1.png", "D:\\test\\haze\\3_2.png"]

First, you create a dataset based on the list of file paths.

ds_files_x_dup = tf.data.Dataset.from_tensor_slices(files_x)

Then you can repeat the elements by applying tf.repeat to each element via the map function. This, however, turns each scalar file name into a vector holding two copies of it, so the repeated elements are grouped as one sample. To get a dataset with one file name per sample, you then have to apply flat_map, which splits each vector back into separate elements.

ds_files_x_dup = ds_files_x_dup.map(lambda x: tf.repeat(x,2))
ds_files_x_dup = ds_files_x_dup.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
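
To make the two steps easier to follow, here is a minimal sketch that materialises the intermediate and final datasets. The short file names are placeholders of my own, not the real paths. It also shows a possible one-step alternative (an assumption on my part, not part of the approach above) that repeats a one-element dataset inside flat_map instead of going through tf.repeat.

```python
import tensorflow as tf

# Placeholder file names (hypothetical); any dataset of scalar strings behaves the same.
ds = tf.data.Dataset.from_tensor_slices(["1.png", "2.png", "3.png"])

# Step 1: map + tf.repeat turns every scalar string into a vector of two copies.
grouped = ds.map(lambda x: tf.repeat(x, 2))
first_group = next(iter(grouped))  # a tensor holding two identical strings

# Step 2: flat_map splits each vector back into scalar elements.
flat = grouped.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))

# Alternative sketch: wrap each element in a one-element dataset and repeat it.
flat_alt = ds.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(2))

flat_list = list(flat.as_numpy_iterator())
flat_alt_list = list(flat_alt.as_numpy_iterator())
print(flat_list)
print(flat_alt_list)
```

Both variants should yield every file name twice in a row. The second one avoids the intermediate vector, which may read more naturally if the repeat count is the only thing you need.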

Now you just have to create the dataset based on files_y:

ds_files_y = tf.data.Dataset.from_tensor_slices(files_y)

And zip the two together:

paired = tf.data.Dataset.zip((ds_files_x_dup, ds_files_y))

The elements of paired are then:

(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\1.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\1_1.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\1.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\1_2.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\2.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\2_1.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\2.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\2_2.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\3.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\3_1.png'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\clear\\3.png'>, <tf.Tensor: shape=(), dtype=string, numpy=b'D:\\test\\haze\\3_2.png'>)
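
One caveat: the positional zip above only works if both lists are in matching order and every clear image has exactly two hazy counterparts. If the files are listed in plain string order, a name like 10.png sorts after 1.png, while 10_1.png sorts before 1_1.png, so the two lists could drift apart. As a hedged alternative, the sketch below pairs the files by name in plain Python before any dataset is built. The helper pair_by_prefix is hypothetical, and it assumes that the part of a hazy file name before the first underscore is the stem of its clear image.

```python
from pathlib import PureWindowsPath


def pair_by_prefix(clear_paths, haze_paths):
    """Pair each hazy file with the clear file whose stem matches its prefix.

    Hypothetical helper; assumes hazy names look like "<clear stem>_<index>.png".
    """
    # Map the stem of each clear file ("1" for "1.png") to its full path.
    clear_by_stem = {PureWindowsPath(p).stem: p for p in clear_paths}
    pairs = []
    for hazy in haze_paths:
        # "1_2.png" -> stem "1_2" -> prefix "1"
        prefix = PureWindowsPath(hazy).stem.split("_")[0]
        # Raises KeyError if a hazy file has no matching clear file.
        pairs.append((clear_by_stem[prefix], hazy))
    return pairs


files_x = ["D:\\test\\clear\\1.png", "D:\\test\\clear\\2.png", "D:\\test\\clear\\3.png"]
files_y = ["D:\\test\\haze\\1_1.png", "D:\\test\\haze\\1_2.png",
           "D:\\test\\haze\\2_1.png", "D:\\test\\haze\\2_2.png",
           "D:\\test\\haze\\3_1.png", "D:\\test\\haze\\3_2.png"]

pairs = pair_by_prefix(files_x, files_y)
paired_x = [clear for clear, _ in pairs]
paired_y = [hazy for _, hazy in pairs]

for clear, hazy in pairs:
    print(clear, hazy)
```

The two lists paired_x and paired_y have equal length, so they can be passed together to tf.data.Dataset.from_tensor_slices((paired_x, paired_y)) to get the same pairs without relying on the order or on a fixed repeat count.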