
TensorFlow dataset with multi-dimensional Tensors from a CSV file


Is there a way to load a TensorFlow dataset with a multi-dimensional feature tensor from a CSV (or other format) file, and if so, how?

For example, my CSV input looks like the following:

f1,  f2,  f3,                      label
0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1
0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0
0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1

I'd like to load a dataset from such a file, e.g.:

import tensorflow as tf

frames_csv_ds = tf.data.experimental.make_csv_dataset(
    'input.csv',
    header=False,
    column_names=['f1','f2','f3','label'],
    batch_size=5,
    label_name='label',
    num_epochs=1,
    ignore_errors=True,)

for batch, label in frames_csv_ds.take(1):
  for key, value in batch.items():
    print(f"{key:20s}: {value}")
  print()
  print(f"{'label':20s}: {label}")

I would like the batch to look like this:

f1 : [0.1   0.2   0.3  ]
f2 : [0.2   0.3   0.4  ]
f3 : [ [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]], [[0.2, 0.3, 0.4], [1.2, 1.3, 1.4]], [[0.3, 0.4, 0.5], [1.3, 1.4, 1.5]] ]
label : [1, 0, 1]

The snippet above is incomplete and doesn't work. Is there a way to get the dataset in the illustrated form? If so, can this also be done for arrays whose dimensions vary across the dataset?


Solution

  • You can do this by reading the file with tf.data.TextLineDataset and mapping your own parsing function over the lines:

    import tensorflow as tf
    
    file_path = "data.csv"
    # Read the file line by line and skip the header row
    dataset = tf.data.TextLineDataset(file_path).skip(1)
    
    def parse_csv_line(line):
      # Read all four columns as strings so that f3 can be handled separately
      fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
      
      f1 = tf.strings.to_number(fields[0], tf.float32)
      f2 = tf.strings.to_number(fields[1], tf.float32)
      # Split f3 on ";" and convert each piece to a float (a flat vector)
      f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
      label = tf.strings.to_number(fields[3], tf.int32)
      
      return {"f1": f1, "f2": f2, "f3": f3, "label": label}
    
    dataset = dataset.map(parse_csv_line).batch(5)
    
    # Inspect the first batch
    next(iter(dataset.take(1)))
    
    # Output:
    {'f1': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.1, 0.2, 0.3], dtype=float32)>,
     'f2': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.2, 0.3, 0.4], dtype=float32)>,
     'f3': <tf.Tensor: shape=(3, 6), dtype=float32, numpy=
     array([[0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
            [0.2, 0.3, 0.4, 1.2, 1.3, 1.4],
            [0.3, 0.4, 0.5, 1.3, 1.4, 1.5]], dtype=float32)>,
     'label': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 0, 1], dtype=int32)>}
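
The output above keeps f3 as a flat vector of six values per row and puts the label inside the feature dictionary. Here is a minimal sketch of one way to get closer to the layout asked for in the question, assuming every f3 field holds a multiple of three values: reshape the parsed vector into rows of three with tf.reshape, return a (features, label) pair, and batch with padded_batch so that rows with different numbers of triplets can share a batch. The in-memory `lines` list and its longer third row are hypothetical stand-ins for the file, and tf.strings.strip is added only to drop the spaces after the commas. Calling padded_batch without padded_shapes assumes a reasonably recent TensorFlow 2 release.

```python
import tensorflow as tf

# Hypothetical sample rows mirroring the question's file (header omitted).
# The last row has three triplets instead of two, to show the
# variable-length case.
lines = [
    "0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1",
    "0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0",
    "0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5;2.3;2.4;2.5, 1",
]

def parse_line(line):
    # Read every column as a string, then convert each one explicitly.
    fields = [tf.strings.strip(f)
              for f in tf.io.decode_csv(line, record_defaults=[[""]] * 4)]
    f1 = tf.strings.to_number(fields[0], tf.float32)
    f2 = tf.strings.to_number(fields[1], tf.float32)
    # Split f3 on ";" into a flat vector, then fold it into rows of three.
    # Assumption: the number of values is always a multiple of three.
    flat = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
    f3 = tf.reshape(flat, [-1, 3])
    label = tf.strings.to_number(fields[3], tf.int32)
    # Return (features, label), the structure Keras expects for training.
    return {"f1": f1, "f2": f2, "f3": f3}, label

dataset = tf.data.Dataset.from_tensor_slices(lines).map(parse_line)

# padded_batch pads each f3 with zeros up to the longest one in the batch.
batched = dataset.padded_batch(5)

features, labels = next(iter(batched))
print(features["f3"].shape)  # (3, 3, 3)
print(labels.numpy())        # [1 0 1]
```

With the real file, `tf.data.TextLineDataset(file_path).skip(1)` would replace the `from_tensor_slices(lines)` call. Zero padding is only one choice here. If the padded rows must be distinguishable from real data, batching into ragged tensors (for example with tf.data.experimental.dense_to_ragged_batch) is an alternative worth checking against your TensorFlow version.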