pytorch scaling iterable pytorch-lightning pytorch-dataloader

How to apply min-max scaling on an IterableDataset?


I'm using an IterableDataset because I have a massive amount of data. Since an IterableDataset does not hold all of the data in memory, I cannot directly compute the min and max over the entire dataset before training, yet min-max scaling needs the minimum and maximum x values observed in the data. How would you apply min-max scaling in this case?

I'm unsure how to solve this, since I really do need to scale the data.


Solution

  • You'll have to iterate over the dataset once to compute the min/max values, as a preprocessing step before training. Update the running min and max as you iterate, then save them so that later runs can scale each sample without another full pass.

    For datasets too large to fit in memory, it can help to use a library like Hugging Face datasets, which uses Apache Arrow as its backend. This lets you work with the full dataset without loading all of it into memory.
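The single-pass approach described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes each sample is a flat sequence of numeric features, and the names `running_min_max`, `save_stats`, `load_stats`, `min_max_scale` and `stream` are hypothetical helpers introduced here. It uses plain Python so it runs on any iterable, including an IterableDataset that yields 1-D samples. With tensors, the element-wise updates would more naturally use `torch.minimum` and `torch.maximum`.

```python
import json
from typing import Iterable, Iterator, List, Sequence, Tuple


def running_min_max(rows: Iterable[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Make one pass over `rows`, tracking the per-feature min and max seen so far."""
    mins: List[float] = []
    maxs: List[float] = []
    for row in rows:
        values = [float(v) for v in row]
        if not mins:  # the first row initialises both trackers
            mins, maxs = list(values), list(values)
            continue
        mins = [min(m, v) for m, v in zip(mins, values)]
        maxs = [max(m, v) for m, v in zip(maxs, values)]
    if not mins:
        raise ValueError("cannot compute min/max of an empty stream")
    return mins, maxs


def save_stats(path: str, mins: List[float], maxs: List[float]) -> None:
    """Persist the statistics so later training runs can skip the full pass."""
    with open(path, "w") as f:
        json.dump({"min": mins, "max": maxs}, f)


def load_stats(path: str) -> Tuple[List[float], List[float]]:
    """Read back statistics written by `save_stats`."""
    with open(path) as f:
        stats = json.load(f)
    return stats["min"], stats["max"]


def min_max_scale(
    rows: Iterable[Sequence[float]], mins: List[float], maxs: List[float]
) -> Iterator[List[float]]:
    """Lazily rescale each row to [0, 1] using precomputed statistics."""
    # A constant feature has max == min; fall back to 1.0 to avoid dividing by zero.
    ranges = [(hi - lo) or 1.0 for lo, hi in zip(mins, maxs)]
    for row in rows:
        yield [(float(v) - lo) / r for v, lo, r in zip(row, mins, ranges)]


def stream() -> Iterator[List[float]]:
    """Hypothetical stand-in for a dataset that can only be iterated."""
    yield from ([1.0, 10.0], [3.0, 30.0], [2.0, 20.0])


mins, maxs = running_min_max(stream())               # pass 1: statistics only
scaled = list(min_max_scale(stream(), mins, maxs))   # later passes: scale lazily
```

In a real pipeline the scaling step would typically sit inside the dataset's `__iter__`, after loading the saved statistics once at construction time, so that each sample is scaled as it is streamed.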