huggingface-datasets

Can I convert an `IterableDataset` to a `Dataset`?


I want to load a large dataset, apply some transformations to some fields, sample a small subset of the results, and save it to files so I can later load it from there.

Basically something like this:

    ds = datasets.load_dataset("XYZ", name="ABC", split="train", streaming=True)
    ds = ds.map(_transform_record)
    ds.shuffle()[:N].save_to_disk(...)

IterableDataset doesn't have a save_to_disk() method. That makes sense, since it's backed by an iterator, but I'd expect some way to convert an iterable dataset to a regular one (by iterating over all of it and storing the result in memory or on disk, nothing too fancy).

I tried Dataset.from_generator() with the IterableDataset as the generator (iter(ds)), but that doesn't work because it tries to serialize the generator object.

Is there an easy way to do this, like to_iterable_dataset() but in reverse?


Solution

  • You must cache an IterableDataset to disk to load it as a Dataset. One way to do this is with Dataset.from_generator:

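    from functools import partial
    from datasets import Dataset
    
    def gen_from_iterable_dataset(iterable_ds):
        # Each call starts a fresh pass over the streaming dataset.
        yield from iterable_ds
    
    # iterable_ds is the IterableDataset to materialize,
    # e.g. ds.shuffle().take(N) for the N-record sample in the question.
    ds = Dataset.from_generator(
        partial(gen_from_iterable_dataset, iterable_ds),
        features=iterable_ds.features,
    )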
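As for why the wrapper works where iter(ds) does not: the likely cause is the serialization issue mentioned in the question. A live generator object cannot be pickled, whereas a module-level generator function bound to its argument with partial usually can. The sketch below illustrates the difference using only the standard library. It assumes that stdlib pickle is a fair stand-in for the library's own serialization, and it uses a plain list (the hypothetical `records`) in place of a real IterableDataset; `gen_from_iterable` and `is_picklable` are illustrative names, not part of datasets.

```python
import pickle
from functools import partial


def gen_from_iterable(iterable):
    # Same shape as the wrapper in the answer: a plain module-level generator function.
    yield from iterable


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Hypothetical stand-in for an IterableDataset: any re-iterable, picklable object.
records = [{"id": 0}, {"id": 1}, {"id": 2}]

# Analogous to passing iter(ds): an already-started generator object.
live_generator = gen_from_iterable(records)

# Analogous to the answer's partial(...): a callable that creates a new generator.
factory = partial(gen_from_iterable, records)

print("generator object picklable:", is_picklable(live_generator))
print("partial factory picklable:", is_picklable(factory))

# The factory can be called repeatedly; each call starts a fresh pass over the data.
first_pass = list(factory())
second_pass = list(factory())
```

The factory form also matters for a second reason: from_generator expects a callable it can invoke to obtain the records, not a half-consumed iterator.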