python, huggingface-datasets

Is there any way to download only a part of a dataset from Hugging Face?


I am trying to fine-tune a facebook/wav2vec2 model for Automatic Speech Recognition (ASR) on the Common Voice dataset, but I ran into an issue: my disk does not have enough space to hold this large 256 GB dataset.

Then I tried split slicing with datasets.load_dataset("mozilla-foundation/common_voice_16_0", "en", split="train[:20%]"), but instead of downloading only 20 percent of the dataset, the whole dataset was downloaded anyway.

I found a similar issue, where Mohammad replied that the whole dataset is downloaded regardless of the slicing argument, but his answer wasn't confirmed, so I want to ask: is it really the case that we cannot download only a part of a dataset?


Solution

  • datasets works by downloading the full dataset first; a slice expression like train[:20%] is applied only after the download completes, so it does not reduce disk usage.

    To fetch only the data you need, consider streaming instead:

    from datasets import load_dataset
    
    num_samples_to_take = 100
    dataset_name = "mozilla-foundation/common_voice_16_0"
    
    # streaming=True returns an IterableDataset that fetches examples
    # lazily over the network instead of downloading the dataset to disk
    ds = load_dataset(dataset_name, "en", split="train", streaming=True)
    
    # take() limits the stream to the first num_samples_to_take examples
    ds = ds.take(num_samples_to_take)
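
    The take()/skip() semantics of a streamed dataset can be tried offline with a small in-memory dataset. The sketch below uses toy data (not Common Voice, which is large and gated) and assumes a recent datasets version that provides Dataset.to_iterable_dataset():

    # a minimal sketch of streaming semantics on toy data
    from datasets import Dataset
    
    # small in-memory dataset standing in for a real remote one
    ds = Dataset.from_dict({"id": list(range(10))})
    
    # convert to an IterableDataset, the same type load_dataset(...,
    # streaming=True) returns, so take()/skip() behave identically
    streamed = ds.to_iterable_dataset()
    
    # take(3) yields only the first 3 examples; skip(8) drops the first 8
    first_three = [row["id"] for row in streamed.take(3)]
    last_two = [row["id"] for row in streamed.skip(8)]

    Iterating the result of take() stops after the requested number of examples, so with a real streamed dataset only those examples are ever fetched.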