I am trying to fine-tune a facebook/wav2vec2 model for Automatic Speech Recognition (ASR) on the Common Voice dataset, but I ran into an issue: my disk space is not enough to hold this large 256 GB dataset.
Then I tried dataset splitting and slicing with datasets.load_dataset("mozilla-foundation/common_voice_16_0", "en", split="train[:20%]"), but instead of downloading 20 percent of the dataset, the whole dataset was downloaded anyway.
I found a similar issue, where Mohammad replied that the whole dataset will be downloaded regardless of the slicing argument, but his answer wasn't confirmed, so I just want to ask: is it really the case that we cannot download only a part of a dataset?
The datasets library works by downloading a dataset completely and then returning the requested slice, so split="train[:20%]" still fetches everything. To download only part of the data, you can use streaming instead:
from datasets import load_dataset

dataset_name = "mozilla-foundation/common_voice_16_0"
num_samples_to_take = 100

# streaming=True returns an IterableDataset that fetches samples on the fly
# instead of downloading the whole dataset to disk first
ds = load_dataset(dataset_name, "en", split="train", streaming=True)

# take() limits the stream to the first 100 samples
ds = ds.take(num_samples_to_take)
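To see why this avoids the full download: a streaming dataset is just a lazy iterator, so take() only ever pulls the records it needs. Here is a minimal stdlib-only sketch of that pattern, with a hypothetical generator standing in for the remote dataset (the names remote_samples and the record fields are illustrative, not part of the datasets API):

```python
from itertools import islice

def remote_samples():
    """Hypothetical stand-in for a remote streamed dataset: records are
    yielded lazily, so nothing is produced until it is requested."""
    for i in range(1_000_000):  # pretend this is the full 256 GB corpus
        yield {"id": i, "sentence": f"utterance {i}"}

# The rough equivalent of load_dataset(..., streaming=True).take(100):
# only the first 100 records are ever generated, the rest are never touched.
first_100 = list(islice(remote_samples(), 100))
print(len(first_100))      # 100
print(first_100[0]["id"])  # 0
```

The taken IterableDataset behaves the same way: you consume it with a plain for loop (e.g. for sample in ds: ...), and each sample, including its audio, is downloaded only when the loop reaches it.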