python, nlp, huggingface, huggingface-datasets

Error when calling Hugging Face load_dataset("glue", "mrpc")


I'm following the Hugging Face tutorial here, and it's giving me a strange error. When I run the following code:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

raw_datasets = load_dataset("glue", "mrpc")

Here is what I see:

Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 151k/151k [00:00<00:00, 3.35MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 6.63MB/s]
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:32<00:00, 10.89s/it]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 127.92it/s]
Traceback (most recent call last):
  File "/Users/ameenizhac/Downloads/transformers_playground.py", line 5, in <module>
    raw_datasets = load_dataset("glue", "mrpc")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/builder.py", line 1709, in _prepare_split
    split_info = self.info.splits[split_generator.name]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/splits.py", line 530, in __getitem__
    instructions = make_file_instructions(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/arrow_reader.py", line 112, in make_file_instructions
    name2filenames = {
                     ^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/arrow_reader.py", line 113, in <dictcomp>
    info.name: filenames_for_dataset_split(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/naming.py", line 70, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/datasets/naming.py", line 54, in filename_prefix_for_split
    if os.path.basename(name) != name:
       ^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 142, in basename
TypeError: expected str, bytes or os.PathLike object, not NoneType

I don't know where to start debugging this, because I can't tell from the traceback where the error is actually coming from.


Solution

  • I tried this on both my PC and on Google Colab. Strangely, it works on Colab but not on my PC.
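
    That discrepancy points to an environment difference rather than a problem with the code itself. As a first step (an assumption on my part, not something I verified on the asker's machine), it may be worth upgrading the library with pip install --upgrade datasets and then forcing a fresh download to rule out a corrupted local cache; download_mode="force_redownload" is a standard load_dataset argument:

    from datasets import load_dataset

    # Assumption: the failure comes from a stale/corrupted local cache or an
    # outdated `datasets` version. After upgrading the library, force a
    # fresh download so the cached (possibly broken) copy is ignored:
    raw_datasets = load_dataset("glue", "mrpc", download_mode="force_redownload")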

    Anyway, a possible workaround is the following:

    raw_datasets = load_dataset("SetFit/mrpc")
    

    If you print it, you will see that it is essentially the same dataset under a different name (the column names differ slightly; see the note after the output below):

    DatasetDict({
        train: Dataset({
            features: ['text1', 'text2', 'label', 'idx', 'label_text'],
            num_rows: 3668
        })
        test: Dataset({
            features: ['text1', 'text2', 'label', 'idx', 'label_text'],
            num_rows: 1725
        })
        validation: Dataset({
            features: ['text1', 'text2', 'label', 'idx', 'label_text'],
            num_rows: 408
        })
    })
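
    One caveat: the column names are not identical. SetFit/mrpc uses text1/text2, while the glue/mrpc dataset used in the tutorial names these columns sentence1/sentence2, so tutorial code that indexes those columns will fail. A minimal sketch of the rename (assuming you want the tutorial's column names; rename_column works on both Dataset and DatasetDict):

    from datasets import load_dataset

    raw_datasets = load_dataset("SetFit/mrpc")

    # Map the SetFit column names back to the ones the tutorial expects
    raw_datasets = raw_datasets.rename_column("text1", "sentence1")
    raw_datasets = raw_datasets.rename_column("text2", "sentence2")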