What file structure should I use on the Hugging Face Hub, if I have a /train.zip
archive with PNG image files and an /metadata.csv
file with annotations for them, so that the parquet-converter bot can automatically recognize and correctly interpret this dataset?
An example of the desired result
But regardless of which file structure I use,
https://huggingface.co/datasets/james-r/so-invalid-image-archive-with-metadata-1
/train.zip
/metadata.csv
or
/train/train.zip
/metadata.csv
I get an exception:
Cannot load the dataset split (in streaming mode) to extract the first rows.
Error code: StreamingRowsError
Exception: ValueError
Message: One or several metadata.csv were found, but not in the same directory or in a parent directory of zip://1.png::hf://datasets/[user]/[repo-name]@[hash]/train/train.zip.
Traceback: Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 322, in compute
compute_first_rows_from_parquet_response(
File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 88, in compute_first_rows_from_parquet_response
rows_index = indexer.get_rows_index(
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 640, in get_rows_index
return RowsIndex(
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 521, in __init__
self.parquet_index = self._init_parquet_index(
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 538, in _init_parquet_index
response = get_previous_step_or_raise(
File "/src/libs/libcommon/src/libcommon/simple_cache.py", line 590, in get_previous_step_or_raise
raise CachedArtifactError(
libcommon.simple_cache.CachedArtifactError: The previous step failed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/services/worker/src/worker/utils.py", line 96, in get_rows_or_raise
return get_rows(
File "/src/libs/libcommon/src/libcommon/utils.py", line 197, in decorator
return func(*args, **kwargs)
File "/src/services/worker/src/worker/utils.py", line 73, in get_rows
rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1389, in __iter__
for key, example in ex_iterable:
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 234, in __iter__
yield from self.generate_examples_fn(**self.kwargs)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/folder_based_builder/folder_based_builder.py", line 376, in _generate_examples
raise ValueError(
ValueError: One or several metadata.csv were found, but not in the same directory or in a parent directory of zip://1.png::hf://datasets/[user]/[repo-name]@[hash]/train/train.zip.
What am I doing wrong?
It seems this is an issue with the datasets
package.
The workaround for this problem is to convert the metadata.csv
file to metadata.jsonl
format.