I would like to read an S3 directory containing multiple Parquet files with the same schema. The code works without the proxy, but as soon as I enable the proxy I get the following error:
Traceback (most recent call last):
File "script.py", line 158, in <module>
df = pq.read_table(source=bucket_path, filesystem=s3).to_pandas()
File "pyarrow\parquet\__init__.py", line 2737, in read_table
dataset = _ParquetDatasetV2(
File "\pyarrow\parquet\__init__.py", line 2351, in __init__
self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
File "pyarrow\dataset.py", line 694, in dataset
return _filesystem_dataset(source, **kwargs)
File "pyarrow\dataset.py", line 447, in _filesystem_dataset
factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
File "pyarrow\_dataset.pyx", line 2031, in pyarrow._dataset.FileSystemDatasetFactory.__init__
File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 's3://test/files/part-00000-ed788628-0a6d-4ce9-b604-dd4c6ec75b6d-c000.snappy.parquet', which is outside base dir 's3://test/files/'
Here is the code. I commented out the other approach I tried:
import os
import pyarrow.parquet as pq
import s3fs
bucket_path = 's3://test/files/'
os.environ['https_proxy'] = 'http://proxy.com:4200'
# proxies = {
# 'https': 'http://proxy.com:4200',
# 'http': 'http://proxy.com:4200'
# }
# s3 = s3fs.S3FileSystem(anon=False, config_kwargs={'proxies': proxies})
s3 = s3fs.S3FileSystem(anon=False)
df = pq.read_table(source=bucket_path, filesystem=s3).to_pandas()
I couldn't find anyone with the same problem. Any help is welcome.
Thank you in advance.
Replace:
bucket_path = 's3://test/files/'
with:
bucket_path = '/test/files/'
pyarrow passes the path to the given filesystem instance, and according to the fsspec documentation, a path handed to a filesystem instance should not include a scheme.
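If the path arrives as a full URI from elsewhere, the scheme can be stripped before the call instead of editing the string by hand. This is only a minimal sketch: strip_scheme is a hypothetical helper, not part of pyarrow or s3fs, and it returns the bucket/key form 'test/files/' without the leading slash, which I am assuming s3fs treats the same as '/test/files/'.

```python
def strip_scheme(path: str) -> str:
    """Hypothetical helper: drop a leading 'scheme://' prefix, if present."""
    scheme, sep, rest = path.partition("://")
    # If there is no '://' separator, the path has no scheme; return it unchanged.
    return rest if sep else path


bucket_path = strip_scheme("s3://test/files/")
print(bucket_path)

# The result would then be passed together with the filesystem instance,
# as in the question (not executed here, since it needs S3 access):
# df = pq.read_table(source=bucket_path, filesystem=s3).to_pandas()
```

The point is the same as above: the scheme is already implied by the filesystem object, so the path only needs the bucket and key prefix.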