I have to run some tests in different environments. In the tests I need to check some directories in S3 for Parquet files and load them into a dictionary, like this:
import pyarrow.parquet as pq
import s3fs
from boto3.session import Session

env = 'dev'
aws_profile = {'dev': 'dev_aws_profile', 'qa': 'qa_aws_profile'}

def get_dictionary_from_parquet(file_name):
    # filesystem for pyarrow to read through (uses the default profile)
    fs = s3fs.S3FileSystem()
    # boto3 session under the environment's profile, used only for listing
    pq_session = Session(profile_name=aws_profile[env])
    s3 = pq_session.resource('s3')
    parquet_bucket = s3.Bucket(f'valid-bucket-name-{env}')
    paths = []
    for pq_file in parquet_bucket.objects.filter(Prefix=f'valid-prefix-{env}'):
        if pq_file.key.endswith(file_name):
            paths.append(f's3://{pq_file.bucket_name}/{pq_file.key}')
    data_set = pq.ParquetDataset(paths, filesystem=fs)
    tbl = data_set.read()
    pq_dictionary = tbl.to_pydict()
    return pq_dictionary
It works perfectly when aws_profile[env] is the default profile in the AWS credentials file, but otherwise it returns:
line 14, in get_dictionary_from_parquet
data_set = pq.ParquetDataset(paths, filesystem=fs)
File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
open_file_func=partial(_open_dataset_file, self._metadata)
File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1365, in _make_manifest
.format(path))
OSError: Passed non-file path: s3://<valid path to parquet file>
How do I pass the AWS profile credentials through to pyarrow to fix this?
It is odd that you configure the profile and do your file filtering on a boto3 object, while using the s3fs instance to specify the filesystem when reading. I recommend using s3fs for both.
The following will fix it:

    fs = s3fs.S3FileSystem(profile=aws_profile[env])
but I would suggest that you use the same instance to do your file listing too:

    paths = fs.glob(f"valid-bucket-name-{env}/valid-prefix-{env}/*/{file_name}")

(or whatever the right glob pattern is - I had trouble parsing your code; note that file_name must be interpolated with braces inside the f-string).
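
If it helps, here is a minimal sketch of the whole function rewritten to use a single s3fs instance for both listing and reading. It assumes the same bucket and prefix naming as in your question, and a recent s3fs where the keyword is profile (older releases called it profile_name). Instead of guessing a glob pattern, it uses fs.find(), which walks the prefix recursively and so mirrors your objects.filter() loop:

    import pyarrow.parquet as pq
    import s3fs

    env = 'dev'
    aws_profile = {'dev': 'dev_aws_profile', 'qa': 'qa_aws_profile'}

    def get_dictionary_from_parquet(file_name):
        # one filesystem, bound to the right profile, for listing and reading
        fs = s3fs.S3FileSystem(profile=aws_profile[env])
        # recursively list everything under the prefix and keep matching keys;
        # s3fs returns 'bucket/key' paths, which ParquetDataset accepts here
        paths = [
            p for p in fs.find(f'valid-bucket-name-{env}/valid-prefix-{env}')
            if p.endswith(file_name)
        ]
        data_set = pq.ParquetDataset(paths, filesystem=fs)
        return data_set.read().to_pydict()

Because pyarrow now reads through the profile-aware filesystem, the OSError about a non-file path should go away.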