h5py supports the native S3 driver for HDF5 (the ros3 driver). We've enabled this with a local build of HDF5.
>>> import h5py
>>>
>>> print(f'Registered drivers: {h5py.registered_drivers()}')
Registered drivers: frozenset({ 'ros3', 'sec2', 'fileobj', 'core', 'family', 'split', 'stdio', 'mpio'})
We run our own Ceph/S3 service (not AWS), which has a custom endpoint.
Is there a way to specify the S3 endpoint?
The documentation here makes no mention of it.
If we run the following, we get a generic error that we presume is caused by the missing endpoint.
>>> h5py.File(
f,
driver='ros3',
aws_region=bytes('unused', 'utf-8'), # unused by our S3 but required
secret_id=bytes(access_key, 'utf-8'),
secret_key=bytes(secret_key, 'utf-8')
)
Traceback (most recent call last):
File "mycode.py", line 32, in <module>
with h5py.File(f, driver='ros3', aws_region=b'west', secret_id=access_key, secret_key=secret_key) as fh5:
File "opt/anaconda3/envs/smart_open_env/lib/python3.9/site-packages/h5py/_hl/files.py", line 567, in __init__
fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
File "opt/anaconda3/envs/smart_open_env/lib/python3.9/site-packages/h5py/_hl/files.py", line 231, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (curl cannot perform request)
I checked that the commonly used environment variables ENDPOINT and ENDPOINT_URL were set, but they had no effect.
The first thing to do is check the HDF5 error stack to diagnose the error. You can get it by adding this line to your code: h5py._errors.unsilence_errors().
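As a rough sketch of where that line goes: call it once before opening the file, so the full HDF5 error stack is printed if the open fails. The helper name open_ros3_verbose is hypothetical, and the keyword arguments are simply whatever you were already passing to h5py.File.

```python
def open_ros3_verbose(url, **ros3_kwargs):
    """Open a file with the ros3 driver after enabling the HDF5 error stack.

    Hypothetical helper for illustration; `ros3_kwargs` are the keywords
    you already pass (aws_region, secret_id, secret_key).
    """
    # Imported here so the helper can be defined even where h5py is absent.
    import h5py

    # Print the full HDF5 error stack instead of only the short OSError text.
    h5py._errors.unsilence_errors()

    return h5py.File(url, "r", driver="ros3", **ros3_kwargs)
```

The extra output usually shows which HDF5 or curl step failed, which is more useful than "curl cannot perform request" on its own.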
If that doesn't help, here is some background on the ros3 driver. It was added in h5py 3.7 (May 2022). The initial implementation had a logic error with the aws_region, secret_id and secret_key keywords. In 3.7, AWS authentication is enabled if any of the keywords is set (not blank), when it should only be enabled if all 3 are set. The 3.7 logic gets tripped up if you set only 1 or 2 of them, and you get an error message when you try to open the file. So you need to set either ALL 3 values (for AWS authentication) or NONE of them (e.g. use b'' for all 3). Setting none of them assumes you want anonymous access, without authentication, to read public files.
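One way to guard against the partial-credentials case is to build the three keywords in one place and refuse anything in between. This is only a sketch; ros3_auth_kwargs is a hypothetical helper name, not part of h5py, and it encodes to bytes because that is how the call in the question passes the values.

```python
def ros3_auth_kwargs(aws_region="", secret_id="", secret_key=""):
    """Return the three ros3 keywords, either all set or all blank.

    Hypothetical helper: raises ValueError if only 1 or 2 values are given.
    """
    values = {
        "aws_region": aws_region,
        "secret_id": secret_id,
        "secret_key": secret_key,
    }
    provided = [name for name, value in values.items() if value]

    # Anything between "none" and "all three" is the case that trips up 3.7.
    if provided and len(provided) != len(values):
        missing = sorted(set(values) - set(provided))
        raise ValueError(
            f"set all three ros3 credentials or none; missing: {missing}"
        )

    # Encode to bytes, matching the call shown in the question.
    return {name: value.encode("utf-8") for name, value in values.items()}
```

The result can be unpacked into the call, for example h5py.File(url, driver='ros3', **ros3_auth_kwargs(region, key_id, key)). Calling it with no arguments gives b'' for all 3, which is the anonymous case.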
Note: The logic error is fixed in 3.8 (January, 2023).
In addition, 3.8 adds support for S3 URLs like this: f's3://{endpoint}/{bucket}/{path/file}'. In 3.7 you would have to do something like f'https://{endpoint}/{bucket}/{path/file}'.
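For a custom endpoint, the https form can be assembled with a small helper. This is a minimal sketch under a few assumptions: ros3_https_url is a hypothetical name, the endpoint is given as a bare host (optionally with a port) without a scheme, and your service accepts path-style URLs with the bucket as the first path segment.

```python
from urllib.parse import quote


def ros3_https_url(endpoint, bucket, key):
    """Build an https://{endpoint}/{bucket}/{key} URL for the ros3 driver.

    Hypothetical helper; assumes `endpoint` is a bare host[:port] and that
    the service uses path-style addressing.
    """
    endpoint = endpoint.strip("/")
    key = key.lstrip("/")
    # quote() keeps "/" by default, so nested keys stay intact while
    # spaces and other unsafe characters are percent-encoded.
    return f"https://{endpoint}/{quote(bucket)}/{quote(key)}"
```

The returned string is what you would pass as the file name to h5py.File together with driver='ros3'.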