
Custom endpoint for `ros3` driver in `h5py`


h5py supports the native S3 driver for HDF5 (the ros3 driver). We've enabled this with a local build of HDF5.

>>> import h5py
>>>
>>> print(f'Registered drivers: {h5py.registered_drivers()}')

Registered drivers: frozenset({ 'ros3', 'sec2', 'fileobj', 'core', 'family', 'split', 'stdio', 'mpio'})

Our S3 service is not AWS: we run Ceph with an S3 interface, behind a custom endpoint.

Is there a way to specify the S3 endpoint?

The documentation here makes no mention of it.

When we run the following, we get a generic error that we presume is caused by the missing endpoint.

>>> h5py.File(
  f, 
  driver='ros3', 
  aws_region=bytes('unused', 'utf-8'),  # unused by our S3 but required
  secret_id=bytes(access_key, 'utf-8'), 
  secret_key=bytes(secret_key, 'utf-8')
)

Traceback (most recent call last):
  File "mycode.py", line 32, in <module>
    with h5py.File(f, driver='ros3', aws_region=b'west', secret_id=access_key, secret_key=secret_key) as fh5:
  File "opt/anaconda3/envs/smart_open_env/lib/python3.9/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "opt/anaconda3/envs/smart_open_env/lib/python3.9/site-packages/h5py/_hl/files.py", line 231, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (curl cannot perform request)

I checked that the commonly used environment variables ENDPOINT and ENDPOINT_URL were set, but they had no effect.


Solution

  • First, check the HDF5 error stack to diagnose the error. You can enable it by adding this line to your code: h5py._errors.unsilence_errors().

    If that doesn't help, here is some background on the ros3 driver. It was added to h5py 3.7 (May 2022). The initial implementation had a logic error in how it handled the aws_region, secret_id and secret_key keywords: in 3.7, AWS authentication is enabled if any of the three keywords is set (not blank), when it should be enabled only if all 3 are set. If you set only 1 or 2 of them, you get an error message when you try to open the file. So you need to set either ALL 3 values (for AWS authentication) or NONE of them (e.g. pass b'' for all 3). Setting none assumes you want anonymous access, without authentication, to read public files.

    Note: The logic error is fixed in 3.8 (January 2023).
    In addition, 3.8 adds support for S3 URLs like this: f's3://{endpoint}/{bucket}/{path/file}'. In 3.7 you have to use something like f'https://{endpoint}/{bucket}/{path/file}' instead.
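
The advice above can be sketched as a few small helpers. This is only an illustration under stated assumptions: the helper names (ros3_credentials, object_url, open_ros3) and the endpoint ceph.example.org are hypothetical and not part of h5py. The only h5py calls used are the ones already shown in this thread: h5py.File with driver='ros3' and the three keywords, and h5py._errors.unsilence_errors(). It does not confirm that the driver will accept a non-AWS endpoint.

```python
def ros3_credentials(aws_region="", secret_id="", secret_key=""):
    """Return the three ros3 keywords as bytes, enforcing all-or-none.

    Hypothetical helper: it rejects the "only 1 or 2 values set" case
    that trips up h5py 3.7 before the file is ever opened.
    """
    values = {
        "aws_region": aws_region,
        "secret_id": secret_id,
        "secret_key": secret_key,
    }
    n_set = sum(1 for v in values.values() if v)
    if n_set not in (0, len(values)):
        raise ValueError(
            "set all of aws_region, secret_id and secret_key, or none of them"
        )
    return {name: v.encode("utf-8") for name, v in values.items()}


def object_url(endpoint, bucket, key, scheme="https"):
    """Build '<scheme>://<endpoint>/<bucket>/<key>' (hypothetical helper)."""
    return f"{scheme}://{endpoint}/{bucket}/{key.lstrip('/')}"


def open_ros3(url, **credentials):
    """Open a remote file read-only with the ros3 driver.

    h5py is imported lazily because it must be built against an HDF5
    library that has the ros3 driver enabled.
    """
    import h5py

    # Show the full HDF5 error stack instead of a one-line OSError.
    h5py._errors.unsilence_errors()
    return h5py.File(url, "r", driver="ros3", **ros3_credentials(**credentials))
```

A call might then look like open_ros3(object_url('ceph.example.org', 'bucket', 'path/file.h5'), aws_region='unused', secret_id=access_key, secret_key=secret_key), or with no credentials at all for anonymous access. On 3.8, per the note above, scheme='s3' could be passed to object_url instead of the default 'https'.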