Tags: python, python-xarray, zarr, python-s3fs

Zarr: improve xarray writing performance to S3


Writing xarray datasets to AWS S3 takes a surprisingly long time, even when no data is actually written (using compute=False).

Here's an example:

import fsspec
import xarray as xr

# Small tutorial dataset
x = xr.tutorial.open_dataset("rasm")

# Mutable-mapping view of the S3 location, then a lazy write
target = fsspec.get_mapper("s3://bucket/target.zarr")
task = x.to_zarr(target, compute=False)  # writes metadata only; chunk writes are deferred

Even without computing the result, to_zarr takes around 6 seconds when run from an EC2 instance in the same region as the S3 bucket.
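
The timing can be reproduced with a plain stopwatch around the call (a minimal sketch, reusing x and target from the example above):

import time

t0 = time.perf_counter()
task = x.to_zarr(target, compute=False)
print(f"to_zarr(compute=False) took {time.perf_counter() - t0:.1f} s")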

Looking at the debug logs, there seems to be quite a bit of redirecting going on, as the default region in aiobotocore is set to us-east-2 while the bucket is in eu-central-1.
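
(For reference, one way to surface those redirects yourself: botocore, which aiobotocore builds on, logs request and response details at DEBUG level, so stdlib logging is enough. A minimal sketch:)

import logging

# botocore/aiobotocore emit request/response details, including
# region redirects, at DEBUG level on these loggers.
logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("aiobotocore").setLevel(logging.DEBUG)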

If I first set the default region manually via an environment variable with

import os

os.environ['AWS_DEFAULT_REGION'] = 'eu-central-1'

then the required time drops to around 3.5 seconds.

So my questions are:

  1. Is there any way to pass the region to fsspec (or s3fs)? I've tried adding s3_additional_kwargs={"region": "eu-central-1"} to the get_mapper call, but that didn't do anything.

  2. Is there any better way to interface with zarr on S3 from xarray than the above (with fsspec)?


versions:

xarray: 0.17.0
zarr: 2.6.1
fsspec: 0.8.4

Solution

  • The s3fs documentation shows region_name being passed via client_kwargs (the keyword arguments forwarded to the underlying aiobotocore client); there is also a related fsspec issue about specifying the region.

    So you can pass something like client_kwargs={'region_name': 'eu-central-1'} to get_mapper:

    fsspec.get_mapper("s3://bucket/target.zarr",
                      client_kwargs={'region_name': 'eu-central-1'})
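
    Equivalently, you can build the S3 filesystem yourself and wrap it in a mapper, which keeps the region setting explicit (a sketch; the bucket and path are placeholders, and S3Map is the mapper class s3fs provides):

    import s3fs

    # client_kwargs are forwarded to the aiobotocore client, so the
    # connection targets the bucket's region from the first request.
    fs = s3fs.S3FileSystem(client_kwargs={'region_name': 'eu-central-1'})
    target = s3fs.S3Map(root='bucket/target.zarr', s3=fs)
    task = x.to_zarr(target, compute=False)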
    

    Also, zarr is a popular choice for storing huge datasets.
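
    As for the second question: newer xarray versions (more recent than the 0.17.0 listed above) accept an S3 URL plus a storage_options dict directly in to_zarr, so you don't need to build the mapper yourself. A sketch under that assumption:

    x.to_zarr(
        "s3://bucket/target.zarr",
        compute=False,
        storage_options={"client_kwargs": {"region_name": "eu-central-1"}},
    )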