Writing xarray datasets to AWS S3 takes a surprisingly long time, even when no data is actually written with compute=False.
Here's an example:
import fsspec
import xarray as xr

# Open a small tutorial dataset
x = xr.tutorial.open_dataset("rasm")

# Build a key-value mapper over the S3 location
target = fsspec.get_mapper("s3://bucket/target.zarr")

# compute=False writes only the metadata and returns a delayed task
task = x.to_zarr(target, compute=False)
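To reproduce the timing, one option is to wrap the to_zarr call in a simple stopwatch; this is a minimal sketch using only the standard library (the bucket name is a placeholder):

import time

import fsspec
import xarray as xr

x = xr.tutorial.open_dataset("rasm")
target = fsspec.get_mapper("s3://bucket/target.zarr")

t0 = time.perf_counter()
task = x.to_zarr(target, compute=False)  # metadata writes only
print(f"to_zarr took {time.perf_counter() - t0:.1f} s")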
Even without actually computing anything, to_zarr takes around 6 seconds from an EC2 instance in the same region as the S3 bucket. Looking at the debug logs, there seems to be quite a bit of redirecting going on, as the default region in aiobotocore is set to us-east-2 while the bucket is in eu-central-1.
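To see those redirects yourself, turning on debug logging before the mapper is created is enough; a minimal sketch using the standard library's logging module (the bucket name is a placeholder):

import logging

import fsspec

# With the root logger at DEBUG, botocore/aiobotocore log each request,
# including the redirect responses sent when the wrong regional endpoint is used
logging.basicConfig(level=logging.DEBUG)

target = fsspec.get_mapper("s3://bucket/target.zarr")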
If I first manually set the default region in the environment with

os.environ['AWS_DEFAULT_REGION'] = 'eu-central-1'

then the required time drops to around 3.5 seconds.
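For that workaround to take effect, the variable has to be set before s3fs first talks to S3, since botocore reads it when the client is created; a sketch of the full sequence (bucket name and region are placeholders):

import os

# Must happen before fsspec/s3fs create their S3 client
os.environ['AWS_DEFAULT_REGION'] = 'eu-central-1'

import fsspec
import xarray as xr

x = xr.tutorial.open_dataset("rasm")
target = fsspec.get_mapper("s3://bucket/target.zarr")
task = x.to_zarr(target, compute=False)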
So my questions are:

1. Is there any way to pass the region to fsspec (or s3fs)? I've tried adding s3_additional_kwargs={"region": "eu-central-1"} to the get_mapper call, but that didn't do anything.
2. Is there any better way to interface with zarr on S3 from xarray than the above (with fsspec)?
Versions:
xarray: 0.17.0
zarr: 2.6.1
fsspec: 0.8.4
Checking the s3fs documentation, region_name is listed among the client keyword arguments, and there is also an fsspec issue about specifying the region. So you can pass client_kwargs={'region_name': 'eu-central-1'} to get_mapper, like:
fsspec.get_mapper("s3://bucket/target.zarr",
client_kwargs={'region_name':'eu-central-1'})
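Put together with the example from the question, that would look something like this (bucket name and region are placeholders):

import fsspec
import xarray as xr

x = xr.tutorial.open_dataset("rasm")

# client_kwargs are forwarded to the underlying aiobotocore client,
# so the region is known up front and no redirects are needed
target = fsspec.get_mapper("s3://bucket/target.zarr",
                           client_kwargs={'region_name': 'eu-central-1'})
task = x.to_zarr(target, compute=False)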
Also, zarr is widely used for large datasets.