I am working on ERA5 reanalysis data in NetCDF format, and I need to compute wind direction based on U and V components.
I already have a working piece of Python code to do so using xarray and pandas, and saving output as NetCDF.
While I do not add any additional data (only applying transformation to compute a new variable instead of U and V), I'm getting a huge increase in the size of the ouput file. Input files are about 30 Mo and the output one is about 300 Mo.
Can someone explain what is going on ? There must be something with the way my output file is being shaped, but I don't get how. Is it due to the encoding of the file, datatype used (float32 in both input and output here), or any other NetCDF format issue ?
Also, do you have any idea how I could optimize the size of the output file ?
To help you understand the differences, here is a summary of the input file :
<xarray.Dataset>
Dimensions: (latitude: 7, longitude: 8, time: 113952)
Coordinates:
number int64 ...
* time (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
step timedelta64[ns] ...
surface float64 ...
* latitude (latitude) float64 47.5 47.25 47.0 46.75 46.5 46.25 46.0
* longitude (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
valid_time (time) datetime64[ns] ...
Data variables:
u10 (time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2022-05-04T13:10 GRIB to CDM+CF via cfgrib-0.9.9...
And the output one :
<xarray.Dataset>
Dimensions: (latitude: 7, longitude: 8, time: 113952)
Coordinates:
* time (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
* latitude (latitude) float64 46.0 46.25 46.5 46.75 47.0 47.25 47.5
* longitude (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
Data variables:
number (time, latitude, longitude) int64 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
step (time, latitude, longitude) timedelta64[ns] 00:00:00 ... 00:0...
surface (time, latitude, longitude) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
valid_time (time, latitude, longitude) datetime64[ns] 1979-01-01 ... 199...
direction (time, latitude, longitude) float32 17.76 26.89 ... 180.1 178.5
Only difference I can see is that some of the original coordinates are now data variables.
Most importantly, all your variables are now indexed by (time, latitude, longitude)
, so will have the full array size of (7 x 8 x 113952)
. Previously, number
, step
, and surface
were scalars (not an array - just a single value), and valid_time
was only indexed by time
. Since all of these are 64-bit, your new array now has effectively 4 new variables each of which are twice the size as u10
. So that alone accounts for a 9x increase.
To ensure this doesn't happen, try being really careful to only do math & reshape operations with DataArrays, not Datasets. Xarray works perfectly well when working with Datasets, but the behavior isn't always intuitive when you're just getting the hang of it, and automatic broadcasting like this is one of the things that can catch you off guard. I always recommend that people do their work with DataArrays and then create a Dataset before write for this reason. See the docs on Broadcasting by dimension name and Automatic Alignment for more on this topic.
I'd also expect that if you got this data from ECMWF the source data may be compressed on disk, which is not the default for ds.to_netcdf
. See the docs on writing netCDFs for a discussion of compression options.