
xarray.open_dataset uses more than double the size of the file itself


All, I am opening NetCDF files from the Copernicus data center with xarray version 2024.11.0, using the open_dataset function as follows:

import xarray as xr
file1=xr.open_dataset("2021-04.nc")
tem  = file1['t2m']

The NetCDF file is available on the box; the reader can also download any sample file from the aforementioned data center.

Although the file size is 16.6 MB, the tem variable seems to take more than double the size of the actual file, as can be seen below (end of the first line) or monitored with the top command:

<xarray.DataArray 't2m' (valid_time: 30, latitude: 411, longitude: 791)> Size: 39MB
[9753030 values with dtype=float32]
Coordinates:
    number      int64 8B ...
  * latitude    (latitude) float64 3kB 38.0 37.9 37.8 37.7 ... -2.8 -2.9 -3.0
  * longitude   (longitude) float64 6kB -18.0 -17.9 -17.8 ... 60.8 60.9 61.0
  * valid_time  (valid_time) datetime64[ns] 240B 2021-04-01 ... 2021-04-30
Attributes: (12/32)
    GRIB_paramId:                             167
    GRIB_dataType:                            fc
    GRIB_numberOfPoints:                      325101
    GRIB_typeOfLevel:                         surface
    GRIB_stepUnits:                           1
    GRIB_stepType:                            instant
                                      ...
    GRIB_totalNumber:                         0
    GRIB_units:                               K
    long_name:                                2 metre temperature
    units:                                    K
    standard_name:                            unknown
    GRIB_surface:                             0.0
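For reference, the 39MB figure in the first line of the output can be reproduced from the shape and dtype alone. The snippet below is a minimal sketch that only uses the numbers printed above; the variable names are illustrative.

```python
import math

# Dimensions taken from the repr: valid_time, latitude, longitude
shape = (30, 411, 791)
itemsize = 4  # bytes per float32 value

n_values = math.prod(shape)    # 9753030, matching "[9753030 values ...]"
n_bytes = n_values * itemsize  # uncompressed size of the array in bytes

print(n_values)
print(f"{n_bytes / 1e6:.1f} MB")
```

So the reported size is simply the number of values times four bytes, independent of how large the file is on disk.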

Any idea why xarray uses all that memory? This is not a problem for small files, but it is a serious problem for large files and for heavy computations in which many copies of the same variable are created.

I can use file1['t2m'].astype('float16'), which halves the size, but I found that most values are rounded to roughly the first decimal place, so I am losing actual data. I want to read the actual data without using memory beyond the size of the data file.

This is what the data looks like when read as float32:

<xarray.DataArray 't2m' (valid_time: 30)> Size: 120B
array([293.87134, 296.0669 , 299.4065 , 302.60474, 305.29443, 306.87646,
       301.10645, 302.47388, 299.23267, 294.26587, 295.239  , 299.19238,
       302.20923, 307.48193, 307.2202 , 310.6953 , 315.64746, 312.76416,
       305.2173 , 299.25488, 299.9475 , 302.3435 , 306.32422, 312.75342,
       299.99878, 300.59155, 303.36475, 307.11768, 308.49292, 310.6853 ],
      dtype=float32)
Coordinates:

and this is what it looks like as float16:

<xarray.DataArray 't2m' (valid_time: 30)> Size: 60B
array([293.8, 296. , 299.5, 302.5, 305.2, 307. , 301. , 302.5, 299.2,
       294.2, 295.2, 299.2, 302.2, 307.5, 307.2, 310.8, 315.8, 312.8,
       305.2, 299.2, 300. , 302.2, 306.2, 312.8, 300. , 300.5, 303.2,
       307. , 308.5, 310.8], dtype=float16)
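The rounding seen above is what half precision is expected to do for values around 300. As a rough illustration, the sketch below uses only the standard library (the struct format "e" is IEEE 754 half precision) rather than xarray itself; the helper names to_float16 and float16_step are made up for this example.

```python
import math
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def float16_step(value: float) -> float:
    """Gap between neighbouring float16 values near `value` (normal range only)."""
    _, exp = math.frexp(abs(value))   # value = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** ((exp - 1) - 10)    # float16 stores 10 fraction bits


# First three float32 values from the output above
sample = [293.87134, 296.0669, 299.4065]
rounded = [to_float16(v) for v in sample]

print(rounded)
print(float16_step(300.0))
```

Between 256 and 512 the representable float16 values are 0.25 apart, so temperatures in kelvin cannot keep more than about one decimal place; the float16 output above shows these values in shortened form.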

Moreover, when I load the data into RAM and trace the amount of memory being used, it is several times the actual size of the file data.

import psutil
process = psutil.Process()
# resident memory before the values are read
print("memory used in MB =", process.memory_info().rss / 1024**2)
tem.data  # access the values, which reads the variable into memory
# resident memory after the values are read
print("memory used in MB =", process.memory_info().rss / 1024**2)

Thanks


Solution

  • Credit goes to kmuehlbauer (https://github.com/pydata/xarray/issues/9946#issuecomment-2587287969). The data in the file is compressed, so it occupies more space once it is decompressed in memory.

    This is an excerpt of the output of h5dump -Hp 2021-04.nc:

    DATASET "t2m" {
          DATATYPE  H5T_IEEE_F32LE
          DATASPACE  SIMPLE { ( 30, 411, 791 ) / ( 30, 411, 791 ) }
          STORAGE_LAYOUT {
             CHUNKED ( 15, 206, 396 )
             SIZE 12481395 (3.126:1 COMPRESSION)
          }
          FILTERS {
             PREPROCESSING SHUFFLE
             COMPRESSION DEFLATE { LEVEL 1 }
          }
          ...
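The numbers in this excerpt are consistent with the in-memory size reported by xarray. The following sketch checks that with plain arithmetic; it only uses values copied from the h5dump output, and the variable names are illustrative.

```python
import math

# Values copied from the h5dump excerpt
shape = (30, 411, 791)         # DATASPACE
chunk_shape = (15, 206, 396)   # CHUNKED
itemsize = 4                   # H5T_IEEE_F32LE is a 4-byte float
stored_bytes = 12481395        # SIZE, i.e. compressed bytes on disk

raw_bytes = math.prod(shape) * itemsize   # size once decompressed
ratio = raw_bytes / stored_bytes

# Number of chunks along each dimension (partial edge chunks still count)
chunks_per_dim = [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]
n_chunks = math.prod(chunks_per_dim)
chunk_bytes = math.prod(chunk_shape) * itemsize

print(f"decompressed: {raw_bytes} bytes, ratio {ratio:.3f}:1")
print(f"{n_chunks} chunks of {chunk_bytes} bytes each when decompressed")
```

The computed ratio matches the 3.126:1 reported by h5dump: the variable takes about 12.5 MB in the file but about 39 MB as an uncompressed float32 array, which is the size xarray reports.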