pythonnetcdfcdo-climateweatherdata

NetCDF: calculate annual sum of precipitation along all coordinates using MSWEP reanalysis dataset


I am working with a nc file that has been created by merging single monthly nc files. The actual file ('precip_combined.nc') contains global monthly precipitation data from 1979-2020 (size: 3.79 GB):

<xarray.Dataset>
Dimensions:        (lon: 3600, lat: 1800, time: 504)
Coordinates:
  * lon            (lon) float32 -179.9 -179.8 -179.8 ... 179.8 179.9 179.9
  * lat            (lat) float32 89.95 89.85 89.75 ... -89.75 -89.85 -89.95
  * time           (time) datetime64[ns] 1979-01-01 1979-02-01 ... 2020-12-01
Data variables:
    precipitation  (time, lat, lon) float32 ...
Attributes:
    history:          Created on 2022-10-09 13:23
    input_data_hash:  1ff7f07166074240030172bae541afd3ecbc27ef941e50a65b04a26...type here

The goal is:

  1. to calculate for each given coordinate (lat, lon) the annual sum of precipitation, based on the hydrological ("water") year which is commonly defined from October of year n until September of year n+1.
  2. store the results in a nc file with the new variable precipitation_sums for each year and for each coordinate.

I used this code:

import xarray as xr

# Open the netCDF file
data = xr.open_dataset('.../precip_combined.nc')

# Define the start and end months of the hydrological year
start_month = 10
end_month = 9

# Initialize an empty list to store the sums of precipitation
precipitation_sums = []

# Loop through each year from 1979 to 2020
for year in range(1979, 2021):
    # Define the start and end dates of the hydrological year
    start_date = f'{year}-{start_month:02d}-01'
    end_date = f'{year+1}-{end_month:02d}-30'
    
    # Select the data for the hydrological year
    hydro_year_data = data.sel(time=slice(start_date, end_date))
    
    # Calculate the sum of precipitation for each lon and lat coordinate
    precipitation_sum = hydro_year_data['precipitation'].sum(dim=['lon', 'lat'])
    
    # Append the sum to the list
    precipitation_sums.append(precipitation_sum)

# Convert the list of sums to a new xarray dataset
precipitation_sums_dataset = xr.concat(precipitation_sums, dim='time')

# Add the precipitation sums dataset to the original dataset
data['precipitation_sums'] = precipitation_sums_dataset

# Save the updated dataset as a new netCDF file
data.to_netcdf('.../precip_annual_sums.nc')

and got as an output a nc file with following shapes:

<xarray.Dataset>
Dimensions:             (lon: 3600, lat: 1800, time: 504)
Coordinates:
  * lon                 (lon) float32 -179.9 -179.8 -179.8 ... 179.8 179.9 179.9
  * lat                 (lat) float32 89.95 89.85 89.75 ... -89.75 -89.85 -89.95
  * time                (time) datetime64[ns] 1979-01-01 ... 2020-12-01
Data variables:
    precipitation       (time, lat, lon) float32 ...
    precipitation_sums  (time) float32 ...
Attributes:
    history:          Created on 2022-10-09 13:23
    input_data_hash:  1ff7f07166074240030172bae541afd3ecbc27ef941e50a65b04a26...type here

As you see, the Data variable precipitation_sums does not contain any coordinates. For example, if I read the created nc file and extract data for London, I get no output.

Note that I need the global dataset, not for a specific location as asked here https://stackoverflow.com/questions/65072907/calculate-monthly-average-and-annual-sum-of-precipitation-from-netcdf-with-cdo

I would appreciate any help. There is no comprehensible example available on GitHub etc., probably the task is too easy for pros.


Solution

  • As you tagged the question with cdo I presume a cdo answer is also acceptable for you, in this case you can shift the time index back 9 months, so October is now January, sum over the year, and then shift back so the timestamp is correct:

    cdo shifttime,9months -yearsum -shifttime,-9months in.nc out.nc
    

    Note:

    1. You can call this from within python using system commands or the cdo package
    2. You can also use the option --timestat_date with options last and first to choose where the timestamp is defined (default is mid, so the data will April)