python arrays pandas max netcdf

Find max value for each year in 3D array NetCDF file (way to use Pandas or xarray?)


I am trying to make some maps from data in several NetCDF files. Each file contains 5 years' worth of data, stored in a 3D array of shape (14608, 145, 192) (time, lat, lon).

I would like the maximum value for each year at each coordinate, so in the end I'll have an output array with shape (5, 145, 192) (one value per year for each lat/lon pair).

It has been suggested that I try pandas, specifically DataFrame and DatetimeIndex, but I couldn't find a way to use it for anything larger than a 2D array. Xarray was also suggested, but I haven't used it before and wouldn't know where to start.

Edit 1: Sample Data

Here is a simplified version of what I've been trying to do with pandas. I then realized that DataFrame doesn't work for a 3D array.

import numpy as np
import pandas as pd

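# 3-hourly fake data covering 1985-1989, shape (time, lat, lon)
fake = np.random.randint(2, 30, size=(14608, 145, 192))
index = pd.date_range(start='1985-01-01 01:30:00', end='1989-12-31 22:30:00', freq='3H')

# This is the step that fails: DataFrame raises a ValueError for 3D input
df = pd.DataFrame(data=fake, index=index)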

Edit 2: Fixed Listed Array Shape

To clarify, I actually want an array with shape (5, 145, 192) as the output. I wrote it wrong at first because I was originally splitting the 3D array into 5 separate arrays, finding the max of each, and then stacking them back into one array, which ended up with a shape of (5, 145, 192).

I want to skip the tedious manual splitting of the array that I was doing before and simplify the code.


Solution

  • Here's how you could approach this using Xarray:

    import xarray as xr
    
    # open one of your files
    ds = xr.open_dataset('path/to/your/ncfile.nc')
    
    # find maximum for a specific year (1990 in this example)
    ds_ymax = ds.sel(time=slice('1990-01-01', '1990-12-31')).max('time')
    
    # plot a single variable ('temperature' in this example)
    ds_ymax['temperature'].plot()
    

    While that covers the basics of what you're trying to do, there are a few other common workflow steps worth mentioning:

    1. Open multiple files at once. Xarray provides an open_mfdataset function that opens and concatenates multiple files in a single call:

      ds = xr.open_mfdataset('path/to/your/ncfiles/*nc')  # note the use of the wildcard
      
    2. Use resample to calculate annual maximum values. In my example above, I manually selected a single year's worth of data, but it is possible to do this programmatically using resample or groupby:

      # using resample ('AS' == annual, starting Jan 1; newer pandas/xarray releases spell this 'YS')
      ds_ymax = ds.resample(time='AS').max('time')
      
      # using groupby
      ds_ymax = ds.groupby('time.year').max('time')
      
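To see the groupby approach produce the shape asked for in the question, here is a minimal sketch on synthetic data. The small 4 x 6 grid, the variable name "fake", and the coordinate values are assumptions chosen only to keep the example light; the real data would come from open_dataset or open_mfdataset instead.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 3-hourly time axis for 1985-1989, matching the question (14608 steps)
time = pd.date_range("1985-01-01 01:30", "1989-12-31 22:30",
                     freq=pd.Timedelta(hours=3))

# Hypothetical small grid (4 lat x 6 lon) instead of the real 145 x 192
rng = np.random.default_rng(0)
values = rng.integers(2, 30, size=(len(time), 4, 6))

da = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": time,
        "lat": np.linspace(-90, 90, 4),
        "lon": np.linspace(0, 360, 6, endpoint=False),
    },
    name="fake",
)

# Group the time steps by calendar year and reduce over time within each group
annual_max = da.groupby("time.year").max("time")

# Make the dimension order explicit: (year, lat, lon)
annual_max = annual_max.transpose("year", "lat", "lon")

print(annual_max.shape)
print(annual_max["year"].values)
```

The result carries a "year" coordinate, so a single year can be picked out with annual_max.sel(year=1987), and annual_max.values gives the plain NumPy array.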

    Finally, you mentioned not knowing where to start with xarray. Take a look at the documentation: http://xarray.pydata.org/en/latest/index.html
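If you would rather stay with pandas, one possible workaround (not part of the xarray approach above) is to flatten the two spatial axes so the data becomes 2D, group by year, and reshape back afterwards. This is a sketch under assumptions: the grid is shrunk to a hypothetical 4 x 6, and the names n_lat, n_lon, and yearly_max are made up for illustration.

```python
import numpy as np
import pandas as pd

n_lat, n_lon = 4, 6  # hypothetical small grid; the real data is 145 x 192

index = pd.date_range("1985-01-01 01:30", "1989-12-31 22:30",
                      freq=pd.Timedelta(hours=3))
rng = np.random.default_rng(0)
fake = rng.integers(2, 30, size=(len(index), n_lat, n_lon))

# Flatten (lat, lon) into one axis so every grid cell becomes a column
df = pd.DataFrame(fake.reshape(len(index), -1), index=index)

# Group the rows by calendar year and take the column-wise maximum
yearly = df.groupby(df.index.year).max()

# Restore the spatial axes: (n_years, n_lat, n_lon)
yearly_max = yearly.to_numpy().reshape(-1, n_lat, n_lon)

print(yearly_max.shape)
```

On the full 145 x 192 grid this creates a DataFrame with 27840 columns, which should work but uses a lot of memory; xarray keeps the dimensions labeled and avoids the reshaping.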