rnetcdf

CMIP6, automating by month


Thanks in advance for any help and insight. I am still new/learning R.

I am working with CMIP6 historical data across several models for several variables. Ideally, my baseline would run from 1850 - 1950. In this example, I have monthly temperature data for 2 different climate models, I'd like to work across 6 models. Both models use "days since 1850 01 01" to account for time at 29.5 day increments. Once I create an array, I can slice data out at 1, 13, 25, etc. to represent a given month/year.

I would like to extract data as an average across each historical time period for each month. So one data layer/file per month averaged across 100 years. I have code worked out to extract a single month, for a single year, for a single model, but this would be brutal for a baseline from 1850 - 1950, or even 1900 - 1950. (100 years * 12 months * 6 models = 72,000 rows of code).

Is there a for loop or other more elegant/simple approach to make this feasible? When I say I am a newb to R, I would be unablenot sure how to write a forfor loop using this data structure.

I would prefer to avoid reducing the number of years and number of models to make this a feasible lift. I could copy and paste and manually adjust, but this would be very time consuming, and I know the more experienced coders are shaking their heads/laughing at me:).

I am using the following (brutally simple unelegant) code:

library(raster)
library(RNetCDF)
library(ncdf4)

####open .nc file for each model, variable, and time period###

nc_data <- nc_open('tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc')
nc_data2 <- nc_open('tas_Amon_GISS-E2-1-G_historical_r101i1p1f1_gn_185001-194912.nc')

lon <- ncvar_get(nc_data, "lon_bnds")
lat <- ncvar_get(nc_data, "lat_bnds")
t <- ncvar_get(nc_data, "time_bnds")

lon2 <- ncvar_get(nc_data2, "lon_bnds")
lat2 <- ncvar_get(nc_data2, "lat_bnds")
t2 <- ncvar_get(nc_data2, "time_bnds")

head(t)
nc_data$dim ###QA/QC: lat len 180; lon len 288; time len 1200; note time $vals starts at 15.5,then increases by 29.5 thereafter, e.g. 15.5, 45.0, 74.5 etc###

tas.arrayGFDL <- ncvar_get(nc_data, "tas") ###store the data in a 3-dimensional array###
dim(tas.arrayGFDL) ###QA/QC matches the length given nc_data$dim)###

tas.jan1850GFDL <- tas.arrayGFDL[, , 1] ###first slice is jan 1850###
dim(tas.jan1850GFDL) ##correct lat, lon##

tas.jan1851GFDL <- tas.arrayGFDL[, , 13] ###13th slice is jan 1851###
dim(tas.jan1851GFDL) ##correct lat, lon##

tas.jan1850rGFDL<- raster(t(tas.jan1850GFDL), xmn=min(lon), xmx=max(lon), ymn=min(lat), ymx=max(lat), crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs+ towgs84=0,0,0"))
tas.jan1851rGFDL<- raster(t(tas.jan1850GFDL), xmn=min(lon), xmx=max(lon), ymn=min(lat), ymx=max(lat), crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs+ towgs84=0,0,0"))

plot(tas.jan1850rGFDL)

Jan 1850 GFDL GCM

plot(tas.jan1851rGFDL)

tas.jan1850rfGFDL <- flip(tas.jan1850rGFDL, direction='y')
tas.jan1851rfGFDL <- flip(tas.jan1851rGFDL, direction='y')

plot(tas.jan1850rfGFDL)

Jan 1850 GFDL GCM flipped

plot(tas.jan1851rfGFDL)

writeRaster(tas.jan1850rfGFDL, "tas.jan1850rfGFDL.tif", "GTiff", overwrite=TRUE)

writeRaster(tas.jan1851rfGFDL, "tas.jan1851rfGFDL.tif", "GTiff", overwrite=TRUE)

Then repeat for each month, each year, each model.


Solution

  • With package ncdfCF you get a more intelligent solution for netCDF files that follow the CF Metadata Conventions (CF), as all CMIP6 files do (among very many other data sets). It is pure R code, no need for external libraries.

    Step-by-step

    For a single file it works like in the blocks of code below. This is the step-by-step version so that you see and learn what is going on. The compact version for multiple models and variables is further down.

    package(ncdfCF)
    package(CFtime)
    
    ds <- open_ncdf("GFDL_file.nc")
    ds
    #> <Dataset> ts_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_18500116-19491216 
    #> Resource   : ~/ts_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_18500116-19491216.nc 
    #> Format     : netcdf4 
    #> Conventions: CF-1.7 CMIP-6.0 UGRID-1.0 
    #> Keep open  : FALSE 
    #> Has groups : FALSE 
    #> 
    #> Variable:
    #>  name long_name           units data_type axes          
    #>  ts   Surface Temperature K     NC_FLOAT  lon, lat, time
    #> 
    #> Axes:
    #>  id axis name long_name     length unlim values                                        unit
    #>  0  T    time               1200   U     [1850-01-16 12:00:00 ... 1949-12-16 12:00:00] days since 1850-01-01
    #>  1       bnds vertex number    2         [1 ... 2]                                    
    #>  2  Y    lat  latitude       180         [-89.5 ... 89.5]                              degrees_north
    #>  3  X    lon  longitude      288         [0.625 ... 359.375]                           degrees_east
    #> 
    #> Attributes:
    #> (omitted for brevity)
    

    First of all, the file is slightly different than yours, it is GFDL-ESM4 with variable "ts", but that should not make any difference. Some other interesting details:

    Getting the data variable is easy. But that still doesn't get you the actual data, you need to explicitly ask for it (the reason being that you might want to subset to a specific area or time period before doing the actual reading of any data from file).

    # Get the data variable `ts`
    ts <- ds[["ts"]]
    
    # Read the actual data from file into a "CFData" object
    d <- ts$data()
    d
    #> <Data> ts 
    #> Long name: Surface Temperature 
    #> 
    #> Values: [192.6449 ... 320.3623] K
    #>     NA: 0 (0.0%)
    #> 
    #> Axes:
    #>  id axis name long_name length unlim values                                        unit                 
    #>  3  X    lon  longitude  288         [0.625 ... 359.375]                           degrees_east         
    #>  2  Y    lat  latitude   180         [-89.5 ... 89.5]                              degrees_north        
    #>  0  T    time           1200   U     [1850-01-16 12:00:00 ... 1949-12-16 12:00:00] days since 1850-01-01
    #> 
    #> Attributes:
    #> (omitted)
    
    # Now get the data into an R array
    arr <- d$array()
    str(arr)
    #>  num [1:180, 1:288, 1:1200] 241 241 241 241 242 ...
    #>  - attr(*, "dimnames")=List of 3
    #>   ..$ lat : chr [1:180] "89.5" "88.5" "87.5" "86.5" ...
    #>   ..$ lon : chr [1:288] "0.625" "1.875" "3.125" "4.375" ...
    #>   ..$ time: chr [1:1200] "1850-01-16 12:00:00" "1850-02-15 00:00:00" "1850-03-16 12:00:00" "1850-04-16 00:00:00" ...
    

    This array is oriented in the regular R way, so-called "column-major order". So the first dimension is decreasing latitude, then longitude, then time as 1,200 months. Turning this into monthly averages is done using a factor, from base R. Here, however, I am using package CFtime to create the factor because that is better suited to working with climate projection data.

    So let's look at the "time" dimension:

    t <- d$axes[["time"]]
    t
    #> <Time axis> [0] time
    #> Length   : 1200 (unlimited)
    #> Axis     : T 
    #> Range    : 1850-01-16 12:00:00 ... 1949-12-16 12:00:00 (days) 
    #> Bounds   : 1850-01-01 ... 1950-01-01 
    #> 
    #> Attributes:
    #>  id name          type    length value                
    #>  0  long_name     NC_CHAR  4     time                 
    #>  1  axis          NC_CHAR  1     T                    
    #>  2  calendar_type NC_CHAR  6     noleap               
    #>  3  bounds        NC_CHAR  9     time_bnds            
    #>  4  standard_name NC_CHAR  4     time                 
    #>  5  description   NC_CHAR 13     Temporal mean        
    #>  6  units         NC_CHAR 21     days since 1850-01-01
    #>  7  calendar      NC_CHAR  6     noleap
    

    Again some insights:

    You can get your hands on the CFtime object and then create a factor for monthly data over the period 1850-1949:

    cft <- t$values
    fac <- CFfactor(cft, epoch = 1850:1949)
    

    In the final step you can calculate the monthly means over the entire period with straight R:

    # Read the below as: apply over the first 2 dimensions, latitude and longitude,
    # the function tapply(), which applies the mean() function using the factor
    # over the third dimension, time.
    res <- apply(arr, 1:2, tapply, fac, mean)
    
    # tapply() changes the axis order, so change back
    res <- aperm(res, c(2, 3, 1))
    str(res)
    #>  num [1:180, 1:288, 1:12] 240 240 240 241 242 ...
    #>  - attr(*, "dimnames")=List of 3
    #>   ..$ lat: chr [1:180] "89.5" "88.5" "87.5" "86.5" ...
    #>   ..$ lon: chr [1:288] "0.625" "1.875" "3.125" "4.375" ...
    #>   ..$    : chr [1:12] "01" "02" "03" "04" ...
    

    Now you can further process the array as you please.

    Easy! That's 12,000 lines of your code done.

    In a function

    Putting it all in a function is probably the best way to code once, run often. You can then also add further processing or other tweaks to simplify your analysis.

    processCMIP6model <- function(fn) {
      ds <- open_ncdf(fn)
      var <- ds[[names(ds)]]
      d <- var$data()
      arr <- d$array()
      t <- d$axes[["time"]]$values
      fac <- CFfactor(t, epoch = 1850:1949)
      aperm(apply(arr, 1:2, tapply, fac, mean), c(2, 3, 1))
    }
    

    Assuming your 6 CMIP6 models are in a single directory you can do it all in one sweep:

    lf <- list.files("~/CMIP6/data/directory", ".nc$", full.names = TRUE)
    res <- lapply(lf, processCMIP6model)
    names(res) <- basename(lf)
    

    res is a list with one element for each of the files in your directory. Note that you can do this with as many CMIP6 files as you have, mixing variables as you go.

    Plotting and other analyses

    You can easily plot the resulting list using the terra package (which is the successor to the raster package):

    # Starting with the list of model results from above
    model1r <- terra::rast(res[[1]])
    
    # Optionally, set layer names
    names(model1r) <- month.abb
    model1r
    #> class       : SpatRaster 
    #> dimensions  : 180, 288, 12  (nrow, ncol, nlyr)
    #> resolution  : 1, 1  (x, y)
    #> extent      : 0, 288, 0, 180  (xmin, xmax, ymin, ymax)
    #> coord. ref. :  
    #> source(s)   : memory
    #> names       :      Jan,      Feb,      Mar,      Apr,      May,      Jun, ... 
    #> min values  : 228.2004, 221.2835, 208.5045, 202.0122, 200.8219, 200.0914, ... 
    #> max values  : 313.0564, 310.6226, 311.3170, 311.8899, 313.6404, 317.6213, ... 
    
    plot(model1r, "Jan")
    

    enter image description here

    Other kinds of analysis can similarly be done directly using the array in the res list.

    72,000 lines of code done.