Thanks in advance for any help and insight. I am still new/learning R.
I am working with CMIP6 historical data across several models for several variables. Ideally, my baseline would run from 1850 to 1950. In this example, I have monthly temperature data for 2 different climate models; I'd like to work across 6 models. Both models use "days since 1850-01-01" to account for time, at 29.5-day increments. Once I create an array, I can slice data out at 1, 13, 25, etc. to represent a given month/year.
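For instance, with 1,200 monthly time steps, every January in the record sits at:
jan_idx <- seq(1, 1200, by = 12)   # 1, 13, 25, ... one index per year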
I would like to extract data as an average across each historical time period for each month, so one data layer/file per month, averaged across 100 years. I have code worked out to extract a single month, for a single year, for a single model, but this would be brutal for a baseline from 1850 to 1950, or even 1900 to 1950 (100 years * 12 months * 6 models = 7,200 extractions, each taking ~10 lines of my code, so roughly 72,000 rows of code).
Is there a for loop or other more elegant/simple approach to make this feasible? When I say I am a newb to R, I mean I would not even be sure how to write a for loop using this data structure.
I would prefer to avoid reducing the number of years and number of models to make this a feasible lift. I could copy, paste, and manually adjust, but this would be very time consuming, and I know the more experienced coders are shaking their heads/laughing at me :).
I am using the following (brutally simple, inelegant) code:
library(raster)
library(RNetCDF)
library(ncdf4)
####open .nc file for each model, variable, and time period###
nc_data <- nc_open('tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc')
nc_data2 <- nc_open('tas_Amon_GISS-E2-1-G_historical_r101i1p1f1_gn_185001-194912.nc')
lon <- ncvar_get(nc_data, "lon_bnds")
lat <- ncvar_get(nc_data, "lat_bnds")
t <- ncvar_get(nc_data, "time_bnds")
lon2 <- ncvar_get(nc_data2, "lon_bnds")
lat2 <- ncvar_get(nc_data2, "lat_bnds")
t2 <- ncvar_get(nc_data2, "time_bnds")
head(t)
nc_data$dim ###QA/QC: lat len 180; lon len 288; time len 1200; note time $vals starts at 15.5, then increases by 29.5 thereafter, e.g. 15.5, 45.0, 74.5, etc.###
tas.arrayGFDL <- ncvar_get(nc_data, "tas") ###store the data in a 3-dimensional array###
dim(tas.arrayGFDL) ###QA/QC: matches the lengths given by nc_data$dim###
tas.jan1850GFDL <- tas.arrayGFDL[, , 1] ###first slice is jan 1850###
dim(tas.jan1850GFDL) ##correct lat, lon##
tas.jan1851GFDL <- tas.arrayGFDL[, , 13] ###13th slice is jan 1851###
dim(tas.jan1851GFDL) ##correct lat, lon##
tas.jan1850rGFDL <- raster(t(tas.jan1850GFDL), xmn=min(lon), xmx=max(lon), ymn=min(lat), ymx=max(lat), crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))
tas.jan1851rGFDL <- raster(t(tas.jan1851GFDL), xmn=min(lon), xmx=max(lon), ymn=min(lat), ymx=max(lat), crs=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))
plot(tas.jan1850rGFDL)
plot(tas.jan1851rGFDL)
tas.jan1850rfGFDL <- flip(tas.jan1850rGFDL, direction='y')
tas.jan1851rfGFDL <- flip(tas.jan1851rGFDL, direction='y')
plot(tas.jan1850rfGFDL)
plot(tas.jan1851rfGFDL)
writeRaster(tas.jan1850rfGFDL, "tas.jan1850rfGFDL.tif", "GTiff", overwrite=TRUE)
writeRaster(tas.jan1851rfGFDL, "tas.jan1851rfGFDL.tif", "GTiff", overwrite=TRUE)
Then repeat for each month, each year, each model.
With package ncdfCF you get a more intelligent solution for netCDF files that follow the CF Metadata Conventions (CF), as all CMIP6 files do (among very many other data sets). It is pure R code, with no need for external libraries.
For a single file it works like the blocks of code below. This is the step-by-step version so that you can see and learn what is going on. The compact version for multiple models and variables is further down.
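Both packages used below are on CRAN, so installing them (once) is just:
install.packages(c("CFtime", "ncdfCF"))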
library(ncdfCF)
library(CFtime)
ds <- open_ncdf("GFDL_file.nc")
ds
#> <Dataset> ts_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_18500116-19491216
#> Resource : ~/ts_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_18500116-19491216.nc
#> Format : netcdf4
#> Conventions: CF-1.7 CMIP-6.0 UGRID-1.0
#> Keep open : FALSE
#> Has groups : FALSE
#>
#> Variable:
#> name long_name units data_type axes
#> ts Surface Temperature K NC_FLOAT lon, lat, time
#>
#> Axes:
#> id axis name long_name length unlim values unit
#> 0 T time 1200 U [1850-01-16 12:00:00 ... 1949-12-16 12:00:00] days since 1850-01-01
#> 1 bnds vertex number 2 [1 ... 2]
#> 2 Y lat latitude 180 [-89.5 ... 89.5] degrees_north
#> 3 X lon longitude 288 [0.625 ... 359.375] degrees_east
#>
#> Attributes:
#> (omitted for brevity)
First of all, the file is slightly different from yours: it is GFDL-ESM4 with variable "ts", but that should not make any difference. Some other interesting details:
- The file follows the CF-1.7 metadata conventions. Just the way we like it.
- There is a single data variable, "ts" in this case. CMIP6 files always have only a single data variable, but not every netCDF package will limit itself to just the data variables. You can also find the name with names(ds), as shown right after this list.
- The variable has 3 axes, lon, lat and time, and we indeed see them in the list of axes (the bnds "axis" is not actually an axis, so we can forget about that one).
Getting the data variable is easy. But that still doesn't get you the actual data; you need to explicitly ask for it (the reason being that you might want to subset to a specific area or time period before doing the actual reading of any data from file).
# Get the data variable `ts`
ts <- ds[["ts"]]
# Read the actual data from file into a "CFData" object
d <- ts$data()
d
#> <Data> ts
#> Long name: Surface Temperature
#>
#> Values: [192.6449 ... 320.3623] K
#> NA: 0 (0.0%)
#>
#> Axes:
#> id axis name long_name length unlim values unit
#> 3 X lon longitude 288 [0.625 ... 359.375] degrees_east
#> 2 Y lat latitude 180 [-89.5 ... 89.5] degrees_north
#> 0 T time 1200 U [1850-01-16 12:00:00 ... 1949-12-16 12:00:00] days since 1850-01-01
#>
#> Attributes:
#> (omitted)
# Now get the data into an R array
arr <- d$array()
str(arr)
#> num [1:180, 1:288, 1:1200] 241 241 241 241 242 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ lat : chr [1:180] "89.5" "88.5" "87.5" "86.5" ...
#> ..$ lon : chr [1:288] "0.625" "1.875" "3.125" "4.375" ...
#> ..$ time: chr [1:1200] "1850-01-16 12:00:00" "1850-02-15 00:00:00" "1850-03-16 12:00:00" "1850-04-16 00:00:00" ...
This array is oriented in the regular R way, so-called "column-major order". So the first dimension is decreasing latitude, then longitude, then time as 1,200 months. Turning this into monthly averages is done using a factor, from base R. Here, however, I am using package CFtime to create the factor because that is better suited to working with climate projection data.
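To see what such a factor does, here is a tiny base R sketch with toy numbers (not CMIP6 data): a factor of month labels groups 24 monthly values, and tapply() averages within each group.
# Two years of fake monthly values
x <- 1:24
m <- factor(rep(month.abb, 2), levels = month.abb)
# Mean per month across both years
tapply(x, m, mean)
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#>   7   8   9  10  11  12  13  14  15  16  17  18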
So let's look at the "time" dimension:
t <- d$axes[["time"]]
t
#> <Time axis> [0] time
#> Length : 1200 (unlimited)
#> Axis : T
#> Range : 1850-01-16 12:00:00 ... 1949-12-16 12:00:00 (days)
#> Bounds : 1850-01-01 ... 1950-01-01
#>
#> Attributes:
#> id name type length value
#> 0 long_name NC_CHAR 4 time
#> 1 axis NC_CHAR 1 T
#> 2 calendar_type NC_CHAR 6 noleap
#> 3 bounds NC_CHAR 9 time_bnds
#> 4 standard_name NC_CHAR 4 time
#> 5 description NC_CHAR 13 Temporal mean
#> 6 units NC_CHAR 21 days since 1850-01-01
#> 7 calendar NC_CHAR 6 noleap
Again some insights:
- The "time" axis has bounds (a separate variable "time_bnds", as ncdf4 calls it, but according to CF, the bounds of a data variable are metadata, not a separate variable).
- The calendar is "noleap". CFtime is specifically designed to deal with the vagaries of the various types of "model time" defined by CF. Note that other models may use other calendars.
You can get your hands on the CFtime object and then create a factor for monthly data over the period 1850-1949:
# The CFtime object underlying the "time" axis
cft <- t$values
# A factor with one level per month, over the 1850-1949 epoch
fac <- CFfactor(cft, epoch = 1850:1949)
In the final step you can calculate the monthly means over the entire period with straight R:
# Read the below as: apply over the first 2 dimensions, latitude and longitude,
# the function tapply(), which applies the mean() function using the factor
# over the third dimension, time.
res <- apply(arr, 1:2, tapply, fac, mean)
# tapply() changes the axis order, so change back
res <- aperm(res, c(2, 3, 1))
str(res)
#> num [1:180, 1:288, 1:12] 240 240 240 241 242 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ lat: chr [1:180] "89.5" "88.5" "87.5" "86.5" ...
#> ..$ lon: chr [1:288] "0.625" "1.875" "3.125" "4.375" ...
#> ..$ : chr [1:12] "01" "02" "03" "04" ...
Now you can further process the array as you please.
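For instance (a small sketch; res_degC and jan_mean are just illustrative names): the values are in kelvin per the printout above, so you could convert to degrees Celsius and pull a single month out by its dimname.
# Convert from K to degrees Celsius
res_degC <- res - 273.15
# The 12 monthly layers are named "01" ... "12"
jan_mean <- res_degC[, , "01"]
range(jan_mean)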
Easy! That's 12,000 lines of your code done.
Putting it all in a function is probably the best way to code once, run often. You can then also add further processing or other tweaks to simplify your analysis.
processCMIP6model <- function(fn) {
  ds <- open_ncdf(fn)               # open the netCDF resource
  var <- ds[[names(ds)]]            # the single CMIP6 data variable
  d <- var$data()                   # read the data from file
  arr <- d$array()                  # ...and turn it into an R array
  t <- d$axes[["time"]]$values      # the CFtime object of the "time" axis
  fac <- CFfactor(t, epoch = 1850:1949)
  aperm(apply(arr, 1:2, tapply, fac, mean), c(2, 3, 1))
}
Assuming your 6 CMIP6 models are in a single directory you can do it all in one sweep:
lf <- list.files("~/CMIP6/data/directory", "\\.nc$", full.names = TRUE)
res <- lapply(lf, processCMIP6model)
names(res) <- basename(lf)
res is a list with one element for each of the files in your directory. Note that you can do this with as many CMIP6 files as you have, mixing variables as you go.
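As a quick sanity check (purely illustrative), you can confirm that every file produced 12 monthly layers on its model's grid:
sapply(res, dim)   # one column per file: lat, lon, and 12 months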
You can easily plot the resulting list using the terra package (which is the successor to the raster package):
# Starting with the list of model results from above
model1r <- terra::rast(res[[1]])
# Optionally, set layer names
names(model1r) <- month.abb
model1r
#> class : SpatRaster
#> dimensions : 180, 288, 12 (nrow, ncol, nlyr)
#> resolution : 1, 1 (x, y)
#> extent : 0, 288, 0, 180 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source(s) : memory
#> names : Jan, Feb, Mar, Apr, May, Jun, ...
#> min values : 228.2004, 221.2835, 208.5045, 202.0122, 200.8219, 200.0914, ...
#> max values : 313.0564, 310.6226, 311.3170, 311.8899, 313.6404, 317.6213, ...
plot(model1r, "Jan")
Other kinds of analysis can similarly be done directly using the array in the res list.
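And if you want GeoTIFF output like in your original code: note in the printout above that the SpatRaster has no extent or CRS yet, so set those first. A sketch for this GFDL grid (0 to 360 longitude, -90 to 90 latitude; the file name is just an example, and other models will have other grids):
terra::ext(model1r) <- c(0, 360, -90, 90)        # xmin, xmax, ymin, ymax
terra::crs(model1r) <- "+proj=longlat +datum=WGS84"
terra::writeRaster(model1r, "tas_GFDL_monthly_means.tif", overwrite = TRUE)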
72,000 lines of code done.