rnetcdfthreddsncml

How to get the right URL for thredds and load ncml data in R?


I'm trying to load some climate data from PAVICS with thredds in R.

library(ncdf4)
library(thredds)

url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip6/ouranos/ESPO-G/ESPO-G6-R2v1.0.0/derived/ensemble_members/catalog.xml"

# Create Catalog
cat = thredds::get_catalog(url) # Seems to be the same thredds::CatalogNode$new(url)

# Check the catalog 
cat$browse()

# Get the datasets
catds = cat$get_datasets()

# Verify one ncml file
data.test = catds$`annual_ssp585_ESPO-G6-R2v1.0.0._climindices_ensemble_members.ncml`
# Try to open the file 
ncdf4::nc_open(paste0("https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/", data.test$url))

But I get a file not found error

Error in R_nc4_open: NetCDF: file not found
Error in ncdf4::nc_open(paste0("https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/",  : 
  Error in nc_open trying to open file https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip6/ouranos/ESPO-G/ESPO-G6-R2v1.0.0/derived/ensemble_members/annual_ssp585_ESPO-G6-R2v1.0.0._climindices_ensemble_members.ncml (return_on_error= FALSE )

Solution

  • The base URL for the file is not correct. You have taken the URL from the catalog path but the actual data (for OPENDaP access) is under the dodsC branch. So you should do this:

    base_dodsC <- "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/"
    nc <- ncdf4::nc_open(paste0(base_dodsC, data.test$url))
    

    There are actually multiple paths to the data, depending on what tool you use to access the data. All of these have a specific base URL. Have a look here.

    Unfortunately, the thredds library does not standardise how to construct URLs so it's a matter of trial-and-error.

    Note also that you are accessing an NcML file, which describes a data set but which does not contain the data. In effect, the NcML file contains the metadata that you would read when you ncdf4::nc_open() a file. The NcML file itself is an XML document that contains references to the underlying data, possibly distributed over multiple files. You can use the usual ncdf4 syntax to discover file contents and access the data:

    names(nc$var)
    #  [1] "rotated_pole"                  "lat"                          
    #  [3] "lon"                           "dlyfrzthw"                    
    #  [5] "growing_degree_days_gt_4"      "heat_wave_frequency_class1"   
    #  [7] "heat_wave_frequency_class2"    "heat_wave_frequency_class3"   
    #  [9] "heat_wave_total_length_class1" "heat_wave_total_length_class2"
    # [11] "heat_wave_total_length_class3" "liquidprcptot"                
    # [13] "prcptot"                       "rx5day"                       
    # [15] "solidprcptot"                  "tg_mean"                      
    # [17] "tn_days_above_20"              "tn_days_above_22"             
    # [19] "tn_mean"                       "tx_days_above_30"             
    # [21] "tx_days_above_32"              "tx_mean"                      
    # [23] "lakeFrac"                      "sftlf"                        
    # [25] "sftof"
    
    lat <- ncvar_get(nc, "lat")
    str(lat)
    # num [1:706, 1:800] 5.76 5.79 5.82 5.85 5.88 ...
                            
    

    The regular variables I looked at are each about 8.9GB in size so you'll want to subset before you download. That is 800 latitude x 706 longitude in a rotated pole grid, for 151 time steps (1950 - 2100), by 26 realizations (= GCM model runs). These realizations are:

    > pavics[["realization"]]$values
     [1] "TaiESM1:r1i1p1f1"       "BCC-CSM2-MR:r1i1p1f1"   "FGOALS-g3:r1i1p1f1"    
     [4] "CanESM5:r1i1p1f1"       "CMCC-ESM2:r1i1p1f1"     "CNRM-CM6-1:r1i1p1f2"   
     [7] "CNRM-ESM2-1:r1i1p1f2"   "ACCESS-ESM1-5:r1i1p1f1" "ACCESS-CM2:r1i1p1f1"   
    [10] "MPI-ESM1-2-HR:r1i1p1f1" "EC-Earth3:r1i1p1f1"     "EC-Earth3-CC:r1i1p1f1" 
    [13] "EC-Earth3-Veg:r1i1p1f1" "INM-CM4-8:r1i1p1f1"     "INM-CM5-0:r1i1p1f1"    
    [16] "IPSL-CM6A-LR:r1i1p1f1"  "MIROC-ES2L:r1i1p1f2"    "MIROC6:r1i1p1f1"       
    [19] "UKESM1-0-LL:r1i1p1f2"   "MPI-ESM1-2-LR:r1i1p1f1" "MRI-ESM2-0:r1i1p1f1"   
    [22] "NorESM2-LM:r1i1p1f1"    "NorESM2-MM:r1i1p1f1"    "KACE-1-0-G:r1i1p1f1"   
    [25] "GFDL-ESM4:r1i1p1f1"     "NESM3:r1i1p1f1"
    

    Extracting Québec

    As per your comment, you want to extract data for Québec province in Canada. There are a few obstacles you have to overcome, after selecting which variable and model/realization you want to use. Let's first see what's in the netCDF resource. For this I am using package ncdfCF, in its development version on GitHub.

    pavics <- ncdf_open(paste0(base_dodsC, data.test$url))
    tg <- pavics[["tg_mean"]]
    tg
    # <Variable> tg_mean 
    # Long name: Mean daily mean temperature 
    # 
    # Axes:
    #  id axis name         long_name                             length values                                             unit                 
    #  3  X    rlon         longitude in rotated pole grid        706    [324.60278 ... 388.0528]                           degrees              
    #  2  Y    rlat         latitude in rotated pole grid         800    [-46.17 ... 25.74]                                 degrees              
    #  4  T    time                                               151    [1950-01-01 ... 2100-01-01]                        days since 1950-01-01
    #  1       realization                                         26    [TaiESM1:r1i1p1f1, BCC-CSM2-MR:r1i1p1f1, FGOALS...                      
    #          rotated_pole coordinates of the rotated North Pole   1    [9.96920996838687e+36]                                                  
    # 
    # Auxiliary longitude-latitude grid:
    #  orientation axis name extent                 unit         
    #  longitude   X    lon  [-179.992 ... 179.973] degrees_east 
    #  latitude    Y    lat  [5.756 ... 83.982]     degrees_north
    # 
    # Attributes:
    # (omitted for brevity)
    

    The principal problem is that the data is in a so-called rotated_pole grid. While the axes are in units of "degrees", this is on an angle to the standard latitude-longitude grid and the values are off, as you have already observed. On top of that, the data is organised differently from what R expects, so if you quickly plot some data (using the terra package here) you'll get some surprising results:

    data <- tg[,,1,1] # Full geographical extent, first time slice, first realization
    r <- terra::rast(data)
    plot(r)
    

    enter image description here

    Apart from North America lying on its side, the distortion in Nunavut should be immediately clear to you. You need to correct for this problem with the rotation and the distortion, which you can do with ncdfCF while subsetting the data to your area of interest:

    # Subset data for Québec, all time steps, one realization
    data <- tg$subset(list(X = -79:-56, Y = 44:63, realization = "TaiESM1:r1i1p1f1"))
    r <- data$terra() # Convert the ncdfCF object to a terra:SpatRaster
    plot(r)
    

    enter image description here

    And there you are. You can also do data$array() to get the values as an array, properly oriented and with dimension names.

    ncdfCF being a very new package, you might run into trouble. If so, open an issue.