[SOLVED] Easy, scriptable way to sub-sample unstructured THREDDS data?

Easy, scriptable way to sub-sample unstructured THREDDS data?

I'm trying to get a subset of data from a triangular mesh model that is being served by THREDDS. I'd like to be able to specify a LAT/LON bounding box and just get the data from within that box. The Data URL is:

http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_MET_FORECAST.nc

With gridded data it is pretty easy to subset the data from a THREDDS server. Does anyone know what the best way is to get a sub-domain of a triangular mesh that is served by THREDDS?

For gridded data I use Ferret as my OPeNDAP client and I'm able to script the download process. I'd like to do something similar here although I could use Matlab, Python or other tools.

Thanks,

Steve

Solution

Opendap and NetCDF don't allow extraction using an irregular indexing. You can only request a start, stop and stride.

And because this is a triangular grid, there is no guarantee that nodes of the triangles in the same region have similar indices. So if you want to get only those nodes within a bounding box, you would have to request them one-by-one. And that is slow. So in many cases it's faster to determine the minimum and maximum indices and request that whole chunk in one piece, then extract the indices as needed.

Here's a sample comparison of the two approaches in python. In this example, extracting the subset which includes all the indices is about 10 times faster than looping over each index, extracting time series:

import netCDF4
import time
import numpy as np

url='http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_GOM3_FORECAST.nc'
nc = netCDF4.Dataset(url)
ncv = nc.variables
lon = ncv['lon'][:]
lat = ncv['lat'][:]
tim = ncv['time'][:]

# find indices inside box
box = [-71.4,41,-70.2,41.5]
ii = (lon>=box[0])&(lon<=box[2])&(lat>=box[1])&(lat<=box[3])
# jj will have just indices from inside the box:
jj = np.where(ii)[0]

If we loop over each index, it's slow:

# loop over indices, extracting time series
time0=time.time()
zi = np.zeros((len(tim),len(jj)))
k=0
for j in jj:
    zi[:,k]=ncv['zeta'][:,j]
    k +=1
print('elapsed time: %d seconds' % (time.time()-time0))

elapsed time: 56 seconds

But if we loop over the range at each time step, it's much faster:

time0=time.time()
zi2 = np.zeros((len(tim),len(jj)))
jmin=jj.min()
jmax=jj.max()

for i in range(len(tim)):
    ztmp = ncv['zeta'][i,jmin:jmax+1]
    zi2[i,:] = ztmp[jj-jmin]
print('elapsed time: %d seconds' % (time.time()-time0))

elapsed time: 6 seconds

Of course, you results may vary depending on the size of the unstructured grid, the proximity of the points in your subset, the number of points you are extracting, etc. But hopefully this gives you an idea of the issues involved.