python, dask, python-xarray, dask-ml

How to convert multiple 2D arrays to 1D columns using xarray and dask in python?


I have 7 2D cloud optimised GeoTIFFs stacked into one DataArray in xarray. They are very large, so I am using the intake-xarray plugin and dask to stream the data from S3 without loading it all into RAM. I have concatenated them along their "band" dimension to stack them.

import intake
import xarray as xr

# open the intake catalog and lazily load each dataset as a dask-backed DataArray
catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)


merged = xr.concat(to_concat, dim='band')
merged.coords['band'] = datasets  # add labels to band dimension

y_array = catalog["NASA_global"]
y_array.coords['band'] = 'NASA_global'

merged 
<xarray.DataArray (band: 7, y: 225000, x: 450000)>
dask.array<concatenate, shape=(7, 225000, 450000), dtype=float32, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U31 'dem' ... 'worldcover'
  * y        (y) float64 90.0 90.0 90.0 90.0 90.0 ... -90.0 -90.0 -90.0 -90.0
  * x        (x) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 180.0
Attributes:
    transform:      (0.0008, 0.0, -180.0, 0.0, -0.0008, 90.0)
    crs:            +init=epsg:4326
    res:            (0.0008, 0.0008)
    is_tiled:       1
    nodatavals:     (32767.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area

My question is: how can I now convert the data into several 1D columns, equivalent to flattening a 2D array in numpy? I have looked at .squeeze() to remove dimensions but can't get it into the desired format. I want to do some machine learning and need the data in a suitable shape. I'm new to dask and xarray.
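For reference, the layout I'm after is what plain NumPy would give by flattening each band and using the bands as columns (a toy sketch with made-up shapes, not the real data):

import numpy as np

# toy stand-in for the real (band, y, x) stack: 7 bands of 4 x 5 pixels
stack = np.random.rand(7, 4, 5)

# flatten the spatial dimensions, then transpose so each pixel is a row
table = stack.reshape(7, -1).T
print(table.shape)  # (20, 7) -> one row per pixel, one column per band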

I'd really appreciate any help or advice.


Solution

  • For anyone interested, I worked out how to do it in Xarray but it blew up the memory of my instance.

    # load the intake catalog
    catalog = intake.open_catalog("s3://example-data/datasets.yml")
    datasets = ['dem',
                'dem_slope',
                'dem_slope_aspect',
                'distance_railways',
                'distance_river',
                'distance_road',
                'worldcover']
    to_concat = []
    for data in datasets:
        x = catalog[data].to_dask()
        to_concat.append(x)
    
    # define X and y
    merged = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
    merged.coords['band'] = datasets
    y_array = catalog["NASA_global"].to_dask()
    y_array.coords['band'] = 'NASA_global'
    
    # reshape: collapse x and y into a single 'z' dimension, then put bands as columns
    X_temp = merged.stack(z=('x', 'y'))
    X = X_temp.transpose('z', 'band')
    X_array = X.values  # forces everything into one in-memory NumPy array
    

    Calling .values loads everything into an in-memory NumPy array and kills the instance. A colleague worked out a better solution that doesn't eat memory:

    import intake
    import xarray as xr
    import dask.dataframe as dd
    
    catalog = intake.open_catalog("s3://example-data/datasets.yml")
    datasets = ['dem',
                'dem_slope',
                'dem_slope_aspect',
                'distance_railways',
                'distance_river',
                'distance_road',
                'worldcover']
    to_concat = []
    for data in datasets:
        x = catalog[data].to_dask()
        to_concat.append(x)
    
    # define X and y
    X_array = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
    X_array.coords['band'] = datasets
    y_array = catalog["NASA_global"].to_dask()
    
    # reshape lazily on the underlying dask arrays: flatten the spatial dimensions
    X_table = X_array.data.reshape((7, -1), merge_chunks=True)
    y_table = y_array.data.reshape((1, -1), merge_chunks=True)
    
    # transpose so each pixel is a row, then wrap as dask dataframes with named columns
    X = dd.from_dask_array(X_table.T, columns=datasets)
    y = dd.from_dask_array(y_table.T, columns=['NASA_global'])
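    
    The reshape here stays lazy because it works on the underlying dask arrays rather than pulling them into NumPy. The resulting X and y are dask dataframes, so they can go straight into dask-ml. A minimal sketch of that last step, assuming dask-ml is installed and that nodata pixels have already been masked or dropped:
    
    from dask_ml.linear_model import LinearRegression
    from dask_ml.model_selection import train_test_split
    
    # dask-ml estimators want dask arrays with known chunk sizes
    X_arr = X.to_dask_array(lengths=True)
    y_arr = y['NASA_global'].to_dask_array(lengths=True)
    
    X_train, X_test, y_train, y_test = train_test_split(X_arr, y_arr)
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)  # still a lazy dask array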