python-xarraygeotiff

How does xarray load and index large GeoTIFF files with open_rasterio?


I'm using the xarray package to load and access data of large GeoTIFF files (>50GB) and it is working flawless.

import xarray as xr
img = xr.open_rasterio("path/to/large_geo_tiff.tif")
pixel_value = img[0,1225, 4321]
print("The pixel value is: ", pixel_value.values.item())

However, I was wondering how xarray is actually loading large GeoTIFF files. Obviously it is not loading the whole file into memory since it would not fit but performs some kind of lazy loading. I'm only used to dask and Dask Arrays which split the data into chunks and allows for easy pixel-access by just loading the corresponding chunk into memory. But the signature of the function open_rasterio looks like the following

xarray.open_rasterio(filename, parse_coordinates=None, chunks=None, cache=None, lock=None)

Since I didn't define chunks the img should not be chunked. So my question is what happens when calling pixel_value = img[0,1225, 4321] and how can xarray access the pixel value at the given location so fast?

I look forward to any feedback.


Solution

  • It seems like dask_image is not so smart at reading tiff files in contrast to xarray

    The provided GeoTIFF is not tiled but I just checked the tiff file format which looks like the following

    TIFF File: '/Users/jan/Desktop/ms-project/processed/metashape/export/untiled.tif'
                           Mode: 'r'
        Current Image Directory: 1
               Number Of Strips: 1336
                    SubFileType: Tiff.SubFileType.Default
                    Photometric: Tiff.Photometric.RGB
                    ImageLength: 42746
                     ImageWidth: 35366
                   RowsPerStrip: 32
                  BitsPerSample: 8
                    Compression: Tiff.Compression.LZW
                   SampleFormat: Tiff.SampleFormat.UInt
                SamplesPerPixel: 3
            PlanarConfiguration: Tiff.PlanarConfiguration.Chunky
                    Orientation: Tiff.Orientation.TopLeft
    

    the important part is that the tif is stripped, where each strip is an individual collection of contiguous row of bitmapped image data which makes random access much easier and xarray seems to take advantage of the stripped image by just loading the required data - I just checked this by loading the GeoTIFF calling the backend module used by xarray.open_rasterio called rasterio

    >>> import rasterio
    >>> ortho = rasterio.open("./orthomosaic.tif")
    >>> ortho.is_tiled
    False
    >>> ortho.block_shapes
    [(32, 248489), (32, 248489), (32, 248489)]
    

    dask_image on the other hand does not take advantage of those stripes but loads the whole image into one chunk therefore running out out memory - even if you write a tiled tif file dask_image will not consider this

    >>> import dask_image.imread
    >>> ortho = dask_image.imread.imread("./orthomosaic.tif")
    >>> ortho
    dask.array<from-value, shape=(1, 104700, 248489, 3), dtype=uint8, chunksize=(1, 104700, 248489, 3), chunktype=numpy.ndarray>