I'm using the xarray package to load and access data of large GeoTIFF files (>50GB) and it is working flawless.
import xarray as xr
img = xr.open_rasterio("path/to/large_geo_tiff.tif")
pixel_value = img[0,1225, 4321]
print("The pixel value is: ", pixel_value.values.item())
However, I was wondering how xarray is actually loading large GeoTIFF files. Obviously it is not loading the whole file into memory since it would not fit but performs some kind of lazy loading. I'm only used to dask and Dask Arrays which split the data into chunks and allows for easy pixel-access by just loading the corresponding chunk into memory. But the signature of the function open_rasterio looks like the following
xarray.open_rasterio(filename, parse_coordinates=None, chunks=None, cache=None, lock=None)
Since I didn't define chunks the img should not be chunked. So my question is what happens when calling pixel_value = img[0,1225, 4321]
and how can xarray access the pixel value at the given location so fast?
I look forward to any feedback.
It seems like dask_image is not so smart at reading tiff files in contrast to xarray
The provided GeoTIFF is not tiled but I just checked the tiff file format which looks like the following
TIFF File: '/Users/jan/Desktop/ms-project/processed/metashape/export/untiled.tif'
Mode: 'r'
Current Image Directory: 1
Number Of Strips: 1336
SubFileType: Tiff.SubFileType.Default
Photometric: Tiff.Photometric.RGB
ImageLength: 42746
ImageWidth: 35366
RowsPerStrip: 32
BitsPerSample: 8
Compression: Tiff.Compression.LZW
SampleFormat: Tiff.SampleFormat.UInt
SamplesPerPixel: 3
PlanarConfiguration: Tiff.PlanarConfiguration.Chunky
Orientation: Tiff.Orientation.TopLeft
the important part is that the tif is stripped, where each strip is an individual collection of contiguous row of bitmapped image data which makes random access much easier and xarray seems to take advantage of the stripped image by just loading the required data - I just checked this by loading the GeoTIFF calling the backend module used by xarray.open_rasterio called rasterio
>>> import rasterio
>>> ortho = rasterio.open("./orthomosaic.tif")
>>> ortho.is_tiled
False
>>> ortho.block_shapes
[(32, 248489), (32, 248489), (32, 248489)]
dask_image on the other hand does not take advantage of those stripes but loads the whole image into one chunk therefore running out out memory - even if you write a tiled tif file dask_image will not consider this
>>> import dask_image.imread
>>> ortho = dask_image.imread.imread("./orthomosaic.tif")
>>> ortho
dask.array<from-value, shape=(1, 104700, 248489, 3), dtype=uint8, chunksize=(1, 104700, 248489, 3), chunktype=numpy.ndarray>