I have a large (250 GB) BigQuery table representing a grid of values, with the grid cells stored as polygons. The table is clustered by this geography column. I have a second table with various polygons, and I want to get the grid-cell values that fall within one of those polygons (for example, querying weather data within the boundaries of a specific country).
When I perform a spatial intersection of the grid table with a single polygon written inline using ST_GEOGFROMGEOJSON, the query cost (bytes processed) is low (about 100 MB). But when I supply the same polygon through a subquery, the table is fully scanned and the cost is high (250 GB).
Here is a reproducible example. Querying the following public dataset with an inline polygon:
SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF`
WHERE year=2020
AND ST_INTERSECTS(geography_polygon, ST_GEOGFROMGEOJSON('{ "type": "Polygon", "coordinates": [ [ [ -101.1, 50 ], [ -101, 50 ], [ -101, 56 ], [ -101.1, 56 ], [ -101.1, 50 ] ] ] }'))
-- Bytes processed: 785.45 MB; bytes billed: 786 MB
And with the same polygon stored in a table and supplied through a subquery:
CREATE OR REPLACE TABLE `tmp.polygon`
AS SELECT ST_GEOGFROMGEOJSON('{ "type": "Polygon", "coordinates": [ [ [ -101.1, 50 ], [ -101, 50 ], [ -101, 56 ], [ -101.1, 56 ], [ -101.1, 50 ] ] ] }') geog;
SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF`
WHERE year=2020
AND ST_INTERSECTS(geography_polygon, (SELECT geog FROM `tmp.polygon`))
-- Bytes processed: 522.15 GB; bytes billed: 522.15 GB
I get the same result using a CROSS JOIN or an INNER JOIN.
Why doesn't clustering work with subqueries? Is there a way to decrease the query cost without adding custom clustering/partitioning columns?
BigQuery implements this clustering optimization (cluster pruning) only for a constant geography in the filter; a subquery is not treated as a constant, so the whole table is scanned.
A workaround is to split the query into two parts using BigQuery procedural language (https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language): first read the polygon into a variable, then use the variable in the filter.
DECLARE poly GEOGRAPHY;
SET poly = (SELECT geog FROM `tmp.polygon`);
SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF`
WHERE year=2020
AND ST_INTERSECTS(geography_polygon, poly);
This bills 786 MB.
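The same pattern extends to the original use case of looking up a boundary by attribute. A sketch, assuming a hypothetical boundaries table `tmp.countries` with columns `name` and `geog` (these names are not from the original post):

```sql
-- Read the desired boundary into a variable, so the filter sees a constant.
DECLARE poly GEOGRAPHY;
SET poly = (SELECT geog FROM `tmp.countries` WHERE name = 'Canada');

SELECT npp
FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF`
WHERE year = 2020
  AND ST_INTERSECTS(geography_polygon, poly);
```

Because `poly` is a scalar variable at query time, clustering pruning can apply just as with the inline GeoJSON constant.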
Update: if the polygon is too complex and causes a query evaluation limit
error, store only its bounding box in the variable, and post-filter using the actual polygon:
DECLARE box STRUCT<xmin FLOAT64, ymin FLOAT64, xmax FLOAT64, ymax FLOAT64>;
SET box = (SELECT ST_BOUNDINGBOX(geog) FROM `tmp.polygon`);
SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF`
WHERE year=2020
AND ST_INTERSECTS(geography_polygon, (SELECT geog FROM `tmp.polygon`))
AND ST_INTERSECTSBOX(geography_polygon, box.xmin, box.ymin, box.xmax, box.ymax);
This also bills 786 MB, although in rare cases it might bill more, since the bounding-box filter is laxer than the exact polygon.