I am running a spatial join between two polygon layers, around 25 million rows against 400 rows. I have checked for invalid geometries and none appear.
I have set the geometry storage to EXTERNAL and created GiST indexes on the geom fields.
Query plan:
Nested Loop Inner Join
  ->  Seq Scan on table2 as att
  ->  Bitmap Heap Scan on table1 as g
        Recheck Cond: (g.geom && att.geom)
        ->  Bitmap Index Scan using table1_geom_idx
              Index Cond: (g.geom && att.geom)
But the performance is really poor: the last attempts ran for more than 20 hours before failing because the server disconnected.
I am working with PostgreSQL 11 with PostGIS. I have tried it both locally and via remote desktop on an AWS instance, and performance is similar in both.
SELECT g.field1,
att.ogc_fid,
st_intersection(g.geom, att.geom) AS intersect_geom,
st_area(g.geom) AS geom_area,
st_area(st_intersection(g.geom, att.geom)) AS intersect_area
FROM table1 g
JOIN table2 att ON g.geom && att.geom;
After looking around I have also tried it with subqueries, but I am not sure whether that will work:
WITH g AS (
SELECT field1,
count(*) AS num_polygons,
table1.geom
FROM table1
GROUP BY table1.field1, table1.geom
)
SELECT g.field1,
att.field2,
st_intersection(g.geom, att.geom) AS intersect_geom,
st_area(g.geom) AS geom_area,
st_area(st_intersection(g.geom, att.geom)) AS intersect_area
FROM g
JOIN table2 att ON g.geom && att.geom
WHERE ST_IsValid(att.geom) AND ST_IsValid(g.geom);
I would like to improve the performance; any help would be appreciated.
This answer assumes that the geometry columns are properly indexed, e.g. using GiST:
CREATE INDEX idx_tb1_geom ON table1 USING gist (geom);
CREATE INDEX idx_tb2_geom ON table2 USING gist (geom);
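If these indexes were created only recently, it is also worth refreshing the planner statistics so the planner can make good use of them (a general housekeeping step, not something specific to your setup):

ANALYZE table1;
ANALYZE table2;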
Your query spatially joins two tables with 400 and 25 million records without any further WHERE clause, which means that, in the worst case, it could return the Cartesian product of both tables. That is a huge result set!
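If you only need pairs whose geometries actually intersect (and not mere bounding-box matches), one way to shrink that result set is to join on ST_Intersects(), which still uses the GiST index for the && test but discards the false positives, and to compute ST_Intersection() only once per pair via a LATERAL subquery. This is just a sketch of that idea, not something I have run against your data:

SELECT g.field1,
       att.ogc_fid,
       i.geom          AS intersect_geom,   -- intersection computed only once
       ST_Area(g.geom) AS geom_area,
       ST_Area(i.geom) AS intersect_area
FROM table1 g
JOIN table2 att
  ON ST_Intersects(g.geom, att.geom)        -- index-backed && plus exact test
CROSS JOIN LATERAL (
  SELECT ST_Intersection(g.geom, att.geom) AS geom
) AS i;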
I reproduced your environment with table1 containing 403 records and table2 containing 5 million records, and it runs at a pretty reasonable speed. To make my point: if we simply count the number of geometries whose bounding boxes overlap (without actually retrieving them), the query runs pretty fast:
EXPLAIN (ANALYSE,COSTS OFF)
SELECT count(att.*)
FROM table1 g
JOIN table2 att ON g.geom && att.geom;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Aggregate (actual time=516.299..516.300 rows=1 loops=1)
-> Nested Loop (actual time=6.118..501.909 rows=281212 loops=1)
-> Seq Scan on table1 g (actual time=0.060..0.312 rows=403 loops=1)
-> Bitmap Heap Scan on table2 att (actual time=0.187..1.198 rows=698 loops=403)
Recheck Cond: (g.geom && geom)
Heap Blocks: exact=279398
-> Bitmap Index Scan on idx_tb2_geom (actual time=0.108..0.108 rows=698 loops=403)
Index Cond: (geom && g.geom)
Planning Time: 0.505 ms
Execution Time: 519.337 ms
Notice that the query matched 281212 geometries in 519 ms, which is far from the worst performance for this amount of data. However, the picture would look a lot different if you decided to actually retrieve those 281212 geometries, as you would then hit other bottlenecks, such as available memory and network speed.
As a rule of thumb: never bring the data to the software; bring the software to the data! In other words, process your data as much as you can inside the database before loading it into your application.
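For instance, if what you ultimately need are per-field1 overlap statistics rather than every single intersection geometry (an assumption on my part), you can let the database return just a small summary instead of hundreds of thousands of geometries:

-- Sketch: aggregate intersection areas per input polygon inside the database
SELECT g.field1,
       ST_Area(g.geom)                                 AS geom_area,
       SUM(ST_Area(ST_Intersection(g.geom, att.geom))) AS total_intersect_area,
       COUNT(*)                                        AS num_overlaps
FROM table1 g
JOIN table2 att
  ON g.geom && att.geom
GROUP BY g.field1, g.geom;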