mysqlindexingspatialspatial-queryspatial-index

Two similiar SPATIAL columns with the same INDEX working differently


I'm working in simplifying zip codes polygons in a MYSQL database (v 8.0), I'm reducing the number of coordinates for each polygon.

So, I have a table named zip_city, which contains a column named boundary, which is the original multipolygon column, and I created another one with the simplified polygons, boundary_simplified. Both of them have SRID 4326 (I've included the is_point column because it might be important):

+---------------------+--------------------------------+------+-----+---------+----------------+
| Field               | Type                           | Null | Key | Default | Extra          |
+---------------------+--------------------------------+------+-----+---------+----------------+

| boundary            | multipolygon                   | NO   | MUL | NULL    |                |
| is_point            | tinyint unsigned               | NO   | MUL | 0       |                |
| boundary_simplified | multipolygon                   | NO   | MUL | NULL    |                |
+---------------------+--------------------------------+------+-----+---------+----------------+

Running a SHOW INDEXES, I have this:

mysql> SHOW INDEXES FROM zip_city;
+----------+------------+---------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table    | Non_unique | Key_name                  | Seq_in_index | Column_name         | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+---------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| zip_city |          1 | idx_is_point              |            1 | is_point            | A         |           2 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| zip_city |          1 | boundary                  |            1 | boundary            | A         |       34287 |       32 |   NULL |      | SPATIAL    |         |               | YES     | NULL       |
| zip_city |          1 | boundary_simplified       |            1 | boundary_simplified | A         |       34287 |       32 |   NULL |      | SPATIAL    |         |               | YES     | NULL       |
+----------+------------+---------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

which looks exactly the same, but when I try to run a query using st_contains, it does not work the same for them, for example:

mysql> SELECT zip      FROM zip_city
      WHERE
          ST_CONTAINS(boundary, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
 AND          is_point = 0      LIMIT 1;
+-------+
| zip   |
+-------+
| 99901 |
+-------+
1 row in set (0.03 sec)
mysql> SELECT zip      FROM zip_city
      WHERE
          ST_CONTAINS(boundary_simplified, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
 AND
         is_point = 0      LIMIT 1;
+-------+
| zip   |
+-------+
| 99901 |
+-------+
1 row in set (4.84 sec)

And when I explain both queries, I see that the one using boundary_simplified is not using the index:

mysql> EXPLAIN SELECT zip      FROM zip_city
      WHERE
          ST_CONTAINS(boundary, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
 AND
      is_point = 0      LIMIT 1;
+----+-------------+----------+------------+-------+-----------------------+----------+---------+------+------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys         | key      | key_len | ref  | rows | filtered | Extra       |
+----+-------------+----------+------------+-------+-----------------------+----------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | zip_city | NULL       | range | idx_is_point,boundary | boundary | 34      | NULL |    1 |    50.00 | Using where |
+----+-------------+----------+------------+-------+-----------------------+----------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT zip      FROM zip_city
      WHERE
          ST_CONTAINS(boundary_simplified, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}')) 
+----+-------------+----------+------------+------+---------------+--------------+---------+-------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key          | key_len | ref   | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+--------------+---------+-------+-------+----------+-------------+
|  1 | SIMPLE      | zip_city | NULL       | ref  | idx_is_point  | idx_is_point | 1       | const | 17143 |   100.00 | Using where |
+----+-------------+----------+------------+------+---------------+--------------+---------+-------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

Any clue on this? I feel like I'm missing something simple but I cannot find information about this. Also when creating the index, for the boundary column takes ~23.25 sec and for the boundary_simplified it takes only ~0.75 sec (which is weird. Do the coordinates affect the efficiency of the index?)

I've tried deleting both indexes and creating them separately, I tested the behavior w/o the index which changed of course, I've tried to use FORCE INDEX or USE INDEX inside the query which resulted in same/worse behavior.

EDIT: I fixed the indexes shown thanks to user1191247 observation. Also, I'm not showing the full table information as it is useless.


Solution

  • Thanks to user1191247 comment, I looked for the information he asked and I've found this:

    | zip_city | CREATE TABLE `zip_city` (
      `id` int unsigned NOT NULL AUTO_INCREMENT,
      `state_id` int unsigned NOT NULL,
      `zip` mediumint(5) unsigned zerofill NOT NULL,
      `city` varchar(64) NOT NULL,
      `slug` varchar(64) NOT NULL,
      `location` point NOT NULL /*!80003 SRID 4326 */,
      `boundary` multipolygon NOT NULL /*!80003 SRID 4326 */,
      `is_point` tinyint unsigned NOT NULL DEFAULT '0',
      `fit_market` tinyint unsigned NOT NULL DEFAULT '0',
      `boundary_simplified` multipolygon NOT NULL,
      PRIMARY KEY (`id`),
      KEY `fk_zip_to_city_state1_idx` (`state_id`),
      KEY `idx_zip` (`zip`),
      KEY `idx_slug` (`slug`),
      KEY `idx_city` (`city`),
      SPATIAL KEY `idx_location` (`location`),
      SPATIAL KEY `boundary` (`boundary`),
      SPATIAL KEY `boundary_simplified` (`boundary_simplified`),
      CONSTRAINT `fk_zip_to_city_state1` FOREIGN KEY (`state_id`) REFERENCES `state` (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=41381 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
    

    where, as you can see, boundary_simplified is missing the SRID definition, which is crucial for the index to work properly (with SELECT DISTINCT ST_SRID(boundary_simplified) FROM zip_city; I had obtained the SRID 4326, so I didn't think this was the issue, but it was missing on the column definition). I've solved this by running these queries:

    DROP INDEX boundary_simplified ON zip_city;
    
    ALTER TABLE zip_city MODIFY COLUMN boundary_simplified MULTIPOLYGON NOT NULL SRID 4326; 
    

    (took ~53 sec)

    ALTER TABLE zip_city ADD SPATIAL INDEX idx_boundary_simplified (boundary_simplified); 
    

    (now it took ~24 sec which was already good news)

    Then the INDEX worked perfectly :)