I'm working in simplifying zip codes polygons in a MYSQL database (v 8.0), I'm reducing the number of coordinates for each polygon.
So, I have a table named zip_city
, which contains a column named boundary
, which is the original multipolygon column, and I created another one with the simplified polygons, boundary_simplified
. Both of them have SRID 4326 (I've included the is_point
column because it might be important):
+---------------------+--------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------------------------+------+-----+---------+----------------+
| boundary | multipolygon | NO | MUL | NULL | |
| is_point | tinyint unsigned | NO | MUL | 0 | |
| boundary_simplified | multipolygon | NO | MUL | NULL | |
+---------------------+--------------------------------+------+-----+---------+----------------+
Running a SHOW INDEXES, I have this:
mysql> SHOW INDEXES FROM zip_city;
+----------+------------+---------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+---------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| zip_city | 1 | idx_is_point | 1 | is_point | A | 2 | NULL | NULL | | BTREE | | | YES | NULL |
| zip_city | 1 | boundary | 1 | boundary | A | 34287 | 32 | NULL | | SPATIAL | | | YES | NULL |
| zip_city | 1 | boundary_simplified | 1 | boundary_simplified | A | 34287 | 32 | NULL | | SPATIAL | | | YES | NULL |
+----------+------------+---------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
which looks exactly the same, but when I try to run a query using st_contains
, it does not work the same for them, for example:
mysql> SELECT zip FROM zip_city
WHERE
ST_CONTAINS(boundary, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
AND is_point = 0 LIMIT 1;
+-------+
| zip |
+-------+
| 99901 |
+-------+
1 row in set (0.03 sec)
mysql> SELECT zip FROM zip_city
WHERE
ST_CONTAINS(boundary_simplified, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
AND
is_point = 0 LIMIT 1;
+-------+
| zip |
+-------+
| 99901 |
+-------+
1 row in set (4.84 sec)
And when I explain both queries, I see that the one using boundary_simplified is not using the index:
mysql> EXPLAIN SELECT zip FROM zip_city
WHERE
ST_CONTAINS(boundary, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
AND
is_point = 0 LIMIT 1;
+----+-------------+----------+------------+-------+-----------------------+----------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+-----------------------+----------+---------+------+------+----------+-------------+
| 1 | SIMPLE | zip_city | NULL | range | idx_is_point,boundary | boundary | 34 | NULL | 1 | 50.00 | Using where |
+----+-------------+----------+------------+-------+-----------------------+----------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT zip FROM zip_city
WHERE
ST_CONTAINS(boundary_simplified, ST_GeomFromGeoJSON('{"type": "Point", "coordinates": [-131.64, 55.34]}'))
+----+-------------+----------+------------+------+---------------+--------------+---------+-------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+--------------+---------+-------+-------+----------+-------------+
| 1 | SIMPLE | zip_city | NULL | ref | idx_is_point | idx_is_point | 1 | const | 17143 | 100.00 | Using where |
+----+-------------+----------+------------+------+---------------+--------------+---------+-------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Any clue on this? I feel like I'm missing something simple but I cannot find information about this. Also when creating the index, for the boundary
column takes ~23.25 sec and for the boundary_simplified
it takes only ~0.75 sec (which is weird. Do the coordinates affect the efficiency of the index?)
I've tried deleting both indexes and creating them separately, I tested the behavior w/o the index which changed of course, I've tried to use FORCE INDEX or USE INDEX inside the query which resulted in same/worse behavior.
EDIT: I fixed the indexes shown thanks to user1191247 observation. Also, I'm not showing the full table information as it is useless.
Thanks to user1191247 comment, I looked for the information he asked and I've found this:
| zip_city | CREATE TABLE `zip_city` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`state_id` int unsigned NOT NULL,
`zip` mediumint(5) unsigned zerofill NOT NULL,
`city` varchar(64) NOT NULL,
`slug` varchar(64) NOT NULL,
`location` point NOT NULL /*!80003 SRID 4326 */,
`boundary` multipolygon NOT NULL /*!80003 SRID 4326 */,
`is_point` tinyint unsigned NOT NULL DEFAULT '0',
`fit_market` tinyint unsigned NOT NULL DEFAULT '0',
`boundary_simplified` multipolygon NOT NULL,
PRIMARY KEY (`id`),
KEY `fk_zip_to_city_state1_idx` (`state_id`),
KEY `idx_zip` (`zip`),
KEY `idx_slug` (`slug`),
KEY `idx_city` (`city`),
SPATIAL KEY `idx_location` (`location`),
SPATIAL KEY `boundary` (`boundary`),
SPATIAL KEY `boundary_simplified` (`boundary_simplified`),
CONSTRAINT `fk_zip_to_city_state1` FOREIGN KEY (`state_id`) REFERENCES `state` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=41381 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
where, as you can see, boundary_simplified
is missing the SRID definition, which is crucial for the index to work properly (with SELECT DISTINCT ST_SRID(boundary_simplified) FROM zip_city;
I had obtained the SRID 4326, so I didn't think this was the issue, but it was missing on the column definition). I've solved this by running these queries:
DROP INDEX boundary_simplified ON zip_city;
ALTER TABLE zip_city MODIFY COLUMN boundary_simplified MULTIPOLYGON NOT NULL SRID 4326;
(took ~53 sec)
ALTER TABLE zip_city ADD SPATIAL INDEX idx_boundary_simplified (boundary_simplified);
(now it took ~24 sec which was already good news)
Then the INDEX worked perfectly :)