I have the following database (simplified):
CREATE TABLE `tracking` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`manufacture` varchar(100) NOT NULL,
`date_last_activity` datetime NOT NULL,
`date_created` datetime NOT NULL,
`date_updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `manufacture` (`manufacture`),
KEY `manufacture_date_last_activity` (`manufacture`, `date_last_activity`),
KEY `date_last_activity` (`date_last_activity`),
) ENGINE=InnoDB AUTO_INCREMENT=401353 DEFAULT CHARSET=utf8
CREATE TABLE `tracking_items` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tracking_id` int(11) NOT NULL,
`tracking_object_id` varchar(100) NOT NULL,
`tracking_type` int(11) NOT NULL COMMENT 'Its used to specify the type of each item, e.g. car, bike, etc',
`date_created` datetime NOT NULL,
`date_updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `tracking_id` (`tracking_id`),
KEY `tracking_object_id` (`tracking_object_id`),
KEY `tracking_id_tracking_object_id` (`tracking_id`,`tracking_object_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1299995 DEFAULT CHARSET=utf8
CREATE TABLE `cars` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`car_id` varchar(255) NOT NULL COMMENT 'It must be VARCHAR, because the data is coming from external source.',
`manufacture` varchar(255) NOT NULL,
`car_text` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`date_order` datetime NOT NULL,
`date_created` datetime NOT NULL,
`date_updated` datetime NOT NULL,
`deleted` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `car_id` (`car_id`),
KEY `sort_field` (`date_order`)
) ENGINE=InnoDB AUTO_INCREMENT=150000025 DEFAULT CHARSET=utf8
This is my "problematic" query, that runs extremely slow.
SELECT sql_no_cache `t`.*,
count(`t`.`id`) AS `cnt_filtered_items`
FROM `tracking` AS `t`
INNER JOIN `tracking_items` AS `ti` ON (`ti`.`tracking_id` = `t`.`id`)
LEFT JOIN `cars` AS `c` ON (`c`.`car_id` = `ti`.`tracking_object_id`
AND `ti`.`tracking_type` = 1)
LEFT JOIN `bikes` AS `b` ON (`b`.`bike_id` = `ti`.`tracking_object_id`
AND `ti`.`tracking_type` = 2)
LEFT JOIN `trucks` AS `tr` ON (`tr`.`truck_id` = `ti`.`tracking_object_id`
AND `ti`.`tracking_type` = 3)
WHERE (`t`.`manufacture` IN('1256703406078',
'9600048390403',
'1533405067830'))
AND (`c`.`car_text` LIKE '%europe%'
OR `b`.`bike_text` LIKE '%europe%'
OR `tr`.`truck_text` LIKE '%europe%')
GROUP BY `t`.`id`
ORDER BY `t`.`date_last_activity` ASC,
`t`.`id` ASC
LIMIT 15
This is the result of EXPLAIN
for above query:
+----+-------------+-------+--------+-----------------------------------------------------------------------+-------------+---------+-----------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
+----+-------------+-------+--------+-----------------------------------------------------------------------+-------------+---------+-----------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | t | index | PRIMARY,manufacture,manufacture_date_last_activity,date_last_activity | PRIMARY | 4 | NULL | 400,000 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | ti | ref | tracking_id,tracking_object_id,tracking_id_tracking_object_id | tracking_id | 4 | table.t.id | 1 | NULL |
| 1 | SIMPLE | c | eq_ref | car_id | car_id | 767 | table.ti.tracking_object_id | 1 | Using where |
| 1 | SIMPLE | b | eq_ref | bike_id | bike_id | 767 | table.ti.tracking_object_id | 1 | Using where |
| 1 | SIMPLE | t | eq_ref | truck_id | truck_id | 767 | table.ti.tracking_object_id | 1 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------------------------+-------------+---------+-----------------------------+---------+----------------------------------------------+
What is the problem this query is trying to solve?
Basically, I need to find all records in tracking
table that may be associated with records in tracking_items
(1:n) where each record in tracking_items
may be associated with record in left joined tables. The filtering criteria is crucial part in the query.
What is the problem I have with the query above?
When there's order by
and group by
clauses the query runs extremely slow, e.g. 10-15 seconds to complete for the above configuration. However, if I omit any of these clauses, the query is running pretty quick (~0.2 seconds).
What I've already tried?
FULLTEXT
index, but it didn't help much, as the results evaluated by the LIKE
statemenet are narrowed by the JOINs
using indexes.WHERE EXISTS (...)
to find if there are records in left
joined tables, but unfortunately without any luck.Few notes about relations between these tables:
tracking -> tracking_items (1:n)
tracking_items -> cars (1:1)
tracking_items -> bikes (1:1)
tracking_items -> trucks (1:1)
So, I'm looking for a way to optimize that query.
Bill Karwin suggests the query might perform better if it used an index with a leading column of manufacture
. I second that suggestion. Especially if that's very selective.
I also note that we're doing a GROUP BY t.id
, where id
is the PRIMARY KEY of the table.
No columns from any tables other than tracking
are referenced in the SELECT
list.
This suggests we're really only interested in returning rows from t
, and not on creating duplicates due to multiple outer joins.
Seems like the COUNT()
aggregate has the potential to return an inflated count, if there are multiple matching rows in tracking_item
and bikes
,cars
,trucks
. If there's three matching rows from cars, and four matching rows from bikes, ... the COUNT() aggregate is going to return a value of 12, rather than 7. (Or maybe there is some guarantee in the data such that there won't ever be multiple matching rows.)
If the manufacture
is very selective, and that returns a reasonably small set of rows from tracking
, if the query can make use of an index ...
And since we aren't returning any columns from any tables other than tracking
, apart from a count or related items ...
I would be tempted to test correlated subqueries in the SELECT list, to get the count, and filter out the zero count rows using a HAVING clause.
Something like this:
SELECT SQL_NO_CACHE `t`.*
, ( ( SELECT COUNT(1)
FROM `tracking_items` `tic`
JOIN `cars` `c`
ON `c`.`car_id` = `tic`.`tracking_object_id`
AND `c`.`car_text` LIKE '%europe%'
WHERE `tic`.`tracking_id` = `t`.`id`
AND `tic`.`tracking_type` = 1
)
+ ( SELECT COUNT(1)
FROM `tracking_items` `tib`
JOIN `bikes` `b`
ON `b`.`bike_id` = `tib`.`tracking_object_id`
AND `b`.`bike_text` LIKE '%europe%'
WHERE `tib`.`tracking_id` = `t`.`id`
AND `tib`.`tracking_type` = 2
)
+ ( SELECT COUNT(1)
FROM `tracking_items` `tit`
JOIN `trucks` `tr`
ON `tr`.`truck_id` = `tit`.`tracking_object_id`
AND `tr`.`truck_text` LIKE '%europe%'
WHERE `tit`.`tracking_id` = `t`.`id`
AND `tit`.`tracking_type` = 3
)
) AS cnt_filtered_items
FROM `tracking` `t`
WHERE `t`.`manufacture` IN ('1256703406078', '9600048390403', '1533405067830')
HAVING cnt_filtered_items > 0
ORDER
BY `t`.`date_last_activity` ASC
, `t`.`id` ASC
We'd expect that the query could make effective use of an index on tracking
with leading column of manufacture
.
And on the tracking_items
table, we want an index with leading columns of type
and tracking_id
. And including tracking_object_id
in that index would mean the query could be satisfied from the index, without visiting the underlying pages.
For the cars
, bikes
and trucks
tables the query should make use of an index with leading column of car_id
, bike_id
, and truck_id
respectively. There's no getting around a scan of the car_text
, bike_text
, truck_text
columns for the matching string... best we can do is narrow down the number rows that need to have that check performed.
This approach (just the tracking
table in the outer query) should eliminate the need for the GROUP BY
, the work required to identify and collapse duplicate rows.
BUT this approach, replacing joins with correlated subqueries, is best suited to queries where there is a SMALL number of rows returned by the outer query. Those subqueries get executed for every row processed by the outer query. It's imperative that those subqueries to have suitable indexes available. Even with those tuned, there is still potential for horrible performance for large sets.
This does still leave us with a "Using filesort" operation for the ORDER BY
.
If the count of related items should be the product of a multiplication, rather than addition, we could tweak the query to achieve that. (We'd have to muck with the return of zeros, and the condition in the HAVING clause would need to be changed.)
If there wasn't a requirement to return a COUNT() of related items, then I would be tempted to move the correlated subqueries from the SELECT list down into EXISTS
predicates in the WHERE
clause.
Additional notes: seconding the comments from Rick James regarding indexing... there appears to be redundant indexes defined. i.e.
KEY `manufacture` (`manufacture`)
KEY `manufacture_date_last_activity` (`manufacture`, `date_last_activity`)
The index on the singleton column isn't necessary, since there is another index that has the column as the leading column.
Any query that can make effective use of the manufacture
index will be able to make effective use of the manufacture_date_last_activity
index. That is to say, the manufacture
index could be dropped.
The same applies for the tracking_items
table, and these two indexes:
KEY `tracking_id` (`tracking_id`)
KEY `tracking_id_tracking_object_id` (`tracking_id`,`tracking_object_id`)
The tracking_id
index could be dropped, since it's redundant.
For the query above, I would suggest adding a covering index:
KEY `tracking_items_IX3` (`tracking_id`,`tracking_type`,`tracking_object_id`)
-or- at a minimum, a non-covering index with those two columns leading:
KEY `tracking_items_IX3` (`tracking_id`,`tracking_type`)