I have a leading wildcard query I know it is hard to optimize. I know if I use only the trailing wildcard I can do some optimization. But our client wants the leading wildcard because when searching mobile number they always use the last N digits.
The start digits is always 1[3-9]xx
, as the mobile format China is always 1[3-9]x-xxxx-xxxx, so using like 136%
will get too many results and is not that useful.
select id from customer where seid = 134 and telephone like '%18749618910%'
CREATE TABLE `customer` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`seid` bigint(20) unsigned NOT NULL DEFAULT '0' ,
`ccgeid` bigint(20) unsigned NOT NULL DEFAULT '0',
`cid` bigint(20) unsigned NOT NULL DEFAULT '0' /* foreign key */
`telephone` varchar(63) COLLATE utf8mb4_unicode_ci NOT NULL
`create_time` bigint(20) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`,`seid`),
KEY `s_telephone` (`seid`,`telephone`)
UNIQUE KEY `seid_ccgeid_phone` (`seid`,`ccgeid`,`telephone`),
KEY `s_cid` (`seid`,`cid`),
)
Resorting to FTS is the long term goal. But since I only need id
here I did some experiment to see if I improve some query performance. But I was surprised to see select * ...
is faster than select id ...
. I have around 2.5 millions records, select *
alway takes about 1 second while select id
takes about 4 to 5 seconds.
The explain Extra information shows select id
Using where while select *
Using index condition, both with same possible_keys.
But as Bill Karwin explained in What does MySQl explain Extra "Using where" really mean?, Using where does not help, while I can't find a clear explanation for Using index condition
(mysql document does not help).
So why select * is much faster than select id in my case ? Since I need id
in the end how do I utilize that to optimize my query ?
Based on the comments I got from @Amadan ("all you need to trigger the ICP optimisation is to return any one column that is not found in the index"), I did some further test to confirm his hypothesis (although I have not fully got it) that select id, create_time
is as fast as select *
because they both use Using index condition while select id, ccgeid
is as slow as select id
as they both use Using where.
PS. the explain details:
explain select id from customer where seid = 134 and telephone like '%18749618910%'
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: customer
partitions: p6
type: ref
possible_keys: seid_ccgeid_phone,s_cid,s_telephone
key: s_cid
key_len: 8
ref: const
rows: 1278146
filtered: 11.11
Extra: Using where
explain select * from customer where seid = 134 and telephone like '%18749618910%'
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: customer
partitions: p6
type: ref
possible_keys: seid_ccgeid_phone,s_cid,s_telephone
key: seid_ccgeid_phone
key_len: 8
ref: const
rows: 1278146
filtered: 11.11
Extra: Using index condition
--- update ---
According to @RickJames suggestion I added FORCE INDEX(s_telephone)
to my query and following are the result:
For select *
, with or without FORCE INDEX(s_telephone)
there is no obvious performance difference and they both use Using index condition
For select id
with FORCE INDEX(s_telephone)
there is obvious performance improvement from 4 seconds to around 1.5 seconds, but still slower than select *
. It use Using where; Using index
now.
I'll try to collate everything I wrote in comments.
LIKE '%...%'
criteria cannot use indices to directly find the rows; LIKE '...%'
criteria can.
This means, telephone like '%18749618910%'
cannot directly utilise any of the indices to speed up the search.
This means, the only column that can be found using index is seid
; since indices must be used left-to-right, three indices are equally eligible (i.e. for the purposes of your query, any of the indices starting with seid
can be used, since none of the other index fields are useful).
This means, your query cannot just use the index to find your data; it can filter on seid
using the index, but then has to search the found rows one by one to find which ones match the telephone = '%...%'
pattern.
MySQL can do this in two ways. One is the naive way that does just that: filters the rows using the index, then filters the resultant rows again by inspecting each row. This is Using where
strategy.
The other strategy is the procedure called Index Condition Pushdown Optimization. It speeds up this process in certain situations. Here is the description from that page:
Index Condition Pushdown (ICP) is an optimization for the case where MySQL retrieves rows from a table using an index. Without ICP, the storage engine traverses the index to locate rows in the base table and returns them to the MySQL server which evaluates the
WHERE
condition for the rows. With ICP enabled, and if parts of theWHERE
condition can be evaluated by using only columns from the index, the MySQL server pushes this part of theWHERE
condition down to the storage engine. The storage engine then evaluates the pushed index condition by using the index entry and only if this is satisfied is the row read from the table. ICP can reduce the number of times the storage engine must access the base table and the number of times the MySQL server must access the storage engine.
So: when ICP is not active, MySQL server asks the storage engine looks into the index, finds all rows where seid = 134
, gives them to the server, which then asks the engine for each of those rows in order to be able to confirm telephone LIKE '%18749618910%'
When ICP is active, MySQL server asks the storage engine to look at the index to find the rows where seid = 134
, but also asks it to confirm the telephone
criterion, because the index also contains the telephone
data. It cannot be used for searching, but it can be evaluated; so the storage engine doesn't need to load the table row, and can evaluate whether or not telephone LIKE '%18749618910%'
just based on the information found in the index entries of the seid = 134
rows. This cuts back on the back-and-forth communication between the MySQL server and the storage engine.
The ICP optimisation page specifies the cases where it can be employed:
"ICP is used for the
range
,ref
,eq_ref
, andref_or_null
access methods when there is a need to access full table rows.
So it seems that we have three possible cases here:
If all the criteria can be found using the index, the Using index
strategy is optimal. This is not your case (because of non-prefix LIKE
).
If some of the criteria can be found using the index, and some other criteria can be evaluated from the index, and there is a need to access full table rows, ICP (Using index condition
) can be used. This is your second case (since *
and create_time
need information that cannot be found in the index, and the full table row is required).
If the other criteria can be neither found nor evaluated using the index, or if there is no need to access full table rows, ICP cannot be used, and the regular Using where
strategy must be used. This is your first case (since id
and ccgeid
can be read from the index being used for the filtering, removing the need to read the table row).
Of course, the first (and best) case is not currently available to you, because telephone
in your schema is not searchable using an index.
If the search is always for the last N characters of telephone
, my advice is to create a column reversed_telephone
. For example, if telephone
is 123456789
, then reversed_telephone
should be 987654321
. To search for the last 3 digits (789
), look for reversed_telephone LIKE '987%'
. To search for the last 5 digits (56789
), look for reversed_telephone LIKE '98765%'
. This way you do not need to care whether the customer remembers 3, 4 or 5 digits; the only requirement is that they know the end. Using INDEX (seid, reversed_telephone)
, you would fall into case 1 above, the optimal strategy, regardless of which fields you SELECT
.