In my MySQL database I have this result querying my data:
mysql> select count(*) from emails where email like '%johndoe%';
+----------+
| count(*) |
+----------+
| 102 |
+----------+
1 row in set (15.50 sec)
My data is indexed under Sphinx (Manticore Search actually) with min_word_len = 1
. Now, when I search with SphinxQL I get only partial results:
mysql> SELECT count(*) FROM search1 WHERE MATCH('@email johndoe') LIMIT 1000 OPTION max_matches=1000;
+----------+
| count(*) |
+----------+
| 16 |
+----------+
1 row in set (0.00 sec)
Any idea how to match the results MySQL give me? I tried SPH_MATCH_ANY
and SPH_MATCH_EXTENDED
with the sphinxapi, same results.
I suspect it's mainly due to whole word matching. Sphinx matches whole words. With 'words' defined as per charset_table http://sphinxsearch.com/docs/current/conf-charset-table.html
ie MATCH('@email johndoe')
is only matching addresses with johndoe
in them as a word. default charset_table keeps .
-
and @
(common in emails!) all as separators so would match johndoe@domain.com
or email@johndoe.com
, but NOT email@myjohndoe.com
, the word being indexed is myjohndoe
not johndoe
Whereas mysql 'LIKE' will happy match part words. eg email like '%johndoe%'
would johndoesmith@domain.com
, johndoes555@domain.com
and 555@johndoes.com
or whatever. Its a pure substring match.
In short might want to tweak charset_table
. could .
-
and @
all be word chars, so email would be whole word.
alternatively might just enable part word matching with min_infix_len
.
http://sphinxsearch.com/docs/current.html#conf-min-infix-len
then could do MATCH('@email *johndoe*')
which would get much closer results.
complementary to min_infix_len
would be expand_keywords
http://sphinxsearch.com/docs/current.html#conf-expand-keywords
then the * wildcards would be added automatically, so could go back to MATCH('@email johndoe')