Hopefully I just can't see the forest for the trees but my full text search behaves very strangely and I cannot solve this by myself. (I tried to search for a solution but so far no luck, so any help is greatly appreciated.)
So my problem is: if I search for "tök" (it means "pumpkin" in Hungarian) the list also contains results with "tok" (which means "case"). If I search for a pumpkin I clearly don't want a phone case or such things.
My system is MySQL every table is in InnoDB, utf8_general_ci
this is the ( simplified ) query:
SELECT id_item,item_title,tag_name, MATCH (item_title) AGAINST ('tök' IN NATURAL LANGUAGE MODE) AS title_relevance, MATCH (tag_name) AGAINST ('tök' IN NATURAL LANGUAGE MODE) AS tag_relevance
FROM item_translations
WHERE NULL IS NULL
AND ( MATCH (tile_item_title) AGAINST ('+tök' IN NATURAL LANGUAGE MODE ) OR MATCH (tag_name) AGAINST ('+tök' IN NATURAL LANGUAGE MODE ) )
AND id_language=1
ORDER BY title_relevance DESC, tag_relevance DESC
LIMIT 0,40
PS: the keywords are not always in Hungarian because this website is multilingual so I need a relatively flexible solution which works with most of the accented letters (if it's possible)
Equality in a string comparison is specified by the collation. general
will treat every letter like their (latin) base character. You need to specify a collation that supports the accents and umlauts that you want to be distinct.
The collation includes the language specifics. E.g. for spanish, n < ñ < o
(while n = ñ
for basically every other language), for swedish you have Y = Ü
, for german (and most collations) there is ß = ss
, and for hungarian (and many other collations) you have o < ö
.
So for a hungarian site, you may want to choose utf8_hungarian_ci
, and if your software is localizable to a specific language (and audience), you may want to adjust the collation of that column or let the administrator choose one. Unfortunately, for a fulltext search (in contrast to other string comparisons like =
or order by
), you cannot specify a collation in the query on the fly, so you need to choose a single one.
On a general multilanguage site, most users will probably expect a search to fit a very general english/russian/chinese-schema, and would not be surprised if they find tök
when entering tok
. They might even be irritated to not get those, especially if they do not have an ö
on their keyboard and actually want to buy a pumpkin (and know the hungarian word for it). Most search engines will actually try to not be too narrow, and want to find café
when you enter cafe
, and oftentimes put some work into being able to find café
when you enter coffee
, caffé
or cafée
.
There is no language that will handle every accent and umlaut differently though. If you really want to distinguish every single special character, you may want to try utf8_bin
(although I am not sure if I would call it most flexible). It is important to note that it is case sensitive, but since a fulltext search is always case insensitive, this would not matter. If you do other string comparisons on this column (e.g. like
), this can be problematic though. Also, you will loose language specific behaviour, e.g. Y = Ü
or ß = ss
(unless you implement it yourself).