I'm running a MediaWiki instance that I just upgraded to the latest version at the time of this writing, 1.32.0. This wiki is nearly 10 years old and has gone through a number of upgrades.
The wiki is in French, and one long-standing annoyance for French speakers is that the built-in search has always treated accented characters as different from their unaccented counterparts, version after version.
For example, searching for Aromathérapie returns a number of results, while searching for Aromatherapie returns 0 results.
I thought that this was a database collation issue at first, until I noticed that the searchindex table is actually populated with ASCII-encoded UTF-8 words. Taking the example above, aromathérapie is stored as aromathu8c3a9rapie, so changing the table collation does not help.
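To make the stored form concrete, here is a minimal Python sketch that emulates the encoding described above: lowercase the text, then replace each multi-byte UTF-8 sequence with u8 followed by the hex of its bytes. The function name encode_for_index is hypothetical, and this only imitates the observed output; it is not MediaWiki's actual PHP code.

```python
import re


def encode_for_index(text: str) -> str:
    """Emulate the ASCII-safe form observed in the searchindex table.

    Assumption: each multi-byte UTF-8 sequence becomes "u8" + hex bytes,
    after lowercasing, as in the aromathu8c3a9rapie example.
    """
    raw = text.lower().encode("utf-8")
    # A lead byte (0xC0-0xFF) followed by its continuation bytes (0x80-0xBF).
    encoded = re.sub(
        rb"[\xc0-\xff][\x80-\xbf]*",
        lambda m: b"u8" + m.group(0).hex().encode("ascii"),
        raw,
    )
    return encoded.decode("ascii")


if __name__ == "__main__":
    print(encode_for_index("Aromathérapie"))  # aromathu8c3a9rapie
    print(encode_for_index("Aromatherapie"))  # aromatherapie
```

Because the two indexed strings differ in plain ASCII, no collation can treat them as equal: by the time MySQL sees the text, the é is already gone.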
Digging through the source code, I found the SearchMySQL::normalizeText() method that is responsible for this encoding.
And as far as I can see, the only normalization that this method does prior to encoding is lowercasing:
MediaWikiServices::getInstance()->getContentLanguage()->lc( $out )
So as it stands, it looks like there is no way to make the built-in search ignore accents.
I googled quite a lot for solutions, and found mostly old, irrelevant threads. I'm really surprised not to find more literature on the subject.
How can I make the MediaWiki search both case- AND accent-insensitive?
I'm not proud of it, but here's how I solved it, using MySQL's built-in support for collations (which does work with fulltext indexes, at least in recent versions of MySQL, contrary to what the code says):
1. Convert the searchindex table to utf8mb4:
   ALTER TABLE searchindex CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
2. Patch includes/search/SearchMySQL.php: use the u flag in preg_replace().
3. Rebuild the searchindex table:
   php maintenance/rebuildtextindex.php
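As a rough illustration of what an accent- and case-insensitive comparison buys you once the index holds real UTF-8 text, here is a Python sketch that folds case and strips combining marks before comparing. The helper names fold and matches are hypothetical, and the approach (Unicode NFD decomposition) only approximates the behaviour of utf8mb4_unicode_ci; MySQL applies its own collation rules.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase the text and drop accents (combining marks)."""
    # NFD splits "é" into "e" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, indexed_word: str) -> bool:
    """Compare two words the way an accent-insensitive collation roughly would."""
    return fold(query) == fold(indexed_word)


if __name__ == "__main__":
    print(fold("Aromathérapie"))                      # aromatherapie
    print(matches("Aromatherapie", "aromathérapie"))  # True
```

This only works because the accented character survives into the comparison, which is why the table conversion alone is not enough without also changing how the text is indexed.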
A similar procedure will have to be applied whenever the MediaWiki installation is updated, which adds to the maintenance cost. The procedure being simple, it's a cost I'm willing to accept right now.
A final note is that this does not make the autocompletion work case-insensitively, only the search results. This is good enough for me for now.