
Zend_Search_Lucene search with accented words


I'm working on a search engine for a French website, using Zend_Search_Lucene as a standalone component. Everything works well on my local web server (WAMP) on Windows, but searching for accented words (such as géographie) doesn't work on my production server, which runs on Unix.

I generated the index on Linux, and the accented words are indexed correctly.

See a screenshot of my generated index here

I tried forcing the encoding through the analyzer's parameters and converting the query string with utf8_encode, but I still can't get it to work.

I call Lucene with these settings:

Zend_Search_Lucene_Search_QueryParser::setDefaultOperator(Zend_Search_Lucene_Search_QueryParser::B_AND);
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive());
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$index = Zend_Search_Lucene::open($cheminIndexes);
$resultats = $index->find(Zend_Search_Lucene_Search_QueryParser::parse(utf8_encode($_POST['recherche'])));

This code returns results for all the non-accented words, but it doesn't return any of my accented words, even though those words are indexed. It's frustrating because I don't understand why it works on Windows. I feel I'm missing a layer of encoding somewhere, but I can't find any information about this on Google.


Solution

  • I have a site set up with exactly the same options as yours (case-insensitive, UTF-8, AND). However, I used to create the index object via:

    $index = new Zend_Search_Lucene('/path/to/index');
    

    rather than through the proxy, as you do with Zend_Search_Lucene::open (although that should not make any difference).

    Also, I pass the query (after a short sanity check) directly to the index, without parsing it first:

    $query = $_GET['q'];
    ...
    $results = $index->find($query);
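One possible reason that passing the query through untouched helps (an assumption, not something confirmed in this thread): if the form already submits UTF-8, calling utf8_encode on it encodes the accented characters a second time, so "é" turns into "Ã©" and no longer matches the indexed term. The sketch below shows a sanity check that converts only when the input is not already valid UTF-8. The helper name normalizeQueryToUtf8 is hypothetical, the mbstring extension is assumed to be available, and any non-UTF-8 input is assumed to be ISO-8859-1.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene.
// Returns the query string as UTF-8 without encoding it twice.
function normalizeQueryToUtf8(string $raw): string
{
    // Already valid UTF-8: return as-is so accented characters stay intact.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1 (Latin-1).
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// Converting an already-UTF-8 string from Latin-1 again shows the double
// encoding: the two bytes of "é" (C3 A9) become four bytes (C3 83 C2 A9).
$utf8Query    = "g\xC3\xA9ographie"; // "géographie" in UTF-8
$doubleEncoded = mb_convert_encoding($utf8Query, 'UTF-8', 'ISO-8859-1');
```

With such a helper, the search call would become $index->find(normalizeQueryToUtf8($_GET['q'])) in place of the unconditional utf8_encode. Comparing bin2hex() of the incoming query on both servers is a quick way to check whether the two environments actually receive different bytes.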