I'm working on a search engine for a French website, using Zend_Search_Lucene as a standalone component. Everything works well on my local web server (WAMP) on Windows, but searching for accented words (like: géographie) doesn't work on my production server (which is running on Unix).
I generated the index on Linux, the accented words are indexed correctly.
See a screenshot of my generated index here
I tried forcing the encoding through the analyzer's parameters and converting the query string with utf8_encode, but I still can't get it to work.
I call Lucene with those parameters:
Zend_Search_Lucene_Search_QueryParser::setDefaultOperator(Zend_Search_Lucene_Search_QueryParser::B_AND);
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive());
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
$index = Zend_Search_Lucene::open($cheminIndexes);
$resultats = $index->find(Zend_Search_Lucene_Search_QueryParser::parse(utf8_encode($_POST['recherche'])));
This code returns all the non-accented words, but it doesn't return any of my accented words, even though those words are indexed. It's frustrating because I don't understand why it works on Windows. I feel I'm missing a layer of encoding somewhere, but I can't find any information about this on Google.
I have a site set up with exactly the same options as yours (case-insensitive, UTF-8, AND). However, I used to create the index object via:
$index = new Zend_Search_Lucene('/path/to/index');
and not through the proxy (as in your case via Zend_Search_Lucene::open, but that should not make any difference).
Also, I just pass the query (after a short sanity check) directly to the index, without parsing it first:
$query = $_GET['q'];
...
$results = $index->find($query);
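One thing worth checking, and this is only a guess at the cause: utf8_encode treats its input as ISO-8859-1, so if the production page already submits the form as UTF-8, the query gets encoded twice and the accented terms no longer match the index. Below is a minimal sketch of a guard for that case. The helper name toUtf8 and the ISO-8859-1 fallback are my own assumptions, not part of Zend, and it needs the mbstring extension:

```php
<?php
// Hypothetical helper (not part of Zend_Search_Lucene): return $raw as UTF-8,
// converting only when the bytes are not already valid UTF-8.
// The ISO-8859-1 fallback is an assumption about what the browser might send.
function toUtf8($raw, $fallback = 'ISO-8859-1')
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already UTF-8: converting again would double-encode it
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$utf8   = "g\xC3\xA9ographie"; // "géographie" as UTF-8 bytes
$latin1 = "g\xE9ographie";     // "géographie" as ISO-8859-1 bytes

// Both inputs end up as the same UTF-8 byte sequence.
// utf8_encode($utf8) would instead yield "g\xC3\x83\xC2\xA9ographie".
var_dump(toUtf8($utf8) === $utf8);
var_dump(toUtf8($latin1) === $utf8);
```

In your code that would mean passing toUtf8($_POST['recherche']) to the query parser instead of utf8_encode($_POST['recherche']). If the page and form are known to be UTF-8 on both servers, dropping the conversion altogether should be enough.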