I am experimenting with Solr 7.2. I've uploaded a collection of German-language texts and am trying to run a few queries with highlighting.
If I fire this query with highlighting:
I get a nice text back:
<response>
<lst name="responseHeader">
<bool name="zkConnected">true</bool>
<int name="status">0</int>
<int name="QTime">10</int>
<lst name="params">
<str name="hl.snippets">3</str>
<str name="q">trans:Zeit</str>
<str name="hl">true</str>
<str name="hl.q">Kundigung</str>
<str name="hl.fl">trans</str>
<str name="rows">1</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="418" start="0" maxScore="1.6969817">
<doc>
<str name="id">x</str>
<str name="trans">... Zeit ...</str>
<date name="t">2018-03-01T14:32:29.400Z</date>
<int name="l">2305</int>
<long name="_version_">1594374122229465088</long>
</doc>
</result>
<lst name="highlighting">
<lst name="x">
<arr name="trans">
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
</arr>
</lst>
</lst>
</response>
However, if I supply Kündigung as the highlight text (hl.q), I get no highlights back; it looks as though the text/query parser replaced all the ü characters with u.
I have a feeling that I need to supply the correct query parser. How should I specify it? It seems to me that the collection was built and queried with the default LuceneQParser. How can I supply this parser in the URL above?
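For reference, Solr's highlighting parameters include hl.qparser, which names the query parser applied to hl.q. Below is a minimal Python sketch of building such a request. The host, collection and field names come from the question; the helper name build_highlight_url is made up for illustration. Note that urlencode percent-encodes non-ASCII characters as UTF-8 by default.

```python
from urllib.parse import urlencode

# Base URL taken from the question; adjust host/collection as needed.
SOLR_SELECT = "http://localhost:8983/solr/trans/select"

def build_highlight_url(query, highlight_query, qparser="lucene"):
    """Hypothetical helper: build a select URL with highlighting enabled."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": "trans",
        "hl.q": highlight_query,
        "hl.qparser": qparser,  # query parser used for hl.q
        "hl.snippets": 3,
        "rows": 1,
        "wt": "xml",
    }
    # urlencode percent-encodes keys and values as UTF-8 by default.
    return SOLR_SELECT + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_highlight_url("trans:Zeit", "Kündigung"))
```

As the later updates in this question show, adding hl.qparser=lucene on its own did not change the result.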
UPDATE:
http://localhost:8983/solr/trans/schema/fields/trans
returns
{
"responseHeader":{
"status":0,
"QTime":0},
"field":{
"name":"trans",
"type":"text_de",
"indexed":true,
"stored":true}}
UPDATE 2: I've looked at the managed-schema in my Solr collection's configuration and found the following:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
The way I interpret this is that, since no separate query and index analyzers are defined, the chain above is used for both querying and indexing. So it does not show a misconfiguration like the one in answer 2 below.
I remember, though, adding the field trans with type text_de:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"trans",
"type":"text_de",
"stored":true,
"indexed":true}
}' http://localhost:8983/solr/trans/schema
I've deleted all the documents using
curl http://localhost:8983/solr/trans/update?commit=true -d "<delete><query>*:*</query></delete>"
and then reinserted them:
curl -X POST http://localhost:8983/solr/trans/update?commit=true -H "Content-Type: application/json" -d @all.json
Is this the correct way to "rebuild" the index in Solr?
UPDATE 3: The charset settings of the standard Java installation were not set to UTF-8:
C:\tmp>java -classpath . Hello
Cp1252
Cp1252
windows-1252
C:\tmp>cat Hello.java
public class Hello {
public static void main(String args[]) throws Exception{
// not cross-platform safe
System.out.println(System.getProperty("file.encoding"));
// jdk1.4
System.out.println(
new java.io.OutputStreamWriter(
new java.io.ByteArrayOutputStream()).getEncoding()
);
// jdk1.5
System.out.println(java.nio.charset.Charset.defaultCharset().name());
}
}
UPDATE 4: Restarted Solr with UTF-8 settings:
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 8983 -s example/cloud/node1/solr
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 7574 -s example/cloud/node2/solr -z localhost:9983
Checked the JVM settings:
http://localhost:8983/solr/#/~java-properties
file.encoding UTF8
file.encoding.pkg sun.io
Reinserted the docs. No change:
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml
gives:
<lst name="highlighting">
<lst name="32e42caa-313d-45ed-8095-52f2dd6861a1">
<arr name="trans">
<str> ... <em>Kündigung</em> ...</str>
<str> ... <em>Kündigung</em> ...</str>
</arr>
</lst>
</lst>
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml
gives:
<lst name="highlighting">
<lst name="32e42caa-313d-45ed-8095-52f2dd6861a1"/>
</lst>
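The %C3%BC in the second URL is the UTF-8 percent-encoding of ü. As a quick sanity check, here is a small Python sketch (independent of Solr) showing that the same word encoded as Cp1252, the JVM default found in UPDATE 3, yields a different byte sequence, and that decoding with the wrong charset garbles the umlaut:

```python
from urllib.parse import quote, unquote

word = "Kündigung"

utf8_form = quote(word, encoding="utf-8")     # ü becomes two bytes: C3 BC
cp1252_form = quote(word, encoding="cp1252")  # ü becomes one byte: FC

print(utf8_form)    # K%C3%BCndigung
print(cp1252_form)  # K%FCndigung

# Decoding the UTF-8 bytes as Cp1252 produces mojibake.
print(unquote(utf8_form, encoding="cp1252"))  # KÃ¼ndigung
```

If a client and server disagree on the charset in this way, the term that reaches the analyzer is no longer the one that was typed.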
uchardet all.json (file -bi all.json) reports UTF-8.
Running from the ubuntu subsystem under windows:
$ export LC_ALL='en_US.UTF-8'
$ export LC_CTYPE='en_US.UTF-8'
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
"query" : "trans:Kündigung",
"limit" : "1", params: {"hl.q":"Kündigung"}
}'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":21,
"params":{
"hl":"true",
"fl":"id",
"json":"\n{\n \"query\" : \"trans:Kündigung\",\n \"limit\" : \"1\", params: {\"hl.q\":\"Kündigung\"}\n}",
"hl.fl":"trans"}},
"response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
{
"id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
},
"highlighting":{
"b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{}}}
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
"query" : "trans:Kündigung",
"limit" : "1", params: {"hl.q":"Kundigung"}
}'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":18,
"params":{
"hl":"true",
"fl":"id",
"json":"\n{\n \"query\" : \"trans:Kündigung\",\n \"limit\" : \"1\", params: {\"hl.q\":\"Kundigung\"}\n}",
"hl.fl":"trans"}},
"response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
{
"id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
},
"highlighting":{
"b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{
"trans":[" ... <em>Kündigung</em> ..."]}}}
UPDATE 5: Without supplying hl.q, using either
http://localhost:8983/solr/trans/select?q=trans:Kundigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml
or
http://localhost:8983/solr/trans/select?q=trans:K%C3%BCndigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml
I get:
<lst name="highlighting">
<lst name="b952b811-3711-4bb1-ae3d-e8c8725dcfe7">
<arr name="trans">
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
</arr>
</lst>
</lst>
In this case, with hl.q omitted, the highlighting terms were taken from the query itself, and the result was superb.
Check your analyzer chain too. I get the same behaviour you describe when I misconfigure the chain this way:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
GermanNormalizationFilterFactory and GermanLightStemFilterFactory both replace umlauts, and in this chain they are applied at index time but not at query time.
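To illustrate why such an asymmetric chain breaks matching, here is a deliberately simplified Python sketch. The fold_umlauts function and both analyzer functions are made-up stand-ins for this example; they only mimic lowercasing and umlaut folding and are not Lucene's actual filter logic (stemming, stop words and the ae/oe/ue rules are ignored).

```python
# Simplified stand-ins for the two analyzer chains in the misconfigured
# field type above; NOT the real Lucene filters.
UMLAUT_MAP = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def fold_umlauts(token):
    """Hypothetical helper: replace umlauts roughly as the German filters do."""
    return token.translate(UMLAUT_MAP)

def index_analyzer(text):
    # Index-time chain: lowercase, then fold umlauts.
    return [fold_umlauts(t) for t in text.lower().split()]

def query_analyzer(text):
    # Query-time chain: lowercase only (the German filters are missing).
    return text.lower().split()

indexed_terms = set(index_analyzer("Die Kündigung"))

print(query_analyzer("Kündigung")[0] in indexed_terms)  # False
print(query_analyzer("Kundigung")[0] in indexed_terms)  # True
```

In this sketch, if query_analyzer applied the same folding as index_analyzer, both spellings would map to the same term and match.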