solr query-parser

Solr highlighting - terms with umlaut not found/not highlighted


I am experimenting with Solr 7.2. I have uploaded a collection of German-language texts and am trying to run a few queries with highlighting.

If I run this query with highlighting:

http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1

I get a nice text back:

<response>
    <lst name="responseHeader">
        <bool name="zkConnected">true</bool>
        <int name="status">0</int>
        <int name="QTime">10</int>
        <lst name="params">
            <str name="hl.snippets">3</str>
            <str name="q">trans:Zeit</str>
            <str name="hl">true</str>
            <str name="hl.q">Kundigung</str>
            <str name="hl.fl">trans</str>
            <str name="rows">1</str>
            <str name="wt">xml</str>
        </lst>
    </lst>
    <result name="response" numFound="418" start="0" maxScore="1.6969817">
        <doc>
            <str name="id">x</str>
            <str name="trans">... Zeit  ...</str>
            <date name="t">2018-03-01T14:32:29.400Z</date>
            <int name="l">2305</int>
            <long name="_version_">1594374122229465088</long>
        </doc>
    </result>
    <lst name="highlighting">
        <lst name="x">
            <arr name="trans">
                <str> ... <em>Kündigung</em> ... </str>
                <str> ... <em>Kündigung</em> ... </str>
            </arr>
        </lst>
    </lst>
</response>

However, if I supply Kündigung (with the umlaut) as the highlight text, I get no highlights, as the text/query parser seems to have replaced all ü characters with u.

I have a feeling that I need to supply the correct qparser. How should I specify it? It seems to me that the collection was built and is queried with the default LuceneQParser. How can I specify this parser in the URL above?

UPDATE:

http://localhost:8983/solr/trans/schema/fields/trans returns

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"trans",
    "type":"text_de",
    "indexed":true,
    "stored":true}}

Update 2: I looked at the managed-schema in my Solr collection's configuration and found the following:

  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
    </analyzer>
  </fieldType>

The way I read this, since no separate query and index analyzers are defined, the chain above applies to both querying and indexing. So it does not show a misconfiguration like the one in answer 2 below.

I do remember adding the field trans with type text_de, though:
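One way to check what each analyzer actually does to a term is Solr's field analysis handler. The sketch below only builds the request URL; it assumes the `/analysis/field` handler is available on the collection, and the helper name `analysis_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Collection URL taken from the question.
SOLR_BASE = "http://localhost:8983/solr/trans"


def analysis_url(field_type: str, index_value: str, query_value: str) -> str:
    """Build a URL for Solr's field analysis handler (hypothetical helper)."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": index_value,  # analyzed with the index-time chain
        "analysis.query": query_value,       # analyzed with the query-time chain
        "wt": "json",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default.
    return f"{SOLR_BASE}/analysis/field?{urlencode(params)}"


url = analysis_url("text_de", "Kündigung", "Kündigung")
print(url)
```

Opening the printed URL in a browser (or with curl) should list the tokens after each filter stage, which makes it easy to see whether the index and query sides end up with the same term.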

curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
     "name":"trans",
     "type":"text_de",
     "stored":true,
     "indexed":true}
}' http://localhost:8983/solr/trans/schema

I've deleted all the documents using

curl http://localhost:8983/solr/trans/update?commit=true -d "<delete><query>*:*</query></delete>"

and then reinserted them:

curl  -X POST http://localhost:8983/solr/trans/update?commit=true -H "Content-Type: application/json" -d @all.json

Is this the correct way to "rebuild" the index in Solr?
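For reference, the delete-then-repost sequence above could be scripted roughly as follows. This is only a sketch: it prepares the two requests without sending them, uses the JSON form of the delete-by-query command instead of the XML one, and the function names are hypothetical.

```python
import json
from urllib.request import Request

# Update endpoint from the question, with an immediate commit.
UPDATE_URL = "http://localhost:8983/solr/trans/update?commit=true"


def delete_all_request() -> Request:
    """Prepare a delete-by-query request that matches every document."""
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    return Request(UPDATE_URL, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


def reindex_request(docs: list) -> Request:
    """Prepare a request that posts the documents as a UTF-8 JSON array."""
    # ensure_ascii=False keeps umlauts as real UTF-8 bytes, not \u escapes.
    body = json.dumps(docs, ensure_ascii=False).encode("utf-8")
    return Request(UPDATE_URL, data=body, method="POST",
                   headers={"Content-Type": "application/json; charset=utf-8"})


delete_req = delete_all_request()
index_req = reindex_request([{"id": "x", "trans": "Kündigung"}])
print(delete_req.data)
print(index_req.data)
```

Passing each prepared request to `urllib.request.urlopen` would send it to a running Solr instance.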

UPDATE 3: The default charset of the standard Java installation was not set to UTF-8:

C:\tmp>java -classpath . Hello
Cp1252
Cp1252
windows-1252

C:\tmp>cat Hello.java
public class Hello {
 public static void main(String args[]) throws Exception{
  // not cross-platform safe
  System.out.println(System.getProperty("file.encoding"));
  // jdk1.4
  System.out.println(
     new java.io.OutputStreamWriter(
        new java.io.ByteArrayOutputStream()).getEncoding()
     );
  // jdk1.5
  System.out.println(java.nio.charset.Charset.defaultCharset().name());
  }
}

UPDATE 4: Restarted Solr with UTF-8 settings:

bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 8983 -s example/cloud/node1/solr
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 7574 -s example/cloud/node2/solr -z localhost:9983

Checked the JVM settings:

http://localhost:8983/solr/#/~java-properties



file.encoding    UTF8
file.encoding.pkg    sun.io

Reinserted the docs. No change: http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml gives:

<lst name="highlighting">
    <lst name="32e42caa-313d-45ed-8095-52f2dd6861a1">
        <arr name="trans">
            <str> ... <em>Kündigung</em> ...</str>
            <str> ... <em>Kündigung</em> ...</str>
        </arr>
    </lst>
</lst>

http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml gives:

<lst name="highlighting">
    <lst name="32e42caa-313d-45ed-8095-52f2dd6861a1"/>
</lst>
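To rule out the client side, it can help to see what bytes end up in the URL for the umlaut under different encodings. A small sketch using only the Python standard library; it shows what each encoding puts on the wire and says nothing about how the server decodes it.

```python
from urllib.parse import quote

term = "Kündigung"

# UTF-8 encodes "ü" as two bytes, Cp1252 (the JVM default seen above) as one.
utf8_form = quote(term, encoding="utf-8")
cp1252_form = quote(term, encoding="cp1252")

print(utf8_form)    # K%C3%BCndigung
print(cp1252_form)  # K%FCndigung
```

The second URL above already uses the UTF-8 form (`K%C3%BCndigung`), so the request itself carries the umlaut in UTF-8.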

uchardet all.json (file -bi all.json) reports UTF-8

Running from the Ubuntu subsystem under Windows:

$ export LC_ALL='en_US.UTF-8'
$ export LC_CTYPE='en_US.UTF-8'
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
  "query" : "trans:Kündigung",
  "limit" : "1", params: {"hl.q":"Kündigung"}
}'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":21,
    "params":{
      "hl":"true",
      "fl":"id",
      "json":"\n{\n  \"query\" : \"trans:Kündigung\",\n  \"limit\" : \"1\", params: {\"hl.q\":\"Kündigung\"}\n}",
      "hl.fl":"trans"}},
  "response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
      {
        "id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
  },
  "highlighting":{
    "b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{}}}
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
  "query" : "trans:Kündigung",
  "limit" : "1", params: {"hl.q":"Kundigung"}
}'
{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":18,
    "params":{
      "hl":"true",
      "fl":"id",
      "json":"\n{\n  \"query\" : \"trans:Kündigung\",\n  \"limit\" : \"1\", params: {\"hl.q\":\"Kundigung\"}\n}",
      "hl.fl":"trans"}},
  "response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
      {
        "id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
  },
  "highlighting":{
    "b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{
      "trans":[" ... <em>Kündigung</em> ..."]}}}

UPDATE 5: Without supplying hl.q (http://localhost:8983/solr/trans/select?q=trans:Kundigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml or http://localhost:8983/solr/trans/select?q=trans:K%C3%BCndigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml):

<lst name="highlighting">
    <lst name="b952b811-3711-4bb1-ae3d-e8c8725dcfe7">
        <arr name="trans">
            <str> ... <em>Kündigung</em> ... </str>
            <str> ... <em>Kündigung</em> ... </str>
            <str> ... <em>Kündigung</em> ... </str>
        </arr>
    </lst>
</lst>

In this case, with no hl.q supplied, the highlighter took the highlighting terms from the query itself and did a superb job.


Solution

  • Check your analyzer chain too. I get the same behaviour you describe when I misconfigure the chain this way:

      <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
        </analyzer>
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
          <filter class="solr.GermanNormalizationFilterFactory"/>
          <filter class="solr.GermanLightStemFilterFactory"/>
        </analyzer>
      </fieldType>
    

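    The GermanNormalizationFilterFactory and GermanLightStemFilterFactory both replace umlauts. If they run only at index time, the indexed term has no umlaut while the query term still does, so the two never match.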
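    The mismatch can be illustrated with a toy model of the two chains. This is a deliberately simplified stand-in, not the Lucene algorithm: the real filters also handle spellings such as "ue" and apply stemming, and the function names here are invented for the example.

```python
# Simplified stand-in for the umlaut folding done by the German filters.
# NOT the Lucene implementation: no "ae"/"oe"/"ue" handling, no stemming.
FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def index_chain(text: str) -> str:
    """Lowercase and fold umlauts, like the index analyzer above."""
    return text.lower().translate(FOLD)


def query_chain_misconfigured(text: str) -> str:
    """Lowercase only: the query analyzer above lacks the German filters."""
    return text.lower()


indexed = index_chain("Kündigung")
print(indexed)                                            # kundigung
print(query_chain_misconfigured("Kündigung") == indexed)  # False
print(query_chain_misconfigured("Kundigung") == indexed)  # True
```

    With folding applied only on the index side, the umlaut-free query term is the only one that lines up with the indexed term, which matches the symptom in the question.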