
With Elasticsearch, can I highlight with different HTML tags for different matched tokens?


Learning ES at the moment, but I'm very keen to implement this.

I know you can highlight different fields with different tags, using the pre_tags and post_tags keys of highlight in a query... but is it possible to deliver a marked-up string where the returned fragment has a different HTML colour tag for each separate identified word, e.g. using a simple query string?

So I query with "interesting data" and a document field is returned like so:

the other day I was walking through the woods and I had an <font color="blue">interesting</font> 
thought about some <font color="red">data</font>

What I'm getting at is not simply that the tags alternate "mindlessly": again, you can do that with the Fast Vector Highlighter, e.g.:

"highlight": {
    "fields": {
        "description": {
            "pre_tags": ["<b>", "<em>"],
            "post_tags": ["</b>", "</em>"]
        }
    }
}

Instead, I would like the field

"the other data day data was walking through some interesting woods and data had an interesting thought about some data"

to be returned thus:

the other <font color="red">data</font> day <font color="red">data</font> was walking through some
<font color="blue">interesting</font> woods and <font color="red">data</font> had an
<font color="blue">interesting</font> thought about some <font color="red">data</font>

I've previously coded using Lucene, i.e. Java, and I did manage to implement this sort of thing, by majorly jumping through hoops.

NB one answer to this might be "forget about ES returning marked up text, just apply your own tags using re.sub( r'\bdata\b', '<font color="red">data</font>', field_string )".
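
A minimal sketch of that do-it-yourself approach, to make its limits concrete. The helper name colour_terms and the term-to-colour map are hypothetical, introduced only for illustration; the matching is exact whole-word matching, with no analysis or stemming:

```python
import re


def colour_terms(text, colours):
    """Wrap each exact whole-word occurrence of a term in a <font> tag."""
    # One alternation over all terms, longest first so longer terms win.
    terms = sorted(colours, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")

    def wrap(match):
        word = match.group(1)
        return '<font color="{}">{}</font>'.format(colours[word], word)

    return pattern.sub(wrap, text)


# Exact forms are found and coloured per term.
english = colour_terms(
    "an interesting thought about some data",
    {"data": "red", "interesting": "blue"},
)

# Inflected forms ("éléments", "changés") are not exact matches,
# so the text comes back unchanged.
french = colour_terms(
    "Les autres éléments ont été changés",
    {"changer": "blue", "élément": "red"},
)

print(english)
print(french)
```

The second call returns its input untouched, because neither "changer" nor "élément" occurs as a whole word in the text.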

This would be OK for a simple use-case like this. But it doesn't work with a stemmer analyser. E.g., to give a French example: search query is "changer élément". I want the following marked-up result:

Les autres <font color="red">éléments</font> ont été
<font color="blue">changés</font> car on a appliqué un <font color="blue">changement</font>
à chaque <font color="red">élément</font>

i.e. "changer", "changés" and "changement" all stem to "chang", and "élément" and "éléments" stem to "element". A standard highlighted return of this field would thus be:

Les autres <em>éléments</em> ont été <em>changés</em> car on a appliqué un 
<em>changement</em> à chaque <em>élément</em>

Solution

  • The fast vector highlighter is a good place to start. I haven't worked with French yet, so don't consider the following authoritative, but based on the built-in french analyzer we could do something like this:

    PUT multilang_index
    {
      "mappings": {
        "properties": {
          "description": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "fields": {
              "french": {
                "type": "text",
                "analyzer": "french",
                "term_vector": "with_positions_offsets"
              }
            }
          }
        }
      }
    }
    

    FYI the french analyzer could be reimplemented/extended as a custom analyzer, as shown in the Elasticsearch language analyzers documentation.

    After ingesting the English & French examples:

    POST multilang_index/_doc
    {
      "description": "the other data day data was walking through some interesting woods and data had an interesting thought about some data"
    }
    
    POST multilang_index/_doc
    {
      "description": "Les autres éléments ont été changés car on a appliqué un changement à chaque élément"
    }
    

    We can query for interesting data like so:

    POST multilang_index/_search
    {
      "query": {
        "simple_query_string": {
          "query": "interesting data",
          "fields": [
            "description"
          ]
        }
      },
      "highlight": {
        "fields": {
          "description": {
            "type": "fvh",
            "pre_tags": ["<font color=\"red\">", "<font color=\"blue\">"],
            "post_tags": ["</font>", "</font>"]
          }
        },
        "number_of_fragments": 0
      }
    }
    

    yielding

    the other <font color="blue">data</font> day <font color="blue">data</font> 
    was walking through some <font color="red">interesting</font> woods and 
    <font color="blue">data</font> had an <font color="red">interesting</font>
    thought about some <font color="blue">data</font>
    

    and analogously for changer élément:

    POST multilang_index/_search
    {
      "query": {
        "simple_query_string": {
          "query": "changer élément",
          "fields": [
            "description.french"
          ]
        }
      },
      "highlight": {
        "fields": {
          "description.french": {
            "type": "fvh",
            "pre_tags": ["<font color=\"red\">", "<font color=\"blue\">"],
            "post_tags": ["</font>", "</font>"]
          }
        },
        "number_of_fragments": 0
      }
    }
    

    yielding

    Les autres <font color="blue">éléments</font> ont été 
    <font color="red">changés</font> car on a appliqué un 
    <font color="red">changement</font> à chaque <font color="blue">élément</font>
    

    which, to me, looks correctly stemmed.


    Note that the pre_tags order follows the order of the terms inside the simple_query_string query, not the order in which the matches appear in the text. When querying for changer élément, the token éléments in the description is matched first, but what caused it to match is the 2nd query term (élément), hence the blue HTML tag instead of the red.
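
Since the question is tagged python, here is a rough sketch of building the same request body from a list of colours. The function names build_highlight_search and expected_colours are hypothetical. The rule coded in expected_colours (the i-th whitespace-separated query term gets the i-th tag) is inferred from the two results above, and the wrap-around when there are more terms than tags is an assumption; neither is a documented guarantee:

```python
import json


def build_highlight_search(query, field, colours):
    """Build a _search body that highlights `field` with one <font> tag per colour."""
    return {
        "query": {"simple_query_string": {"query": query, "fields": [field]}},
        "highlight": {
            "fields": {
                field: {
                    "type": "fvh",
                    "pre_tags": ['<font color="{}">'.format(c) for c in colours],
                    "post_tags": ["</font>"] * len(colours),
                }
            },
            "number_of_fragments": 0,
        },
    }


def expected_colours(query, colours):
    """Assumed rule: the i-th query term gets the i-th tag, wrapping around."""
    terms = query.split()
    return {term: colours[i % len(colours)] for i, term in enumerate(terms)}


body = build_highlight_search("changer élément", "description.french", ["red", "blue"])
print(json.dumps(body, ensure_ascii=False, indent=2))
print(expected_colours("changer élément", ["red", "blue"]))
# → {'changer': 'red', 'élément': 'blue'}
```

The body can be sent to multilang_index/_search with any HTTP client. To get the colours asked for in the question (data in red, interesting in blue), list the colours in query-term order, i.e. ["blue", "red"] for the query interesting data.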