Elasticsearch has a built-in "highlight" function which allows you to tag the matched terms in the results (more complicated than it might at first sound, since the query syntax can include near matches etc.).
I have HTML fields, and Elasticsearch stomps all over the HTML syntax when I turn on highlighting.
Can I make it HTML-aware / HTML-safe when highlighting in this way?
I'd like the highlighting to apply to the text in the HTML document, and not to highlight any HTML markup which has matched the search, i.e. a search for "p" might highlight <p>p</p> -> <p><mark>p</mark></p>.
My fields are indexed as "type: string".
The documentation says:
Encoder:
An encoder parameter can be used to define how highlighted text will be encoded. It can be either default (no encoding) or html (will escape html, if you use html highlighting tags).
... but that HTML-escapes my already HTML-encoded field, breaking things further.
Here are two example queries.
With the default encoder, the highlight tags are inserted inside other tags, i.e. <p> -> <<tag1>p</tag1>>:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "encoder": "default",
    "fields": {
      "preview_html": {}
    }
  },
  "from": 22, "size": 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "<<tag1>p</tag1> class=\"text\">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class=\"text\">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
}
...
With the html encoder, the existing HTML syntax is escaped by Elasticsearch, which breaks things, i.e. <p> -> &lt;<tag1>p</tag1>&gt;:
curl -XPOST -H 'Content-type: application/json' "http://localhost:7200/myindex/_search?pretty" -d '
{
  "query": { "match": { "preview_html": "p" } },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "encoder": "html",
    "fields": {
      "preview_html": {}
    }
  },
  "from": 22, "size": 1
}'
GIVES:
...
"highlight" : {
"preview_html" : [ "<<tag1>p</tag1> class="text">TOP STORIES</<tag1>p</tag1>><<tag1>p</tag1> class="text">Middle East</<tag1>p</tag1>><<tag1>p</tag1> class="text">Syria: Developments in Syria are main story in Middle East</<tag1>p</tag1>>" ]
}
}
...
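Both outputs above are consistent with the field being analyzed as plain text, so that the tag name in <p> becomes an ordinary token "p" and gets highlighted like any other word. The following is only a toy model of that behaviour in pure Python, not Elasticsearch's actual highlighter: the function name naive_highlight, the \w+ tokenizer, and the use of Python's html.escape as a stand-in for the html encoder are all assumptions made for illustration.

```python
import html
import re


def naive_highlight(source, term, pre="<tag1>", post="</tag1>", encoder="default"):
    """Toy model: tokenize raw HTML as if it were plain text and wrap matching tokens.

    Hypothetical helper for illustration only; it is not part of Elasticsearch.
    """
    # "html" escapes every piece of the original text; "default" leaves it untouched.
    encode = html.escape if encoder == "html" else (lambda s: s)
    out = []
    pos = 0
    for m in re.finditer(r"\w+", source):
        # Text between tokens (including "<", ">", quotes) is passed through the encoder.
        out.append(encode(source[pos:m.start()]))
        token = m.group(0)
        if token.lower() == term.lower():
            # The pre/post tags themselves are never encoded.
            out.append(pre + encode(token) + post)
        else:
            out.append(encode(token))
        pos = m.end()
    out.append(encode(source[pos:]))
    return "".join(out)


doc = '<p class="text">TOP STORIES</p>'
print(naive_highlight(doc, "p"))
print(naive_highlight(doc, "p", encoder="html"))
```

In this model the default encoder wraps the "p" inside each tag, and the html encoder additionally escapes the surrounding angle brackets and quotes, which is the same kind of damage as in the two results above. Neither encoder setting changes which tokens match, so the fix has to happen at analysis time.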
One way to achieve this is to use the html_strip char filter when analyzing the preview_html field.
This ensures that matches do not occur on HTML markup, and hence highlighting ignores it too, as shown in the example below.
Example:
PUT test
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "my_html": {
            "type": "html_strip"
          }
        },
        "analyzer": {
          "my_html": {
            "tokenizer": "standard",
            "char_filter": [
              "my_html"
            ],
            "type": "custom"
          }
        }
      }
    }
  }
}

PUT test/test/_mapping
{
  "properties": {
    "preview_html": {
      "type": "string",
      "analyzer": "my_html",
      "search_analyzer": "standard"
    }
  }
}

PUT test/test/1
{
  "preview_html": "<p> p </p>"
}

POST test/test/_search
{
  "query": {
    "match": {
      "preview_html": "p"
    }
  },
  "highlight": {
    "fields": {
      "preview_html": {}
    }
  }
}

Result:
"hits": [
  {
    "_index": "test",
    "_type": "test",
    "_id": "1",
    "_score": 0.30685282,
    "_source": {
      "preview_html": "<p> p </p>"
    },
    "highlight": {
      "preview_html": [
        "<p> <em>p</em> </p>"
      ]
    }
  }
]
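To make it clearer why the markup survives intact in that result, here is a small pure-Python sketch of the idea: tags are hidden from the tokenizer, but token offsets still refer to the original string, so the highlight tags are spliced into the untouched source. This is a simplified stand-in, not the real html_strip char filter (which also decodes entities and corrects offsets rather than masking); the function name highlight_text_only and the regexes are assumptions for illustration.

```python
import re

TAG = re.compile(r"<[^>]*>")   # crude tag matcher, good enough for the sketch
WORD = re.compile(r"\w+")      # crude stand-in for the standard tokenizer


def highlight_text_only(source, term, pre="<em>", post="</em>"):
    """Toy model: highlight `term` only where it occurs outside HTML tags.

    Hypothetical helper for illustration only; it is not part of Elasticsearch.
    """
    # Replace each tag with spaces of equal length so that offsets found in
    # the masked string are still valid offsets into the original source.
    masked = TAG.sub(lambda m: " " * len(m.group(0)), source)
    out = []
    pos = 0
    for m in WORD.finditer(masked):
        if m.group(0).lower() == term.lower():
            out.append(source[pos:m.start()])                    # original text, markup included
            out.append(pre + source[m.start():m.end()] + post)   # wrapped match
            pos = m.end()
    out.append(source[pos:])
    return "".join(out)


print(highlight_text_only("<p> p </p>", "p"))
print(highlight_text_only('<p class="text">Middle East</p>', "p"))
```

The first call reproduces the highlight from the example document, and the second leaves the paragraph unchanged because "p" only appears as a tag name there.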