
Highlighting inconsistency (Tire / ElasticSearch)


I am trying to use Tire (ElasticSearch) with highlighting, but I am seeing some inconsistencies and am probably doing something wrong. The problem is that it does not always highlight the possessive form of the term I am searching for. Here is the setup:

Indexing:

indexes :thesis, type: 'string', boost: 2.0, analyzer: 'snowball', as: 'index_clean_thesis'
# 'index_clean_thesis' removes some formatting characters such as \t, \r, \n.

Query:

query { match :thesis, params[:text] } 

I am querying for the term 'Google'.

Now, I have two test entries in my ElasticSearch index: one contains the legitimate text of an entry I want to index, and the other contains text I made up for testing purposes. In the big text, only one instance of "Google's" is highlighted out of around 14 that are actually present. In the test text, all of them are highlighted.

Here is one instance from the big text where it doesn't highlight "Google's":

Imminent changes to Google’s policies could dramatically lower the

Here is the only instance from the big text where it does highlight "Google's":

I want to ask about Google's pending Toolbar change.

Here is the test text, where highlighting works as expected:

Google's bla is blabla APPLE google is GOOGLE+ blabla facebook bla is yes yes no Google's ononononono tyeyeeyeyye ete pw iepq kw iqpe iwpq google pqiwop qoweo qpwoe qdpw adpw google's ksowoskwo google+

I also ran the query directly against ElasticSearch with curl, but I get the same behavior. Here is the curl query I tried:

curl -XGET http://localhost:9200/postings/_search -d '{
  "query": {
    "match": {
      "thesis": "Google"
    }
  },
  "highlight": {
    "fields": {
      "thesis": {
        "fragment_size": 40,
        "number_of_fragments": 300
      }
    }
  }
}'

Please let me know what I am doing wrong that causes this weird behavior.


Solution

  • OK, never mind, I just realized what the problem was. It is a bit ridiculous, but I am grateful to the StackOverflow code editor: it made me realize that the examples that are not highlighted contain a different apostrophe (the typographic ’ rather than the plain '), and ElasticSearch probably does not stem it correctly.

    Sorry for the silly post, but maybe someone will find it useful in the future. I should mention that the data comes from a form, and who knows how that odd apostrophe got in. I am going to replace these characters with the standard apostrophe when the object is saved.

    This was a really hard one to spot, since my text editors barely show a difference between those two apostrophes.

    Thanks,
    Vlad
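
The cleanup described above could look roughly like the following. This is only a minimal sketch: `normalize_apostrophes`, `find_apostrophe_lookalikes`, and the `APOSTROPHE_LOOKALIKES` constant are hypothetical names, not part of Tire or ElasticSearch, and the list of look-alike characters is an assumption that may need extending for other input.

```ruby
# Hypothetical helpers, not part of Tire or ElasticSearch.
# Assumed set of look-alikes: left/right single quotation marks
# (U+2018, U+2019) and the modifier letter apostrophe (U+02BC).
APOSTROPHE_LOOKALIKES = /[\u2018\u2019\u02BC]/

# Replaces every look-alike with the plain ASCII apostrophe (U+0027).
def normalize_apostrophes(text)
  text.to_s.gsub(APOSTROPHE_LOOKALIKES, "'")
end

# Lists the code points of the look-alikes present in the text, which
# helps when an editor renders them almost identically to U+0027.
def find_apostrophe_lookalikes(text)
  text.to_s.scan(APOSTROPHE_LOOKALIKES).uniq.map { |ch| format("U+%04X", ch.ord) }
end

puts normalize_apostrophes("Imminent changes to Google\u2019s policies")
puts find_apostrophe_lookalikes("Google\u2019s and Google's").inspect
```

One option would be to call `normalize_apostrophes` from a `before_save` callback on the model, or from the existing `index_clean_thesis` method, so that the indexed text only ever contains the plain apostrophe. Entries saved before the change would presumably need to be saved and indexed again.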