python elasticsearch pyelasticsearch

Elasticsearch Percolate API always returns no matches when the registered query contains a terms clause


I am trying to use the Elasticsearch percolator and have run into an issue.

Suppose our document looks like this:

{
    "doc": {
        "full_name": "Pacman",
        "company": "Arcade Game LTD",
        "occupation": "hunter", 
        "tags": ["Computer Games"]
    }
}

And our registered query looks like this:

{
    "query": {
        "bool": {
            "must": [
               {
                   "match_phrase":{
                       "occupation":  "hunter"
                   }
               },
               {
                   "terms": {
                       "tags":  [
                           "Computer Games",
                           "Electronic Sports"
                           ],
                       "minimum_match": 1
                   }
               }
            ]
        }
    }
}

I get:

{
   "took": 3,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "total": 0,
   "matches": []
}

I don't know what I'm doing wrong: if I remove the terms clause from the registered query and match only on occupation, it works as expected and I get one match.

Any hints?

Update 1

OK, I think that @Slam's solution is the right direction, but I still have some issues:

I updated my mapping for tags, so it now looks like this:

"tags": {
    "store": True,
    "analyzer": "snowball",
    "type": "string",
    "index": "analyzed",
    "fields": {
        "raw": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}

New document to percolate:

{
    "doc": {
        "full_name": "Pacman",
        "company": "Arcade Game LTD",
        "occupation": "hunter", 
        "tags.raw": ["Computer Games"]
    }
}

When I try to percolate the document above using tags.raw, still no matches are found. I analyzed the tags.raw field, but it looks like it still produces the tokens computer, games and running.


Solution

  • I guess you are using an implicit mapping (the default analyzer) or some other analyzer for your tags field. That means the data ("Computer Games" in your case) is broken into tokens and is no longer available to a terms query as a whole string, because in the index it is now represented as something like computer + game.

    To be able to do term matching on strings, you need to either map them as not_analyzed (to prevent them from being split into tokens), like

    PUT so/pacman/_mapping
    {
      "pacman": {
        "properties": {
          "tags": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
    

    or make your tags field a multi-field, like

    PUT so/pacman/_mapping
    {
      "pacman": {
        "properties": {
          "tags": {
            "type": "string",
            "index": "analyzed",
            "fields": {
              "raw": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }
    

    and query documents with

    GET so/pacman/_search
    {
      "query": {
        "terms": {
          "tags.raw": [
            "Computer Games",
            "Running"
          ],
          "minimum_match": 1
        }
      }
    }
    

    This approach lets you perform both full-text searches and term searches on the same field.

    Regarding your Update 1: once you have put the correct mapping in place and registered a percolator query like

    PUT so/.percolator/1
    {
      "query": {
        "terms": {
          "tags.raw": [
            "Computer Games",
            "Maze running"
          ]
        }
      }
    }
    

    you need to index/percolate documents in a format like

    GET so/pacman/_percolate
    {
      "doc": {
        "full_name": "Pacman",
        "company": "Arcade Game LTD",
        "occupation": "hunter", 
        "tags": ["Computer Games"]
      }
    }
    

    What is happening here: you index/percolate a document with the field tags (without any mention of raw or whatever sub-field your multi-field has). ES takes this field from the JSON, adds it to the index as tags.raw (as a whole string), and at the same time breaks it down into analyzed tokens and puts them in the tags field (the real process is much more complicated, but let's skip that for the sake of simplicity). So you don't need to manage the internals of this field in your documents; you've already done that in your mapping.

    And when the percolator runs, it will look at the tags.raw field in the index (because you created the terms query against this "sub-field"), leaving the analyzed one untouched.
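
To make the reasoning above concrete, here is a toy Python model of what happens to the tags field. It is only a sketch and does not talk to Elasticsearch: toy_analyze, index_multi_field and terms_match are made-up helper names, and the "analyzer" just lowercases and splits on non-alphanumeric characters (a real analyzer such as snowball would also stem, so the actual tokens may differ).

```python
import re


def toy_analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def index_multi_field(doc):
    """Model a multi-field: one source field `tags` feeds two indexed fields."""
    values = doc["tags"]
    return {
        # analyzed main field: only individual tokens survive
        "tags": {tok for value in values for tok in toy_analyze(value)},
        # not_analyzed sub-field: each value is kept as one whole term
        "tags.raw": set(values),
    }


def terms_match(indexed, field, terms):
    """Model a `terms` query: the given terms are compared verbatim, unanalyzed."""
    return any(term in indexed.get(field, set()) for term in terms)


# The document only ever mentions `tags`; the mapping decides about `tags.raw`.
doc = {"tags": ["Computer Games"]}
indexed = index_multi_field(doc)
wanted = ["Computer Games", "Electronic Sports"]

print(sorted(indexed["tags"]))                   # ['computer', 'games']
print(terms_match(indexed, "tags", wanted))      # False
print(terms_match(indexed, "tags.raw", wanted))  # True
```

In this model the terms query against tags fails because the whole string "Computer Games" is not among the indexed tokens, while the same query against tags.raw succeeds even though the document itself only contains a tags key.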