Elasticsearch: return phonetic tokens with search results

Tags: python, elasticsearch, elasticsearch-phonetic


I use the phonetic analysis plugin for Elasticsearch to do some string matching based on phonetic transformations.

My problem is: how do I get the phonetic transformations computed by Elasticsearch in the results of a query?

First, I create an index with a metaphone transformation:

from elasticsearch import Elasticsearch

# Client connected to the local Elasticsearch instance
es = Elasticsearch()

request_body = {
    'settings': {
        'index': {
            'analysis': {
                'analyzer': {
                    'metaphone_analyzer': {
                        'tokenizer': 'standard',
                        'filter': [
                            'ascii_folding_filter', 'lowercase',
                            'metaphone_filter'
                        ]
                    }
                },
                'filter': {
                    'metaphone_filter': {
                        'type': 'phonetic',
                        'encoder': 'metaphone',
                        'replace': False
                    },
                    'ascii_folding_filter': {
                        'type': 'asciifolding',
                        'preserve_original': True
                    }
                }
            }
        }
    },
    'mappings': {
        'person_name': {
            'properties': {
                'full_name': {
                    'type': 'text',
                    'fields': {
                        'metaphone_field': {
                            'type': 'text',
                            'analyzer': 'metaphone_analyzer'
                        }
                    }
                }
            }
        }
    }
}

res = es.indices.create(index="my_index", body=request_body)

Then, I add some data:

# Add some data
names = [{
    "full_name": "John Doe"
}, {
    "full_name": "Bob Alice"
}, {
    "full_name": "Foo Bar"
}]

for name in names:
    res = es.index(index="my_index",
                   doc_type='person_name',
                   body=name,
                   refresh=True)

And finally, I query a name:

es.search(index="my_index",
          body={
              "size": 5,
              "query": {
                  "multi_match": {
                      "query": "Jon Doe",
                      "fields": "*_field"
                  }
              }
          })

Search returns:

{
    'took': 1,
    'timed_out': False,
    '_shards': {
        'total': 5,
        'successful': 5,
        'skipped': 0,
        'failed': 0
    },
    'hits': {
        'total': 1,
        'max_score': 0.77749264,
        'hits': [{
            '_index': 'my_index',
            '_type': 'person_name',
            '_id': 'AWwYjl4Mqo63y_hLp5Yl',
            '_score': 0.77749264,
            '_source': {
                'full_name': 'John Doe'
            }
        }]
    }
}

In the search results, I would like to get the phonetic transformations of the names computed by Elasticsearch (and ideally also of the query string, though that is less important) when I execute the search.

I know that I could use the explain API, but I would like to avoid a second request; moreover, the explain API seems a little "overkill" for what I want to achieve.

Thanks!


Solution

  • It doesn't look like an easy thing to implement in a single Elasticsearch query, but you could try the analyze API and scripted fields with fielddata enabled, and term vectors might come in handy. Here's how.

    Retrieve tokens from an arbitrary query

    The analyze API is a great tool if you want to understand exactly how Elasticsearch tokenizes your query.

    Using your mapping you could do, for example:

    GET my_index/_analyze
    {
      "analyzer": "metaphone_analyzer",
      "text": "John Doe"
    }
    

    And get something like this as a result:

    {
      "tokens": [
        {
          "token": "JN",
          "start_offset": 0,
          "end_offset": 4,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "john",
          "start_offset": 0,
          "end_offset": 4,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "T",
          "start_offset": 5,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "doe",
          "start_offset": 5,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 1
        }
      ]
    }
    

    This is technically a separate request, but it might still be useful.
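
    Since the question uses the Python client, the same analyze call can be issued with elasticsearch-py (a minimal sketch, assuming the same es client and index name as in the question):

    # Sketch: call the analyze API through the Python client; the
    # analyzer and the text are the same as in the REST example above.
    res = es.indices.analyze(index="my_index",
                             body={
                                 "analyzer": "metaphone_analyzer",
                                 "text": "John Doe"
                             })

    # Each entry carries the token plus its offsets and position.
    print([t["token"] for t in res["tokens"]])
    # e.g. ['JN', 'john', 'T', 'doe']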

    Retrieve tokens from a field of a document

    In theory, we could try to retrieve, from the documents matched by our query, the very same tokens that the analyze API returned in the previous section.

    In practice, Elasticsearch does not keep the tokens of a text field around once it has analyzed it: fielddata is disabled by default. We need to enable it:

    PUT /my_index
    {
      "mappings": {
        "person_name": {
          "properties": {
            "full_name": {
              "fields": {
                "metaphone_field": {
                  "type": "text", 
                  "analyzer": "metaphone_analyzer",
                  "fielddata": true
                }
              }, 
              "type": "text"
            }
          }
        }
      }, 
      "settings": {
        ...
      }
    }
    
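    In terms of the question's Python request_body, enabling fielddata amounts to one extra key on the sub-field (a sketch; here the flag is baked in before the index is first created):

    # Sketch: enable fielddata on the metaphone sub-field so a script can
    # read its tokens back (note: fielddata consumes significant heap).
    request_body['mappings']['person_name']['properties']['full_name'][
        'fields']['metaphone_field']['fielddata'] = True

    res = es.indices.create(index="my_index", body=request_body)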

    Now, we can use scripted fields to ask Elasticsearch to return those tokens.

    The query might look like this:

    POST my_index/_search
    {
      "script_fields": {
        "my tokens": {
          "script": {
            "lang": "painless",
            "source": "doc[params.field].values",
            "params": {
              "field": "full_name.metaphone_field"
            }
          }
        }
      }
    }
    

    And the response would look like this:

    {
      "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
          {
            "_index": "myindex",
            "_type": "person_name",
            "_id": "123",
            "_score": 1,
            "fields": {
              "my tokens": [
                "JN",
                "T",
                "doe",
                "john"
              ]
            }
          }
        ]
      }
    }
    

    As you can see, these are the very same tokens, but in arbitrary order.
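
    Notably, script_fields can be combined with the question's original multi_match query, so the tokens come back in the same request as the search itself (a sketch using the Python client; note that once script_fields is present, _source has to be requested explicitly):

    # Sketch: one request returning both the phonetic matches and the
    # indexed tokens of each hit.
    res = es.search(index="my_index",
                    body={
                        "size": 5,
                        "query": {
                            "multi_match": {
                                "query": "Jon Doe",
                                "fields": ["*_field"]
                            }
                        },
                        # _source is suppressed once script_fields is used,
                        # so ask for it back explicitly.
                        "_source": True,
                        "script_fields": {
                            "my tokens": {
                                "script": {
                                    "lang": "painless",
                                    "source": "doc[params.field].values",
                                    "params": {
                                        "field": "full_name.metaphone_field"
                                    }
                                }
                            }
                        }
                    })

    for hit in res["hits"]["hits"]:
        print(hit["_source"]["full_name"], hit["fields"]["my tokens"])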

    Can we also retrieve information about where these tokens are located in the document?

    Retrieving tokens with their positions

    Term vectors may help. To use them, we don't actually need fielddata enabled. We can look up the term vectors of a document:

    GET my_index/person_name/123/_termvectors
    {
      "fields" : ["full_name.metaphone_field"],
      "offsets" : true,
      "positions" : true
    }
    

    This would return something like this:

    {
      "_index": "myindex",
      "_type": "person_name",
      "_id": "123",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "full_name.metaphone_field": {
          "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 1,
            "sum_ttf": 4
          },
          "terms": {
            "JN": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 4
                }
              ]
            },
            "T": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 5,
                  "end_offset": 8
                }
              ]
            },
            "doe": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 5,
                  "end_offset": 8
                }
              ]
            },
            "john": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 4
                }
              ]
            }
          }
        }
      }
    }
    

    This gives us a way to get the tokens of a document's field exactly as the analyzer produced them.
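
    Through the Python client, the same lookup plus a reconstruction of the token order could look like this (a sketch; the document id 123 is the hypothetical one from the example above):

    # Sketch: fetch term vectors for one document and rebuild the token
    # stream in the order the analyzer emitted it.
    res = es.termvectors(index="my_index",
                         doc_type="person_name",
                         id="123",
                         fields=["full_name.metaphone_field"],
                         positions=True,
                         offsets=True)

    terms = res["term_vectors"]["full_name.metaphone_field"]["terms"]
    positioned = [(tok["position"], tok["start_offset"], term)
                  for term, info in terms.items()
                  for tok in info["tokens"]]

    # Sorting by (position, offset) recovers the analyzer's token order.
    for position, offset, term in sorted(positioned):
        print(position, term)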

    Unfortunately, to my knowledge, there is no way to combine these three approaches into a single query. Also, fielddata should be used with caution, since it consumes a lot of memory.
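
    If memory is a concern, fielddata usage per field can be monitored, for example through the cat API (a sketch using the Python client):

    # Sketch: inspect how much heap each field's fielddata occupies.
    for entry in es.cat.fielddata(format="json"):
        print(entry["field"], entry["size"])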


    Hope this helps!