Tags: sorting, search, elasticsearch, aggregation, relevance

Elasticsearch: order terms aggregation by score


I'm new to ElasticSearch. Previously I've used it only with Django-Haystack, in a very limited fashion, and have never talked to ES directly.

Currently, I have an ElasticSearch (5.x if this matters) index with a few documents. I'm using Python + elasticsearch-dsl + django-elasticsearch-dsl so I'm indexing database models, but it shouldn't really matter. I'll try to leave this question library-agnostic.

Conceptually, I'm storing users, and their posts, all in the same index. The documents for users and for posts have one thing in common - a field user_id.

Users look like this:

{
    "_id": 1,
    "_type": "user_document",
    "username": "jdoe",
    "user_id": 1,
    "title": "Test user"
}

And posts are like this:

{
    "_id": 1,
    "_type": "post_document",
    "user_id": 1,
    "title": "Hello world!",
    "text": "Lorem ipsum test test test..."
}

What I want my app to implement is a single-input search field that does full-text search over both users and their posts (in the real world there are more document "types"; I'm simplifying things here for the sake of the example). And I want to aggregate by user_id to show just a list of the distinct users that matched.

Currently, I'm running a query like this:

{
    "query": {
        "multi_match": {
            "query": "test",
            "fields": ["username^3", "title^2", "text"]
        }
    },
    "aggs": {
        "user_ids": {"terms": {"field": "user_id"}}
    }
}

I then read the key of each bucket in the response's aggregations.user_ids.buckets to obtain a list of matching users.

However, that list seems to be ordered simply by document count (so a user with a couple of posts containing the word "test" wins over the user named "test"), and I want to experiment with the ordering. My current idea is to order by the average (or median) _score of the matching documents.

Note: in the real situation there are more than just two document types, so taking a shortcut and querying only a specific _type won't work.

How can I do this? I'm reading the "Sorting by a Metric" chapter, but the ideas there are somewhat lost on me. I made a few attempts but they were basically nonsense. Can anyone please show a concrete query example (preferably with an explanation of how it was constructed), so I can learn from it?

Here is the Gist with an example dataset, the search query shown above, and the exact results I'm getting. What I want (in test_query_01_results.json) is to have user_id 1 be prioritized over 2, with the logic that 2.0794415 > (0.78306973 + 0.45315093) / 2.
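
To make the intended ordering concrete, here is a minimal Python sketch of the comparison above. The assignment of scores to user IDs is an assumption read off that comparison (one hit for user 1, two hits for user 2), and the helper names rank_by_avg_score and rank_by_doc_count are hypothetical, used only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Assumed (user_id, _score) pairs, taken from the comparison above:
# one hit for user 1 and two hits for user 2.
hits = [(1, 2.0794415), (2, 0.78306973), (2, 0.45315093)]


def group_scores(hits):
    """Collect the _score values of all hits per user_id."""
    scores = defaultdict(list)
    for user_id, score in hits:
        scores[user_id].append(score)
    return scores


def rank_by_avg_score(hits):
    """Order user_ids by mean _score, highest first (the desired ordering)."""
    scores = group_scores(hits)
    return sorted(scores, key=lambda uid: mean(scores[uid]), reverse=True)


def rank_by_doc_count(hits):
    """Order user_ids by number of hits, highest first (the current ordering)."""
    scores = group_scores(hits)
    return sorted(scores, key=lambda uid: len(scores[uid]), reverse=True)


print(rank_by_avg_score(hits))  # [1, 2]
print(rank_by_doc_count(hits))  # [2, 1]
```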

Another thing that I feel I'm doing wrong is that I don't use hits at all - I just don't need them - only the aggregated user_id values. If this is okay - is there a way to "disable" them and only return aggregations?


Solution

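  • Use the following query. Setting "size" to 0 suppresses the hits so that only the aggregations are returned; the "query" part is the same as yours. The avg_score sub-aggregation computes the average _score of the matching documents in each user_id bucket, and "order" in the terms aggregation sorts the buckets by that value in descending order.

    {
        "size": 0,
        "query": {
            "multi_match": {
                "query": "test",
                "fields": ["username^3", "title^2", "text"]
            }
        },
        "aggs": {
            "user_ids": {
                "terms": {
                    "field": "user_id",
                    "order": {"avg_score": "desc"}
                },
                "aggs": {
                    "avg_score": {
                        "avg": {"script": "_score"}
                    }
                }
            }
        }
    }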
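
Since the question mentions Python, here is a rough sketch of how the query above could be built as a plain dict and how the ordered user IDs could be read back from the response. The helper names build_query and matched_user_ids, the index name in the comment, and the sample response values are assumptions for illustration; only the overall response shape (aggregations, buckets, key) follows what Elasticsearch returns for a terms aggregation.

```python
def build_query(text, fields=("username^3", "title^2", "text")):
    """Build the search body: no hits, buckets ordered by average _score."""
    return {
        "size": 0,  # return aggregations only, no hits
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
        "aggs": {
            "user_ids": {
                "terms": {
                    "field": "user_id",
                    # Sort buckets by the sub-aggregation defined below.
                    "order": {"avg_score": "desc"},
                },
                "aggs": {"avg_score": {"avg": {"script": "_score"}}},
            }
        },
    }


def matched_user_ids(response):
    """Return bucket keys (user_ids) in the order Elasticsearch returned them."""
    buckets = response["aggregations"]["user_ids"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# With the official client this body would be sent roughly like this
# (hypothetical index name, requires a running cluster):
#   response = Elasticsearch().search(index="my_index", body=build_query("test"))

# Hypothetical response with made-up values, used here instead of a live cluster.
sample_response = {
    "hits": {"total": 3, "hits": []},
    "aggregations": {
        "user_ids": {
            "buckets": [
                {"key": 1, "doc_count": 1, "avg_score": {"value": 2.0794415}},
                {"key": 2, "doc_count": 2, "avg_score": {"value": 0.61811033}},
            ]
        }
    },
}

print(matched_user_ids(sample_response))  # [1, 2]
```

Keeping the body as a plain dict keeps the sketch library-agnostic; the same dict could also be handed to a higher-level wrapper if you prefer one.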