I'm new to ElasticSearch. Previously I've used it only with Django-Haystack, in a very limited fashion, and have never talked to ES directly.
Currently, I have an ElasticSearch (5.x if this matters) index with a few documents. I'm using Python + elasticsearch-dsl + django-elasticsearch-dsl so I'm indexing database models, but it shouldn't really matter. I'll try to leave this question library-agnostic.
Conceptually, I'm storing users and their posts, all in the same index. The documents for users and for posts have one thing in common: a user_id field.
Users look like this:
{
  "_id": 1,
  "_type": "user_document",
  "username": "jdoe",
  "user_id": 1,
  "title": "Test user"
}
And posts are like this:
{
  "_id": 1,
  "_type": "post_document",
  "user_id": 1,
  "title": "Hello world!",
  "text": "Lorem ipsum test test test..."
}
What I want my app to implement is a single-input search field that does full-text search over both users and their posts (in the real world there are more document "types"; I'm simplifying things here for example purposes). I then want to aggregate by user_id to show just a list of the distinct users that matched.
Currently, I'm running a query like this:
{
  "query": {
    "multi_match": {
      "query": "test",
      "fields": ["username^3", "title^2", "text"]
    }
  },
  "aggs": {
    "user_ids": {"terms": {"field": "user_id"}}
  }
}
Then I read the key of each bucket in the response's aggregations.user_ids.buckets to obtain a list of matching users.
However, that list seems to be ordered simply by document count (so a user with a couple of posts containing the word "test" wins over the user named "test"), and I want to experiment with the ordering. My current idea is to order by the average (or median) _score of the matching documents.
Note: in the real situation there are more than just two document types, so taking a shortcut and querying only a specific _type won't work.
How can I do this? I'm reading the "Sorting by a Metric" chapter, but the ideas there are somewhat lost on me. I made a few attempts, but they were basically nonsense. Can anyone please show a concrete query example (preferably with an explanation of how it was constructed), so I can learn from it?
Here is the Gist with an example dataset, the search query shown above, and the exact results I'm getting. What I want (in test_query_01_results.json) is to have user_id 1 be prioritized over 2, with the logic that 2.0794415 > (0.78306973 + 0.45315093) / 2.
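To make the intended ordering concrete, here is a small sketch of that comparison in plain Python, outside Elasticsearch. The scores are the ones quoted above; the names scores_by_user, avg_score and ranked are made up for illustration.

```python
from statistics import mean

# Scores of the matching documents, grouped by user_id
# (values quoted from the results described above).
scores_by_user = {
    1: [2.0794415],
    2: [0.78306973, 0.45315093],
}

# Average score per user: the metric I'd like the buckets ordered by.
avg_score = {uid: mean(scores) for uid, scores in scores_by_user.items()}

# Highest average first.
ranked = sorted(avg_score, key=avg_score.get, reverse=True)
print(ranked)
```

With these numbers, user 2 averages roughly 0.618, so user 1 comes first.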
Another thing that I feel I'm doing wrong is that I don't use hits at all. I just don't need them, only the aggregated user_id values. If this is okay, is there a way to "disable" them and return only the aggregations?
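Use the following query. Setting "size" to 0 returns no hits, only the aggregations. The "query" part is the same as yours. The terms aggregation is ordered by its avg_score sub-aggregation, which averages the _score of the documents in each bucket.
{
  "size": 0,
  "query": {
    "multi_match": {
      "query": "test",
      "fields": ["username^3", "title^2", "text"]
    }
  },
  "aggs": {
    "user_ids": {
      "terms": {
        "field": "user_id",
        "order": {"avg_score": "desc"}
      },
      "aggs": {
        "avg_score": {
          "avg": {"script": "_score"}
        }
      }
    }
  }
}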
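Since you are working from Python, here is a minimal sketch of building that body as a dict and reading the ordered user ids back out of a response. The helper names build_query and ranked_user_ids are hypothetical, and SAMPLE_RESPONSE is a hand-written stand-in shaped like a terms aggregation response (its averages are computed from the scores quoted in the question), not real Elasticsearch output.

```python
def build_query(text):
    """Search body: no hits, user_id buckets ordered by mean _score."""
    return {
        "size": 0,  # return aggregations only, no hits
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["username^3", "title^2", "text"],
            }
        },
        "aggs": {
            "user_ids": {
                "terms": {
                    "field": "user_id",
                    # order buckets by the sub-aggregation defined below
                    "order": {"avg_score": "desc"},
                },
                "aggs": {"avg_score": {"avg": {"script": "_score"}}},
            }
        },
    }


def ranked_user_ids(response):
    """Bucket keys, in the order Elasticsearch returned them."""
    buckets = response["aggregations"]["user_ids"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written sample (an assumption, for illustration only).
SAMPLE_RESPONSE = {
    "hits": {"total": 3, "hits": []},
    "aggregations": {
        "user_ids": {
            "buckets": [
                {"key": 1, "doc_count": 1, "avg_score": {"value": 2.0794415}},
                {"key": 2, "doc_count": 2, "avg_score": {"value": 0.61811033}},
            ]
        }
    },
}

print(ranked_user_ids(SAMPLE_RESPONSE))
```

With the low-level elasticsearch client, the dict returned by build_query("test") would typically be passed as the body argument of search(), and the returned dict handed to ranked_user_ids; the index name to use depends on your setup.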