I have a collection of documents which belongs to few authors:
[
{ id: 1, author_id: 'mark', content: [...] },
{ id: 2, author_id: 'pierre', content: [...] },
{ id: 3, author_id: 'pierre', content: [...] },
{ id: 4, author_id: 'mark', content: [...] },
{ id: 5, author_id: 'william', content: [...] },
...
]
I'd like to retrieve and paginate a distinct selection of best matching document based upon the author's id:
[
{ id: 1, author_id: 'mark', content: [...], _score: 100 },
{ id: 3, author_id: 'pierre', content: [...], _score: 90 },
{ id: 5, author_id: 'william', content: [...], _score: 80 },
...
]
Here's what I'm currently doing (pseudo-code):
unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }
Problem is right on pagination: How to select 20 "distinct" documents?
Some people are pointing term facets, but I'm not actually doing a tag cloud:
Thanks,
Adit
As at present ElasticSearch does not provide a group_by equivalent, here's my attempt to do it manually.
While the ES community is working for a direct solution to this problem (probably a plugin) here's a basic attempt which works for my needs.
Assumptions.
I'm looking for relevant content
I've assumed that first 300 docs are relevant, so I consider restricting my research to this selection, regardless many or some of these are from the same few authors.
for my needs I didn't "really" needed full pagination, it was enough a "show more" button updated through ajax.
Drawbacks
results are not precise
as we take 300 docs per time we don't know how many unique docs will come out (possibly could be 300 docs from the same author!). You should understand if it fits your average number of docs per author and probably consider a limit.
you need to do 2 queries (waiting remote call cost):
Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116