[SOLVED] select distinct from elasticsearch

I have a collection of documents that belong to a few authors:

[
  { id: 1, author_id: 'mark', content: [...] },
  { id: 2, author_id: 'pierre', content: [...] },
  { id: 3, author_id: 'pierre', content: [...] },
  { id: 4, author_id: 'mark', content: [...] },
  { id: 5, author_id: 'william', content: [...] },
  ...
]

I'd like to retrieve and paginate the best-matching documents, keeping only one document per author_id:

[
  { id: 1, author_id: 'mark', content: [...], _score: 100 },
  { id: 3, author_id: 'pierre', content: [...], _score: 90 },
  { id: 5, author_id: 'william', content: [...], _score: 80 },
  ...
]

Here's what I'm currently doing (pseudo-code):

unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }

The problem is pagination: how do I select 20 "distinct" documents, given that deduplicating a page after fetching it can leave fewer than 20?

Some people point to term facets, but I'm not actually building a tag cloud.

Thanks,
Adit

Solution

Since ElasticSearch does not currently provide a group_by equivalent, here's my attempt to do it manually.
While the ES community works on a direct solution to this problem (probably a plugin), this basic approach works for my needs.

Assumptions

I'm looking for relevant content.
I assume the first 300 docs are the relevant ones, so I restrict my search to this selection, regardless of whether many of them come from the same few authors.
For my needs I didn't really need full pagination; a "show more" button updated through ajax was enough.

Drawbacks

Results are not precise.
Since we take 300 docs at a time, we don't know how many unique docs will come out (it could even be 300 docs from the same author!). You should check whether this fits your average number of docs per author, and probably consider a limit.
You need to do 2 queries (paying the remote call cost each time):
- the first query asks for 300 relevant docs with just these fields: id & author_id
- the second query retrieves the full docs for the paginated ids

Here's some Ruby pseudo-code: https://gist.github.com/saxxi/6495116
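
For readers who can't open the gist, here is a minimal sketch of the deduplicate-then-paginate step described above. It is not the code from the gist. The Hit struct and the distinct_by_author and page_ids helpers are hypothetical names made up for illustration. The hits array stands in for the response of the first query (id and author_id only, already sorted by descending score), and no real Elasticsearch client call is shown.

```ruby
# Hypothetical stand-in for one result of the first query:
# only id and author_id are loaded, ordered by descending score.
Hit = Struct.new(:id, :author_id)

# Keep the best-scoring hit per author. Array#uniq keeps the first
# occurrence, so the input order (by score) decides which doc wins.
def distinct_by_author(hits)
  hits.uniq(&:author_id)
end

# Return the ids for one "show more" page of the distinct hits.
# These ids are what the second query would fetch in full.
def page_ids(hits, page:, per_page: 20)
  pages = distinct_by_author(hits).each_slice(per_page).to_a
  pages.fetch(page - 1, []).map(&:id)
end

# Sample data (assumed, shaped like the documents in the question).
hits = [
  Hit.new(1, 'mark'), Hit.new(3, 'pierre'), Hit.new(2, 'pierre'),
  Hit.new(4, 'mark'), Hit.new(5, 'william')
]

first_page  = page_ids(hits, page: 1, per_page: 2)
second_page = page_ids(hits, page: 2, per_page: 2)
beyond_last = page_ids(hits, page: 3, per_page: 2)
```

With this shape, the "show more" button only needs to send the next page number. An empty result means the 300-doc window has no more distinct authors, which is the imprecision listed under Drawbacks.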