I have a large dataset that consists of documents. Each document has searchable text and a number of metadata fields associated with it. I need to create a summary of a given set of search results within a web application and I'm hoping I'm able to achieve it with Elastic Search (maybe a pivot transform or aggregation?). Often there are thousands of results that match a given query, but we limit the returned data to around 10 results at a time. Paginating gets the next set of results. Currently, in order to get a summary, I repeatedly run the query in batches until I have all the data that matches a given query, and then build up a summary object on the web app side with JavaScript. Since there can be many thousands of results to a given query, the batching process can take a very long time.
What I'm hoping is to group all of the results from a given query by document type. Then for each document type, for each of it's metadata fields, show values that occur in over x% of results. say x is 30% and I have a search that returns these documents:
{
"document_type": "A",
"metadata_field_1": "b",
"metadata_field_2": "i"
},
{
"document_type": "A",
"metadata_field_1": "b",
"metadata_field_2": "iii"
},
{
"document_type": "A",
"metadata_field_1": "c",
"metadata_field_2": "iii"
},
{
"document_type": "A",
"metadata_field_1": "c",
"metadata_field_2": "i"
},
{
"document_type": "A",
"metadata_field_1": "a",
"metadata_field_2": "ii"
},
{
"document_type": "B",
"metadata_field_1": "a",
"metadata_field_2": "ii"
}
When a user requests a summary, I want to return something like this
{
"document_type": "A",
"results": 5,
"commonality_field_1": ["b", "c"],
"commonality_field_2": ["iii", "i"]
},
{
"document_type": "B",
"results": 1,
"commonality_field_1": ["a"],
"commonality_field_2": ["ii"]
}
I am far from an Elasticsearch expert so hoping someone might be able to help point me toward an efficient way to achieve this.
You might not need to use pivot at all, you could be using aggregation to compute those information on the fly.
POST 79440417/_bulk
{"index": {}}
{"document_type": "A","metadata_field_1": "b","metadata_field_2": "i"}
{"index": {}}
{"document_type": "A","metadata_field_1": "b","metadata_field_2": "iii"}
{"index": {}}
{"document_type": "A","metadata_field_1": "c","metadata_field_2": "iii"}
{"index": {}}
{"document_type": "A","metadata_field_1": "c","metadata_field_2": "i"}
{"index": {}}
{"document_type": "A","metadata_field_1": "a","metadata_field_2": "ii"}
{"index": {}}
{"document_type": "B","metadata_field_1": "a","metadata_field_2": "ii"}
GET 79440417/_search
{
"aggs": {
"doc_type": {
"terms": {
"field": "document_type.keyword"
},
"aggs": {
"summary_field_1": {
"terms": {
"field": "metadata_field_1.keyword"
}
},
"summary_field_2": {
"terms": {
"field": "metadata_field_2.keyword"
}
}
}
}
}
}
That would look like that:
PUT _transform/79440417_transform
{
"source": {
"index": "79440417",
"query": {
"match_all": {}
}
},
"pivot": {
"group_by": {
"document_type": {
"terms": {
"field": "document_type.keyword"
}
}
},
"aggregations": {
"commonality_field_1":{
"terms": {
"field": "metadata_field_1.keyword"
}
},
"commonality_field_2":{
"terms": {
"field": "metadata_field_2.keyword"
}
}
}
},
"description": "",
"dest": {
"index": "79440417_transform"
},
"frequency": "5m"
}