elasticsearch

Need help trying to "Pivot" data with elasticsearch results


I have a large dataset that consists of documents. Each document has searchable text and a number of metadata fields associated with it. I need to create a summary of a given set of search results within a web application and I'm hoping I'm able to achieve it with Elastic Search (maybe a pivot transform or aggregation?). Often there are thousands of results that match a given query, but we limit the returned data to around 10 results at a time. Paginating gets the next set of results. Currently, in order to get a summary, I repeatedly run the query in batches until I have all the data that matches a given query, and then build up a summary object on the web app side with JavaScript. Since there can be many thousands of results to a given query, the batching process can take a very long time.

What I'm hoping is to group all of the results from a given query by document type. Then for each document type, for each of it's metadata fields, show values that occur in over x% of results. say x is 30% and I have a search that returns these documents:

{
    "document_type": "A",
    "metadata_field_1": "b",
    "metadata_field_2": "i"
},
{
    "document_type": "A",
    "metadata_field_1": "b",
    "metadata_field_2": "iii"
},
{
    "document_type": "A",
    "metadata_field_1": "c",
    "metadata_field_2": "iii"
},
{
    "document_type": "A",
    "metadata_field_1": "c",
    "metadata_field_2": "i"
},
{
    "document_type": "A",
    "metadata_field_1": "a",
    "metadata_field_2": "ii"
},
{
    "document_type": "B",
    "metadata_field_1": "a",
    "metadata_field_2": "ii"
}

When a user requests a summary, I want to return something like this

{
    "document_type": "A",
    "results": 5,
    "commonality_field_1": ["b", "c"],
    "commonality_field_2": ["iii", "i"]
},
{
    "document_type": "B",
    "results": 1,
    "commonality_field_1": ["a"],
    "commonality_field_2": ["ii"]
}

I am far from an Elasticsearch expert so hoping someone might be able to help point me toward an efficient way to achieve this.


Solution

  • Tldr;

    You might not need to use pivot at all, you could be using aggregation to compute those information on the fly.

    Solution

    set up

    POST 79440417/_bulk
    {"index": {}}
    {"document_type": "A","metadata_field_1": "b","metadata_field_2": "i"}
    {"index": {}}
    {"document_type": "A","metadata_field_1": "b","metadata_field_2": "iii"}
    {"index": {}}
    {"document_type": "A","metadata_field_1": "c","metadata_field_2": "iii"}
    {"index": {}}
    {"document_type": "A","metadata_field_1": "c","metadata_field_2": "i"}
    {"index": {}}
    {"document_type": "A","metadata_field_1": "a","metadata_field_2": "ii"}
    {"index": {}}
    {"document_type": "B","metadata_field_1": "a","metadata_field_2": "ii"}
    

    with aggregation

    GET 79440417/_search
    {
      "aggs": {
        "doc_type": {
          "terms": {
            "field": "document_type.keyword"
          },
          "aggs": {
            "summary_field_1": {
              "terms": {
                "field": "metadata_field_1.keyword"
              }
            },
            "summary_field_2": {
              "terms": {
                "field": "metadata_field_2.keyword"
              }
            }
          }
        }
      }
    }
    

    With pivot

    That would look like that:

    PUT _transform/79440417_transform
    {
      "source": {
        "index": "79440417",
        "query": {
          "match_all": {}
        }
      },
      "pivot": {
        "group_by": {
          "document_type": {
            "terms": {
              "field": "document_type.keyword"
            }
          }
        },
        "aggregations": {
          "commonality_field_1":{
            "terms": {
              "field": "metadata_field_1.keyword"
            }
          },
          "commonality_field_2":{
            "terms": {
              "field": "metadata_field_2.keyword"
            }
          }
        }
      },
      "description": "",
      "dest": {
        "index": "79440417_transform"
      },
      "frequency": "5m"
    }