elasticsearch kibana

Average of list elements across many documents/records in Elasticsearch


I'm wondering how to calculate the average of list elements across the documents/records I have in Elasticsearch, and how to build a bar-graph dashboard on top of it. Let me explain with a simplified example:

Say I have three documents in ES, each with two array fields: 'runners', an array of strings, and 'runners_times', an array of numbers. The two arrays are aligned by position, so the first element of 'runners' corresponds to the first element of 'runners_times'; in document 1, for example, person_a = 100 and person_b = 120. My three documents/records in ES look like this:

  1. runners: [person_a, person_b], runners_times: [100, 120]
  2. runners: [person_a, person_c], runners_times: [90, 110]
  3. runners: [person_b, person_c], runners_times: [100, 130]

Now, what I want is a bar graph that lists all unique runners across all three documents (in this case 'person_a', 'person_b', and 'person_c') with their corresponding average times. In my case, that would be:

    person_a: 95
    person_b: 110
    person_c: 120

Any tip would be great. Thanks a lot :-)

I'm able to get a list of all unique values in runners, but I'm not sure how to get the average of each person's times, since the times are in a separate list.

Should I perhaps use dictionaries instead, e.g. {'person_a': 100, 'person_b': 120}? I tried that too, but the dictionaries get saved as separate flattened fields instead.


Solution

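  • You should reorganize your data. Each runner and its time must be stored together in a nested field, so that Elasticsearch keeps every runner paired with its own time. Create the target index with the following mapping: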

    PUT /runners_reindexed
    {
        "mappings": {
            "properties": {
                "runner_data": {
                    "type": "nested",
                    "properties": {
                        "runner": {
                            "type": "keyword"
                        },
                        "time": {
                            "type": "integer"
                        }
                    }
                }
            }
        }
    }
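
Why a nested field is needed can be sketched in plain Python. Nothing here calls Elasticsearch; the language choice and the helper name `flattened_average` are assumptions made for illustration. When an array of objects is indexed as a regular object field rather than `nested`, Elasticsearch flattens it into independent multi-value fields, so a runner is no longer tied to its own time, and a per-runner average mixes in the other runners' times from the same document:

```python
from collections import defaultdict

# The three example documents from the question.
docs = [
    {"runners": ["person_a", "person_b"], "runners_times": [100, 120]},
    {"runners": ["person_a", "person_c"], "runners_times": [90, 110]},
    {"runners": ["person_b", "person_c"], "runners_times": [100, 130]},
]

def flattened_average(documents):
    """Mimic averaging over flattened (non-nested) multi-value fields.

    Every time in a document counts for every runner in that document,
    because the runner/time pairing is lost.
    """
    times_by_runner = defaultdict(list)
    for doc in documents:
        for runner in doc["runners"]:
            times_by_runner[runner].extend(doc["runners_times"])
    return {r: sum(t) / len(t) for r, t in sorted(times_by_runner.items())}

print(flattened_average(docs))
# {'person_a': 105.0, 'person_b': 112.5, 'person_c': 107.5}
```

For person_a this gives 105.0 rather than the intended 95, which is the problem the `nested` mapping avoids.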
    

    Index your original documents into the source index runners:

    POST /runners/_bulk
    {"create":{}}
    {"runners": ["person_a", "person_b"], "runners_times": [100, 120]}
    {"create":{}}
    {"runners": ["person_a", "person_c"], "runners_times": [90, 110]}
    {"create":{}}
    {"runners": ["person_b", "person_c"], "runners_times": [100, 130]}
    

    Then reindex the source index into the new index runners_reindexed, using a script that pairs each runner with its time:

    POST _reindex
    {
        "source": {
            "index": "runners"
        },
        "dest": {
            "index": "runners_reindexed"
        },
        "script": {
            "source": """
                    // Parallel arrays from the source document
                    List runners = ctx._source['runners'];
                    List runnerTimes = ctx._source['runners_times'];

                    // Build one {runner, time} object per array position
                    List runnersWithTimes = new ArrayList();
                    for (int i = 0; i < runners.size(); i++) {
                        Map runnerData = new HashMap();
                        runnerData['runner'] = runners[i];
                        runnerData['time'] = runnerTimes[i];
                        runnersWithTimes.add(runnerData);
                    }
                    // Store the pairs under the nested field named in params
                    ctx._source[params['runner_with_time_field_name']] = runnersWithTimes;
            """,
            "params": {
                "runner_with_time_field_name": "runner_data"
            }
        }
    }
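
For readers less familiar with Painless, the transformation performed by the script can be sketched in Python. This only illustrates the logic and is not something Elasticsearch runs; the function name `pair_runners_with_times` is an assumption, and the length check is an extra safeguard that the script itself does not have:

```python
def pair_runners_with_times(source, field_name="runner_data"):
    """Return a copy of `source` with one {runner, time} object per position."""
    runners = source["runners"]
    times = source["runners_times"]
    # Extra safeguard (not in the Painless script): refuse misaligned arrays.
    if len(runners) != len(times):
        raise ValueError("runners and runners_times must have the same length")
    paired = [{"runner": r, "time": t} for r, t in zip(runners, times)]
    # Like the reindex script, keep the original fields and add the new one.
    return {**source, field_name: paired}

doc = {"runners": ["person_a", "person_b"], "runners_times": [100, 120]}
print(pair_runners_with_times(doc)["runner_data"])
# [{'runner': 'person_a', 'time': 100}, {'runner': 'person_b', 'time': 120}]
```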
    

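    Now aggregate: a nested aggregation steps into runner_data, a terms aggregation groups the entries by runner, and an avg sub-aggregation computes each runner's mean time: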

    GET /runners_reindexed/_search?filter_path=aggregations.inside_runner_data.by_runner.buckets
    {
        "aggs": {
            "inside_runner_data": {
                "nested": {
                    "path": "runner_data"
                },
                "aggs": {
                    "by_runner": {
                        "terms": {
                            "field": "runner_data.runner",
                            "size": 10
                        },
                        "aggs": {
                            "mean": {
                                "avg": {
                                    "field": "runner_data.time"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
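
What this aggregation computes can be mimicked in Python to check the numbers by hand. This is a sketch of the logic only; the function `average_time_by_runner` and the in-memory `reindexed_docs` list are assumptions for illustration. Each nested object is treated as its own unit, grouped by `runner`, and the `time` values in each group are averaged:

```python
from collections import defaultdict

# Documents as they look after reindexing (only the nested field is shown).
reindexed_docs = [
    {"runner_data": [{"runner": "person_a", "time": 100},
                     {"runner": "person_b", "time": 120}]},
    {"runner_data": [{"runner": "person_a", "time": 90},
                     {"runner": "person_c", "time": 110}]},
    {"runner_data": [{"runner": "person_b", "time": 100},
                     {"runner": "person_c", "time": 130}]},
]

def average_time_by_runner(documents):
    """Group nested runner_data objects by runner and average their times."""
    times = defaultdict(list)
    for doc in documents:
        for entry in doc["runner_data"]:  # "nested" step: one entry at a time
            times[entry["runner"]].append(entry["time"])  # "terms" step
    # "avg" step; doc_count is the number of nested objects per runner
    return {runner: {"doc_count": len(ts), "mean": sum(ts) / len(ts)}
            for runner, ts in sorted(times.items())}

print(average_time_by_runner(reindexed_docs))
```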
    

    Response

    {
        "aggregations" : {
            "inside_runner_data" : {
                "by_runner" : {
                    "buckets" : [
                        {
                            "key" : "person_a",
                            "doc_count" : 2,
                            "mean" : {
                                "value" : 95.0
                            }
                        },
                        {
                            "key" : "person_b",
                            "doc_count" : 2,
                            "mean" : {
                                "value" : 110.0
                            }
                        },
                        {
                            "key" : "person_c",
                            "doc_count" : 2,
                            "mean" : {
                                "value" : 120.0
                            }
                        }
                    ]
                }
            }
        }
    }
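
If the averages are needed as plain label/value pairs (for example, to plot them as bars), the buckets of a response with this shape can be unpacked as follows. A minimal Python sketch, assuming the response has already been parsed into a dict; the `response` literal is copied from the output above, and the helper name `buckets_to_series` is hypothetical:

```python
response = {
    "aggregations": {"inside_runner_data": {"by_runner": {"buckets": [
        {"key": "person_a", "doc_count": 2, "mean": {"value": 95.0}},
        {"key": "person_b", "doc_count": 2, "mean": {"value": 110.0}},
        {"key": "person_c", "doc_count": 2, "mean": {"value": 120.0}},
    ]}}}
}

def buckets_to_series(resp):
    """Return (labels, values) lists from the by_runner buckets."""
    buckets = resp["aggregations"]["inside_runner_data"]["by_runner"]["buckets"]
    labels = [b["key"] for b in buckets]
    values = [b["mean"]["value"] for b in buckets]
    return labels, values

labels, values = buckets_to_series(response)
print(labels, values)
# ['person_a', 'person_b', 'person_c'] [95.0, 110.0, 120.0]
```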