[SOLVED] distinct count on hive does not match cardinality count on elasticsearch

distinct count on hive does not match cardinality count on elasticsearch

I have loaded data into my elasticsearch cluster from hive using the elasticsearch-hadoop plugin from elastic.

I need to fetch a count of unique account numbers. I have the following queries written in both hql and queryDSL, BUT they are returning different counts.

Hive Query:

select count(distinct account) from <tableName> where capacity="550";

// Returns --> 71132

Similarly, in Elasticsearch the query looks like this:

{
    "query": {
        "bool": {
            "must": [
              {"match": { "capacity": "550"}}
            ]
        }
    },
    "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account"
      }
    }
  }
}

// Returns --> 71607

Am I doing something wrong? What can I do to match the two queries?

Note: There are exactly the same number of records in hive and elasticsearch.

Solution

"the first approximate aggregation provided by Elasticsearch is the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

For the OP

precision_threshold

"precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.
...
Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric."

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

You might also want to take a look at "Support for precise cardinality aggregation #15876"

For the OP, 2

"I have tried several numbers..."

You have 71,132 distinct values while the precision threshold limit is 40,000, therefore the cardinality is over the threshold, which means accuracy is traded for memory saving.
This is how the chosen implementation (based on HyperLogLog++ algorithm) works.