pythonpython-3.xelasticsearchkibanapyelasticsearch

elasticsearch aggregate unique entries of hashes and find all tags


I have two sets of hashes in single index with a different tag as an additional field like below:

Index: hashes-*

sha1-c            tag
abcdefg12345      collect
abcdefg12345      collect
bcdefgh12345      collect
cdefghi           collect


sha1-m       tag
abclefg      bad
abcgefg      bad
cdefghi      bad

How can I find all duplicates between the two tags in my index? I would prefer a datatable in kibana or a visualization but elasticsearch aggregation/query would be sufficient.

FORMAT

t  _index     sha1-c
t  _type      sha1-c
t  sha1-c     cdefghi
t  tag        sha1-c

t  _index     sha1-m
t  _type      sha1-m
t  sha1-m     cdefghi
t  tag        sha1-m  

Expected end state:

hash_matches 
cdefghi

Solution

  • First of all, it's important your two indices have the same field name. Also, for simplicity, the tag and the sha-1c field should have a subfield (or themselves) should be keywords so that a terms aggregation will not trip on some wrongly analyzed text.

    Here's my suggestion (tested with 5.3). In 2.x one might need to change the script just slightly. In 1.x this solution won't work as there is no pipeline aggregations available there.

    The solution is basically creating a terms aggregation on the sha value, then for each sha it's counting the different tags it finds. If those tags count is higher than two then there is sha that has two of them. Now, it would be a more solid solution if we'd know if each sha value is found only once in each index. It is like so? If so, the aggregation on the tag field below should be transformed in an agg on _index field.

    DELETE sha1-*
    PUT sha1-c
    {
      "mappings": {
        "sha1-c": {
          "properties": {
            "sha1-c": {
              "type": "keyword"
            },
            "tag": {
              "type": "keyword"
            }
          }
        }
      }
    }
    PUT sha1-m
    {
      "mappings": {
        "sha1-m": {
          "properties": {
            "sha1-c": {
              "type": "keyword"
            },
            "tag": {
              "type": "keyword"
            }
          }
        }
      }
    }
    
    POST /sha1-c/sha1-c/_bulk
    {"index":{}}
    {"sha1-c":"abcdefg12345","tag":"collect"}
    {"index":{}}
    {"sha1-c":"abcdefg12345","tag":"collect"}
    {"index":{}}
    {"sha1-c":"bcdefgh12345","tag":"collect"}
    {"index":{}}
    {"sha1-c":"cdefghi","tag":"collect"}
    
    POST /sha1-m/sha1-m/_bulk
    {"index":{}}
    {"sha1-c":"abclefg","tag":"bad"}
    {"index":{}}
    {"sha1-c":"abcgefg","tag":"bad"}
    {"index":{}}
    {"sha1-c":"cdefghi","tag":"bad"}
    
    
    GET /sha1-*/_search
    {
      "size": 0, 
      "aggs": {
        "myField": {
          "terms": {
            "field": "sha1-c"
          },
          "aggs": {
            "count_tags": {
              "cardinality": {
                "field": "tag"
              }
            },
            "values_bucket_filter_by_tags_count": {
              "bucket_selector": {
                "buckets_path": {
                  "count": "count_tags"
                },
                "script": "params.count >= 2"
              }
            }
          }
        }
      }
    }