I have two sets of hashes in single index with a different tag as an additional field like below:
Index: hashes-*
sha1-c tag
abcdefg12345 collect
abcdefg12345 collect
bcdefgh12345 collect
cdefghi collect
sha1-m tag
abclefg bad
abcgefg bad
cdefghi bad
How can I find all duplicates between the two tags in my index? I would prefer a datatable in kibana or a visualization but elasticsearch aggregation/query would be sufficient.
FORMAT
t _index sha1-c
t _type sha1-c
t sha1-c cdefghi
t tag sha1-c
t _index sha1-m
t _type sha1-m
t sha1-m cdefghi
t tag sha1-m
Expected end state:
hash_matches
cdefghi
First of all, it's important your two indices have the same field name. Also, for simplicity, the tag
and the sha-1c
field should have a subfield (or themselves) should be keyword
s so that a terms
aggregation will not trip on some wrongly analyzed text.
Here's my suggestion (tested with 5.3). In 2.x one might need to change the script just slightly. In 1.x this solution won't work as there is no pipeline aggregations available there.
The solution is basically creating a terms
aggregation on the sha value, then for each sha it's counting the different tags it finds. If those tags count is higher than two then there is sha that has two of them. Now, it would be a more solid solution if we'd know if each sha value is found only once in each index. It is like so? If so, the aggregation on the tag
field below should be transformed in an agg on _index
field.
DELETE sha1-*
PUT sha1-c
{
"mappings": {
"sha1-c": {
"properties": {
"sha1-c": {
"type": "keyword"
},
"tag": {
"type": "keyword"
}
}
}
}
}
PUT sha1-m
{
"mappings": {
"sha1-m": {
"properties": {
"sha1-c": {
"type": "keyword"
},
"tag": {
"type": "keyword"
}
}
}
}
}
POST /sha1-c/sha1-c/_bulk
{"index":{}}
{"sha1-c":"abcdefg12345","tag":"collect"}
{"index":{}}
{"sha1-c":"abcdefg12345","tag":"collect"}
{"index":{}}
{"sha1-c":"bcdefgh12345","tag":"collect"}
{"index":{}}
{"sha1-c":"cdefghi","tag":"collect"}
POST /sha1-m/sha1-m/_bulk
{"index":{}}
{"sha1-c":"abclefg","tag":"bad"}
{"index":{}}
{"sha1-c":"abcgefg","tag":"bad"}
{"index":{}}
{"sha1-c":"cdefghi","tag":"bad"}
GET /sha1-*/_search
{
"size": 0,
"aggs": {
"myField": {
"terms": {
"field": "sha1-c"
},
"aggs": {
"count_tags": {
"cardinality": {
"field": "tag"
}
},
"values_bucket_filter_by_tags_count": {
"bucket_selector": {
"buckets_path": {
"count": "count_tags"
},
"script": "params.count >= 2"
}
}
}
}
}
}