elasticsearch aggregates some values in a single field

I have some raw data

{
  {
    "id":1,
    "message":"intercept_log,UDP,0.0.0.0,68,255.255.255.255,67"
  },
  {
    "id":2,
    "message":"intercept_log,TCP,172.22.96.4,52085,239.255.255.250,3702,1:"
  },
  {
    "id":3,
    "message":"intercept_log,UDP,1.0.0.0,68,255.255.255.255,67"
  },
  {
    "id":4,
    "message":"intercept_log,TCP,173.22.96.4,52085,239.255.255.250,3702,1:"
  }
}

Demand

I want to group this data by the value of the message part of the message. Output value like that

{
  {
    "GroupValue":"TCP",
    "DocCount":"2"
  },
  {
    "GroupValue":"UDP",
    "DocCount":"2"
  }
}

Try

I have tried with these codes but failed


GET systemevent*/_search
{
  "size": 0, 
  "aggs": {
    "tags": {
      "terms": {
        "field": "message.keyword",
        "include": " intercept_log[,，](.*?)[,，].*?"
      }
    }
  },
  "track_total_hits": true
}

Now I try to use pipelines to meet this need.
"aggs" seems to only group fields.
Does anyone have a better idea?

Link

Terms aggregation

Update

My scene is a little special. I collect logs from many different servers, and then import the logs into es. Therefore, there is a big difference between message fields. If you directly use script statements for grouping statistics, it will result in group failure or inaccurate grouping. I try to filter out some data according to the conditions, and then use script to group the operation code (comment code 1), but this code can't group the correct results.

This is my scene to add:

Our team uses es to analyze the server log, uses rsyslog to forward the data to the server center, and then uses logstash to filter and extract the data to es. At this time, there is a field called message in ES, and the value of message is the detailed log information. At this time, we need to count the data containing some values in the message.

comment code 1

POST systemevent*/_search
{
  "size": 0, 
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "intercept_log"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "protocol": {
      "terms": {
        "script": "def values = /,/.split(doc['message.keyword'].value); return values.length > 1 ? values[1] : 'N/A'",
        "size": 10
      }
    }
  },
  "track_total_hits": true
}

comment code 2

POST test2/_search
{
  "size": 0,
  "aggs": {
    "protocol": {
      "terms": {
        "script": "def values = /.*,.*/.matcher( doc['host.keyword'].value ); if( name.matches() ) {return values.group(1) } else { return 'N/A' }",
        "size": 10
      }
    }
  }
}

Solution

The easiest way to solve this is by leveraging scripts in the terms aggregation. The script would simply split on commas and take the second value.

    POST systemevent*/_search
    {
      "size": 0,
      "aggs": {
        "protocol": {
          "terms": {
            "script": "def values = /,/.split(doc['message.keyword'].value); return values.length > 1 ? values[1] : 'N/A';",
            "size": 10
          }
        }
      }
    }

Use Regex

POST test2/_search
{
  "size": 0,
  "aggs": {
    "protocol": {
      "terms": {
        "script": "def m = /.*proto='(.*?)'./.matcher(doc['message.keyword'].value ); if( m.matches() ) { return m.group(1) } else { return 'N/A' }"
      }
    }
  }
}

The results would look like

  "buckets" : [
    {
      "key" : "TCP",
      "doc_count" : 2
    },
    {
      "key" : "UDP",
      "doc_count" : 2
    }
  ]

A better and more efficient way would be to split the message field into new fields using an ingest pipeline or Logstash.