Telegraf: intermittent processors.regex results


I am having an intermittent issue with the Telegraf processors.regex plugin (at least that's my best guess).

We are using the following Telegraf configs:

inputs.conf

[[inputs.http]]
  urls = [
    "http://myserver.mycompany.com:8080/some/rest/api",
  ]

  username = "user"
  password = "password"

  name_override = "monitor"

  interval = "600s"
  timeout = "3s"

  data_format = "json"
  json_query = "rows"
  json_string_fields = [ "size" ]
  tagexclude = ["host"]

outputs.conf

[[outputs.influxdb]]
  database = "metrics"
  urls = ["http://influxdb.mycompany.com:8086"]

processors.conf

[[processors.converter]]
  [processors.converter.fields]
    integer = [ "size" ]


# Process order is VERY important here
# Rename the url tag to target
[[processors.rename]]
  [[processors.rename.replace]]
    tag = "url"
    dest = "target"

# Extract the target name from the url (I know we just renamed it ... weird)
[[processors.regex]]
  [[processors.regex.tags]]
    key = "url"
    pattern = '^http://(?P<target>[^:/]+).+'
    replacement = "${target}"

When I run:

telegraf --config telegraf.conf --config-directory telegraf.d --test --debug --input-filter http

I get back the data I expect, and the url tag has been replaced by the target extracted by the regex, i.e.:

monitor,target=myserver.mycompany.com size=123456789i 1627647959000000000

The problem is that in the Grafana graph I have created, I see the original full URL http://myserver.mycompany.com:8080/some/rest/api rather than the processed myserver.mycompany.com. Also, very occasionally, when I run the Telegraf test I see target returned with the full unprocessed URL, i.e.:

monitor,target=http://myserver.mycompany.com:8080/some/rest/api size=123456789i 1627647959000000000

The rest of the data is correct and has been processed, i.e. the size string returned in the JSON is always converted to an integer and url is always renamed to target.

Even stranger, I have pushed this config (with different urls in inputs.http depending on the region) to a number of servers and the majority of them work exactly as expected; only a few show this behaviour. I have checked that the Telegraf version on every server matches (1.19.1), and they are all running CentOS 7. I have also tried clearing the data from InfluxDB.

The few servers that write the full URL into target always do so, even though the Telegraf test run on those same servers shows the hostname extracted as it should be.

Any hints as to where to look next?


Solution

  • I have found the cause!

    From the Telegraf docs:

    The following config parameters are available for all processors:

    order: This is the order in which the processor(s) get executed. If this is not specified then processor execution order will be random.

    Even my own comments reveal why it's an issue:

    # Process order is VERY important here
    # Rename the url tag to target
    # Extract the target name from the url (I know we just renamed it ... weird)
    

    Yes, it is weird. It only appeared to work because my tests kept landing on the same side of a 50:50 chance (regex first, then rename), while the other order is equally likely. In the wrong order, the url tag has already been renamed to target by the time the regex runs, so the regex finds no url tag and has nothing to process.

    The solution is to use order.

    processors.conf

    [[processors.converter]]
      [processors.converter.fields]
        integer = [ "size" ]
    
    # Extract the target name from the url
    [[processors.regex]]
      order = 1
      [[processors.regex.tags]]
        key = "url"
        pattern = '^http://(?P<target>[^:/]+).+'
        replacement = "${target}"
    
    # Rename the url tag to target
    [[processors.rename]]
      order = 2
      [[processors.rename.replace]]
        tag = "url"
        dest = "target"
    

    Now the regex will always run before the rename.
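
To make the ordering effect concrete, here is a rough Python sketch that models the two processors as plain functions over a dict of tags and runs them in both orders. It is only an illustration, not how Telegraf is implemented: the function names rename, regex and run are made up for this example, and Python's \g<target> stands in for the "${target}" replacement in the config.

```python
import re

# Same pattern as in processors.conf; Go and Python both accept (?P<name>...).
PATTERN = re.compile(r"^http://(?P<target>[^:/]+).+")


def rename(tags):
    """Hypothetical model of processors.rename: move the "url" tag to "target"."""
    tags = dict(tags)
    if "url" in tags:
        tags["target"] = tags.pop("url")
    return tags


def regex(tags):
    """Hypothetical model of processors.regex: rewrite the "url" tag if present."""
    tags = dict(tags)
    if "url" in tags:
        # \g<target> plays the role of "${target}" in the Telegraf config.
        tags["url"] = PATTERN.sub(r"\g<target>", tags["url"])
    return tags


def run(processors, tags):
    """Apply the processors one after another, in the given order."""
    for processor in processors:
        tags = processor(tags)
    return tags


metric_tags = {"url": "http://myserver.mycompany.com:8080/some/rest/api"}

# regex first, then rename: target ends up as the bare hostname.
good = run([regex, rename], metric_tags)

# rename first, then regex: no "url" tag is left, so target keeps the full URL.
bad = run([rename, regex], metric_tags)

print(good)
print(bad)
```

The second order reproduces the symptom from the question: the rename still happens, so the tag is called target, but it carries the unprocessed URL.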