
How do I troubleshoot this error in telegraf?


I have a custom python plugin that I am using to pull data into Telegraf. It prints out line protocol output, as expected.

In my Ubuntu 18.04 environment, when this plugin is run I see a single line in my logs:

2020-12-28T21:55:00Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command '/my_company/plugins-enabled/plugin-mysystem/poll_mysystem.py': Traceback (most recent call last):...

That is it. I can't figure out how to get the actual traceback.

If I run sudo -u telegraf /usr/bin/telegraf -config /etc/telegraf/telegraf.conf, the plugin works as expected. It polls and loads data exactly as it should.

I'm not sure how to move forward with troubleshooting this error when telegraf is executing the plugin on its own.

I have restarted the telegraf service. I have verified permissions (and I think that the execution above shows that it should work).

A few additional details based on the comments and answers received:

Plugin code (/my_company/plugins-enabled/plugin-mysystem/poll_mysystem.py):

from google.auth.transport.requests import Request
from google.oauth2 import id_token
import requests
import os

URL = INTERNAL_URL  # placeholder for the internal endpoint; main() refers to this as URL
MEASUREMENT = "MY_MEASUREMENT"
CREDENTIALS = "GOOGLE_SERVICE_FILE.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = CREDENTIALS  # ENV VAR REQUIRED BY GOOGLE CODE BELOW
CLIENT_ID = VALUE_FROM_GOOGLE

exclude_fields = ["name", "version"] # Don't try to put these into influxdb from json response

def make_iap_request(url, client_id, method="GET", **kwargs):
    # Code provided by Google docs
    # Set the default timeout, if missing
    if "timeout" not in kwargs:
        kwargs["timeout"] = 90

    # Obtain an OpenID Connect (OIDC) token from metadata server or using service
    # account.
    open_id_connect_token = id_token.fetch_id_token(Request(), client_id)

    # Fetch the Identity-Aware Proxy-protected URL, including an
    # Authorization header containing "Bearer " followed by a
    # Google-issued OpenID Connect token for the service account.
    resp = requests.request(method, url, headers={"Authorization": "Bearer {}".format(open_id_connect_token)}, **kwargs)
    if resp.status_code == 403:
        raise Exception("Service account does not have permission to " "access the IAP-protected application.")
    elif resp.status_code != 200:
        raise Exception(
            "Bad response from application: {!r} / {!r} / {!r}".format(resp.status_code, resp.headers, resp.text)
        )
    else:
        return resp.json()


def print_results(results):
    """
    Take the results of a Dolores call and print influx line protocol results
    """
    for item in results["workflow"]:
        line_protocol_line_base = f"{MEASUREMENT},name={item['name']}"
        values = ""
        for key, value in item.items():
            if key not in exclude_fields:
                values = values + f",{key}={value}"
        values = values[1:]
        line_protocol_line = f"{line_protocol_line_base} {values}"
        print(line_protocol_line)


def main():
    current_runtime = make_iap_request(URL, CLIENT_ID, timeout=30)
    print_results(current_runtime)


if __name__ == "__main__":
    main()

Relevant portion of the telegraf.conf file:

[[inputs.exec]]
  ## Commands array
  commands = [
    "/my_company/plugins-enabled/plugin-*/poll_*.py",
  ]

Agent section of the config file:

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = true

What do I do next?


Solution

  • The exec plugin truncates the error output at the first newline, which is why only "Traceback (most recent call last):" shows up in your log. If you wrap your call to make_iap_request in a try/except block and print(e, file=sys.stderr) rather than letting the exception bubble all the way up, the log should tell you more.

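    import sys  # needed for sys.stderr below

    def main():
        """
        Query URL and print line protocol
        """
        try:
            current_runtime = make_iap_request(URL, CLIENT_ID, timeout=30)
            print_results(current_runtime)
        except Exception as e:
            # Send the exception message to stderr instead of a multi-line traceback
            print(e, file=sys.stderr)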
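
    If the exception message alone is not enough, one option is to collapse the whole traceback onto a single line so that nothing is lost at the first newline. The following is only a sketch: format_single_line and run are hypothetical helper names, and poll stands in for whatever wraps make_iap_request and print_results. Returning a non-zero status is an assumption about how you want the failure reported.

```python
import sys
import traceback


def format_single_line(exc):
    """Collapse an exception's full traceback into one line of text."""
    lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    # Each formatted entry may span several lines; flatten and join them
    # with a visible separator so the result contains no newline.
    parts = [part.strip() for line in lines for part in line.splitlines()]
    return " | ".join(part for part in parts if part)


def run(poll):
    """Call poll(); on failure write a one-line traceback to stderr.

    Returns 0 on success and 1 on failure, suitable for sys.exit().
    """
    try:
        poll()
    except Exception as exc:
        print(format_single_line(exc), file=sys.stderr)
        return 1
    return 0
```

    In the plugin this could be used as sys.exit(run(main)) under the __name__ check, with main left without its own try/except.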
    

    Alternatively, your script could log error messages to its own log file rather than passing them back to Telegraf. This would give you more control over what's logged.
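
    A minimal sketch of that approach using the standard logging module follows. The logger name, the helper names build_logger and run_logged, and the log path you pass in are all assumptions for illustration; whichever path you choose must be writable by the telegraf user.

```python
import logging


def build_logger(log_path):
    """Return a logger that appends to log_path (a location you choose)."""
    logger = logging.getLogger("poll_mysystem")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_logged(poll, logger):
    """Call poll(); on failure record the full traceback in the log file.

    Returns 0 on success and 1 on failure, suitable for sys.exit().
    """
    try:
        poll()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("poll failed")
        return 1
    return 0
```

    Because the traceback goes to a file the script controls, it is no longer subject to whatever Telegraf does with the command's stderr.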

    I suspect you're running into an environment issue: something differs between your manual run and the way the Telegraf service runs the plugin. If it isn't permissions, it could be a difference in environment variables.
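
    To compare the two environments, the script could record a few details about how it was started and you could diff the output of a manual run against a service run. This is a sketch, not a diagnosis: environment_snapshot and resolve_beside_script are hypothetical helpers. The second one covers one possibility that the question does not confirm, namely that the bare filename in CREDENTIALS is being resolved against a different working directory when the service runs the script.

```python
import os
import sys


def environment_snapshot():
    """Collect details that commonly differ between a shell and a service."""
    return {
        "uid": os.getuid() if hasattr(os, "getuid") else None,
        "cwd": os.getcwd(),
        "python": sys.executable,
        "path": os.environ.get("PATH", ""),
        "credentials": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", ""),
    }


def resolve_beside_script(filename, script_path):
    """Build an absolute path to filename in the same directory as the script,
    so the result does not depend on the current working directory."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, filename)
```

    Inside the plugin, resolve_beside_script("GOOGLE_SERVICE_FILE.json", __file__) would give a path that is the same no matter where the process was started, assuming the credentials file sits next to the script. The snapshot can be written to stderr or a log file as a single line, for example with print(environment_snapshot(), file=sys.stderr).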