Telegraf - inputs.procstat pgrep plugin issue


Telegraf v1.0.1

I can no longer see the telegraf[._] metric tree after enabling the [[inputs.procstat]] plugin.

Telegraf installed successfully and the process is running. I'm using mostly default settings for the input and output plugins.

This is what I got:

ubuntu@jenkins:/tmp/giga_aks_testing/ansible$ grep -C 2 jenkins /etc/telegraf/telegraf.d/telegraf-custom-host-services-processes.conf; echo ; ps -eAf|grep jenkins; echo; pgrep -f jenkins; echo; cat -n /var/log/telegraf/telegraf.log; echo date; echo; ps -eAf|grep telegraf; echo ; sudo service telegraf status

[[inputs.procstat]]
  exe = "jenkins"
  prefix = "pgrep_serviceprocess"


root      2875  3685  0  2016 pts/3    00:00:00 sudo su jenkins
root      2876  2875  0  2016 pts/3    00:00:00 su jenkins
jenkins   2877  2876  0  2016 pts/3    00:00:00 bash
jenkins  11645     1  0  2016 ?        00:00:01 /usr/bin/daemon --name=jenkins --inherit --env=JENKINS_HOME=/var/lib/jenkins --output=/var/log/jenkins/jenkins.log --pidfile=/var/run/jenkins/jenkins.pid -- /usr/bin/java -Djava.awt.headless=true -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080
jenkins  11647 11645  0  2016 ?        05:33:22 /usr/bin/java -Djava.awt.headless=true -jar /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8080
ubuntu   21973 26885  0 06:57 pts/0    00:00:00 grep --color=auto jenkins

2875
2876
11645
11647

     1  2017-01-07T06:54:00Z E! Error: procstat getting process, exe: [jenkins] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1' 
     2  2017-01-07T06:55:00Z E! Error: procstat getting process, exe: [jenkins] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1' 
     3  2017-01-07T06:56:00Z E! Error: procstat getting process, exe: [jenkins] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1' 
     4  2017-01-07T06:57:00Z E! Error: procstat getting process, exe: [jenkins] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1' 
date

telegraf 19336     1  0 05:45 pts/0    00:00:04 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraftelegraf.d
ubuntu   21977 26885  0 06:57 pts/0    00:00:00 grep --color=auto telegraf

telegraf Process is running [ OK ]
ubuntu@jenkins:/tmp/giga_aks_testing/ansible$ 

Why is the log file showing an error when the jenkins process is running and pgrep -f jenkins returns a valid result?

PS: The [[inputs.procstat]] plugin runs pgrep -f <pattern> when the pattern = method is used, and pgrep <executable> when the exe = method is used.
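
As a rough illustration of the difference between those two modes, here is a minimal Python sketch that emulates name matching versus full-command-line matching on the process list from the ps output above. It is not Telegraf's or pgrep's actual code: the emulate_pgrep helper is hypothetical, the command line for PID 11645 is shortened, and Python's re module only approximates pgrep's extended regular expressions.

```python
import re

# Process table copied from the `ps -eAf | grep jenkins` output above, as
# (pid, user, process name, command line). The process name is assumed to be
# the basename of the executable, which is what plain pgrep matches against.
PROCESSES = [
    (2875, "root", "sudo", "sudo su jenkins"),
    (2876, "root", "su", "su jenkins"),
    (2877, "jenkins", "bash", "bash"),
    # Command line shortened for readability.
    (11645, "jenkins", "daemon",
     "/usr/bin/daemon --name=jenkins --inherit -- /usr/bin/java "
     "-jar /usr/share/jenkins/jenkins.war"),
    (11647, "jenkins", "java",
     "/usr/bin/java -Djava.awt.headless=true "
     "-jar /usr/share/jenkins/jenkins.war --httpPort=8080"),
]


def emulate_pgrep(pattern, full=False):
    """Return PIDs whose name (or command line, if full=True) matches."""
    regex = re.compile(pattern)
    return [
        pid
        for pid, _user, name, cmdline in PROCESSES
        if regex.search(cmdline if full else name)
    ]


print(emulate_pgrep("jenkins"))             # roughly `pgrep jenkins`
print(emulate_pgrep("jenkins", full=True))  # roughly `pgrep -f jenkins`
```

Under these assumptions, the name-only match finds nothing, because none of the process names (sudo, su, bash, daemon, java) contain jenkins, while the full-command-line match returns the same four PIDs that pgrep -f jenkins printed. pgrep exits with status 1 when no process matches, which would be consistent with the 'exit status 1' in the log.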

The full /etc/telegraf/telegraf.d/telegraf-custom-host-services-processes.conf file is:

[[inputs.procstat]]
  exe = "jenkins"
  prefix = "pgrep_serviceprocess"

[[inputs.procstat]]
  exe = "telegraf"
  prefix = "pgrep_serviceprocess"

[[inputs.procstat]]
  exe = "sshd"
  prefix = "pgrep_serviceprocess"

Solution

  • OK. Seems like this is an OPEN bug.

    Telegraf won't complain if a config file contains only one [[inputs.procstat]] entry.

    If you specify multiple entries, Telegraf starts logging those errors even when the exe = <executable> processes are running (PS: this doesn't stop the Telegraf service from working, though).

    To fix the errors, this is what I did:

    [[inputs.procstat]]
      exe = "telegraf|.*"
      prefix = "pgrep_serviceprocess"
    

    Because the [[inputs.procstat]] plugin shells out to pgrep, this runs the following at the OS level: pgrep "telegraf|.*".

    You can also just use exe = "." (simplest) or exe = ".*", but with those it is harder to tell who is actually running a grep over all processes on the system.

    NOTE: .* matches every single process running on the machine, so use it only as a workaround until we get a proper fix for this.

    Related source file on GitHub: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/procstat/procstat.go

    Related issue: https://github.com/influxdata/telegraf/issues/586

    I still couldn't figure out why the "telegraf.x.x" metrics are not available after I enabled the [[inputs.procstat]] input. Is that because of the separate file? I'm not sure. I can see the procstat.x.x metric tree, but the telegraf.x.x metric tree is no longer visible.

    OR better,

    One can also use:

    [[inputs.procstat]]
      pattern = "."
      prefix = "pgrep_serviceprocess"
    

    The above runs pgrep -f ".", where the pattern . matches everything (every process/command/service running on the machine).

    OR the following, although it is not a scalable solution because you have to know which user to look for. On some boxes, Jenkins may be running as a user other than jenkins.

    [[inputs.procstat]]
      user = "jenkins"
      prefix = "pgrep_serviceprocess"
    

    The above runs pgrep -u "jenkins", where the user is jenkins (to catch every process/command/service owned by that user).

    To check whether jenkins or enhanceio is running, you can also use the [[inputs.exec]] plugin. I simply used the [[inputs.filestat]] plugin, and it worked when I looked for the pid file of both tools. https://github.com/influxdata/telegraf/tree/master/plugins/inputs/filestat
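
For the pid-file approach, a minimal Python sketch of what such a check could look like is below. This is only an assumption about how one might script it (for example, behind [[inputs.exec]]); it is not what [[inputs.filestat]] does internally, and the pid_from_file and is_running helpers are hypothetical names. It assumes a POSIX system.

```python
import os


def pid_from_file(path):
    """Return the PID stored in a pid file, or None if missing or invalid."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence and permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Pid file path taken from the --pidfile argument in the ps output above.
    pid = pid_from_file("/var/run/jenkins/jenkins.pid")
    print("jenkins running:", is_running(pid))
```

Unlike a plain file-existence check, this also notices a stale pid file left behind by a process that has already exited.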