I'm new to telegraf / influxdb2 :)
I've made a quick docker setup to monitor a cloud Ubuntu VM. To do so, I've created a "default-like" docker-compose file with telegraf, influxdb2 and grafana. It works just fine on my Ubuntu laptop, but when I run it on the Ubuntu cloud VM, I get very strange numbers for my metrics.
For example, the system metrics (cpu, mem, etc.) show numbers that make no sense. What is very strange to me is that the docker input works fine :/
This is driving me crazy :) Has anyone already encountered this kind of thing? And again, it works just fine on my laptop :)
docker-compose.yml
version: '3.1'

services:
  grafana:
    image: grafana/grafana
    container_name: grafana
    restart: unless-stopped
    depends_on:
      - telegraf
    volumes:
      - ./grafana/provisioning/:/etc/grafana/provisioning/
      - ./grafana/dashboards/:/var/lib/grafana/dashboards/
      - ./grafana/grafana.ini:/etc/grafana/grafana.ini
    ports:
      - 3000:3000

  influxdb:
    image: influxdb:2.5.1
    container_name: influxdb
    restart: unless-stopped
    ports:
      - 8086:8086
    environment:
      - DOCKER_INFLUXDB_INIT_USERNAME=xxxx
      - DOCKER_INFLUXDB_INIT_PASSWORD=yyyyy
      - DOCKER_INFLUXDB_INIT_ORG=myorg
      - DOCKER_INFLUXDB_INIT_BUCKET=mybucket
      - DOCKER_INFLUXDB_INIT_RETENTION=3w
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=my-token
    volumes:
      - data-influx:/var/lib/influxdb2

  telegraf:
    image: telegraf:1.24.3-alpine
    container_name: telegraf
    restart: unless-stopped
    depends_on:
      - influxdb
    volumes:
      - ./telegraf/etc/telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - /sys:/rootfs/sys:ro
      - /proc:/rootfs/proc:ro
      - /etc:/rootfs/etc:ro
    user: telegraf:999

volumes:
  data-influx:
telegraf.conf
[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  quiet = false
  hostname = "LoubVM"

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "my-token"
  organization = "myorg"
  bucket = "mybucket"

[[inputs.statsd]]
  protocol = "udp"
  max_tcp_connections = 250
  tcp_keep_alive = false
  service_address = ":8125"
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "_"
  parse_data_dog_tags = false
  allowed_pending_messages = 10000
  percentile_limit = 1000

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.disk]]
  mount_points = ["/"]

[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  gather_services = false
  source_tag = false
  container_name_include = []
  container_name_exclude = []
  timeout = "5s"
  total = false
  docker_label_include = []
  docker_label_exclude = []
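One detail worth noting: [[inputs.statsd]] listens on UDP 8125 inside the telegraf container, so a process running on the host can only reach it if the port is published. A compose fragment for the telegraf service (an addition of mine, not part of the file above):

```yaml
  telegraf:
    ports:
      - "8125:8125/udp"   # expose the statsd listener to the host
```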
According to my cloud provider, it's because my Ubuntu VM is a VPS (Virtual Private Server): telegraf picks up some of the hypervisor's metrics, which ends up as wrong data in influx/grafana.
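On the metric weirdness itself: mounting /sys, /proc and /etc under /rootfs only helps if telegraf is told to read them from there, which is done through environment variables. A compose fragment for the telegraf service (my assumption about the intended setup; without these, telegraf reads the container's own /proc):

```yaml
  telegraf:
    environment:
      # point telegraf/gopsutil at the host filesystems mounted above
      - HOST_PROC=/rootfs/proc
      - HOST_SYS=/rootfs/sys
      - HOST_ETC=/rootfs/etc
```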
My workaround is to create new metrics with custom scripts, scheduled by cron, that send the values to telegraf's statsd input. My monitoring script looks like this:
#!/usr/bin/env bash

# Push one value to telegraf's statsd input on UDP 8125.
# "c" marks a statsd counter; with delete_counters = true telegraf
# resets it after each flush ("g" would make it a gauge instead).
function SendToStatsd {
    measurement=$1
    value=$2
    echo "${measurement}:${value}|c" | nc -w10 -u 127.0.0.1 8125
}

## Uptime # Threshold in grafana 2419200 (4 weeks)
myUptime=$(awk '{print $1}' /proc/uptime)
SendToStatsd myUptime "$myUptime"

## Swap
mySwapTotal=$(free | awk '/^Swap:/ {print $2}')
SendToStatsd mySwapTotal "$mySwapTotal"

# etc etc ...
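The statsd wire format the script relies on is just name:value|type, one metric per line; a minimal sketch (statsd_line and the sample values are mine, not part of the script above):

```shell
# Format a statsd metric line: <measurement>:<value>|<type>,
# where type is "c" (counter), "g" (gauge) or "ms" (timing).
statsd_line() {
    printf '%s:%s|%s\n' "$1" "$2" "$3"
}

statsd_line myUptime 123456 c      # -> myUptime:123456|c
statsd_line mySwapTotal 2097148 g  # -> mySwapTotal:2097148|g
```

Piping such lines to `nc -u 127.0.0.1 8125` is exactly what SendToStatsd does, with type fixed to "c".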