[SOLVED] Graphite showing rolling gap in data

I recently upgraded one of our Graphite instances from 0.9.2 to 1.1.1 and have since run into an issue where, for lack of a better term, there is a rolling gap in the data.

The last few minutes display correctly (I'm guessing that's what is in carbon cache), and everything older than about 10-15 minutes displays correctly as well.

However, the 10-15 minute window in between is completely blank. I can see the gap in both Graphite and Grafana. It disappears after restarting carbon cache, then comes back about a day later.
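To pin down exactly where the blank window starts and ends, one option is to scan the datapoints returned by the render API (format=json), where each datapoint is a [value, timestamp] pair and missing values are null. This is only a rough sketch: find_gaps is a hypothetical helper name and the sample data is made up.

```python
from typing import List, Optional, Tuple

# (value, unix timestamp), the shape used in the render API's JSON output
Datapoint = Tuple[Optional[float], int]


def find_gaps(datapoints: List[Datapoint]) -> List[Tuple[int, int]]:
    """Return (start_ts, end_ts) for every run of consecutive null values."""
    gaps = []
    start = None      # timestamp where the current null run began
    prev_ts = None    # timestamp of the previous datapoint
    for value, ts in datapoints:
        if value is None:
            if start is None:
                start = ts
        elif start is not None:
            # A real value closes the run; it ended at the previous timestamp.
            gaps.append((start, prev_ts))
            start = None
        prev_ts = ts
    if start is not None:
        # The series ended while still inside a null run.
        gaps.append((start, prev_ts))
    return gaps


# Made-up sample with 10-second steps and nulls from t=20 through t=40.
sample = [(1.0, 0), (2.0, 10), (None, 20), (None, 30), (None, 40), (3.0, 50)]
print(find_gaps(sample))
```

Running this periodically against one affected metric would show whether the gap really moves with the clock or stays pinned to fixed timestamps.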

[Screenshot: example graph showing the gap]

This happens for most graphs/dashboards I have.

I've spent a lot of effort optimizing disk IO, so I doubt that is the cause: CloudWatch shows 100% burst credit for the disk. It's an m3.xlarge instance with 4 cores and 16 GB RAM. The swap file is on ephemeral storage and looks barely utilized.

I'm running a single Carbon Cache instance with the Whisper backend.

storage_schemas.conf:

[carbon]
pattern = ^carbon\.
retentions = 60:90d
[dumbo]
# load test containers, we don't care about their data
pattern = ^collectd\.dumbo
retentions = 300:1
[collectd]
pattern = ^collectd
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[statsite]
pattern = ^statsite
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
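For anyone reading the schema list: carbon walks the sections in file order and uses the first pattern that matches, so [dumbo] has to come before [collectd] to take effect. A minimal sketch of that lookup, using the patterns from the config above (match_schema and the metric names are made up for illustration, this is not carbon's actual code):

```python
import re

# (section name, pattern, retentions) in file order, as in storage_schemas.conf
SCHEMAS = [
    ("carbon", r"^carbon\.", "60:90d"),
    ("dumbo", r"^collectd\.dumbo", "300:1"),
    ("collectd", r"^collectd", "10s:8h,30s:1d,1m:3d,5m:30d,15m:90d"),
    ("statsite", r"^statsite", "10s:8h,30s:1d,1m:3d,5m:30d,15m:90d"),
    ("default_1min_for_1day", r".*", "60s:1d"),
]


def match_schema(metric):
    """Return (section, retentions) for the first pattern that matches."""
    for name, pattern, retentions in SCHEMAS:
        if re.search(pattern, metric):
            return name, retentions
    return None


# Hypothetical metric names, only to show which section wins.
for metric in ("collectd.dumbo01.cpu.idle", "collectd.web01.load", "foo.bar"):
    print(metric, "->", match_schema(metric))
```

Note that the schema only applies when a Whisper file is created; existing files keep whatever retentions they were created with.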

Non-default (or potentially relevant) carbon.conf settings:

[cache]
MAX_CACHE_SIZE = inf
# was slagging disk write IO until I dropped it down from 500
MAX_UPDATES_PER_SECOND = 100
MAX_CREATES_PER_MINUTE = 50
CACHE_WRITE_STRATEGY = sorted
RELAY_METHOD = rules
DESTINATIONS = 127.0.0.1:2004
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
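As a back-of-the-envelope check on MAX_UPDATES_PER_SECOND: if each update writes all queued points for one metric, a full pass over every active metric takes roughly the metric count divided by the update rate. The metric counts below are assumptions (the actual number isn't stated in this post), and the estimate ignores creates and uneven write ordering, so treat it as a rough sketch rather than a diagnosis.

```python
def full_cycle_seconds(metric_count, max_updates_per_second):
    """Approximate time for carbon-cache to write every cached metric once."""
    return metric_count / max_updates_per_second


# Hypothetical metric counts, with MAX_UPDATES_PER_SECOND = 100 as configured.
for count in (10_000, 60_000, 90_000):
    minutes = full_cycle_seconds(count, 100) / 60
    print(f"{count} metrics -> about {minutes:.1f} minutes per full pass")
```

Points received more recently than one full pass may exist only in the cache, so the web app has to get them over CarbonLink rather than from disk.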

Graphite local_settings.py:

CARBONLINK_TIMEOUT = 10.0
CARBONLINK_QUERY_BULK = True
USE_WORKER_POOL = False

Solution

We've seen this with some workloads on 1.1.1. Can you try updating carbon to current master? If not, 1.1.2 will be released shortly, which should solve the problem.
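To confirm which carbon version a host is actually running before and after the upgrade, something like the following may help. It assumes carbon was installed as a Python package named carbon that is visible to the interpreter running the script, which may not hold for installs under a custom prefix. version_tuple and has_fix are hypothetical helper names, and 1.1.2 is taken from the answer above as the release expected to contain the fix.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '1.1.1' (or '1.2.0.dev0') into a comparable tuple like (1, 1, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def has_fix(installed, fixed_in="1.1.2"):
    """True if the installed version is at least the release expected to fix it."""
    return version_tuple(installed) >= version_tuple(fixed_in)


try:
    installed = metadata.version("carbon")
    status = "has the fix" if has_fix(installed) else "predates 1.1.2"
    print(f"carbon {installed} {status}")
except metadata.PackageNotFoundError:
    print("carbon is not visible to this Python environment")
```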