Im using an RRD to monitor a data source. We are seeing many occasions where the RRD stores a NaN result despite the fact that we know data was received as we are also appending the received data to a file for testing. When we examine the difference we see the following:
I tried to paste the data as two columns but it hasnt structured properly but in essence what we see below is two columns of a spreadsheet. The left column is the rrd dump and the right column is the actual data that arrived at that time.
" <!-- 2017-09-28 06:00:00 UTC / 1506578400 --> <row><v>1.1999200000e+06</v></row>" 1506578412:1202000
" <!-- 2017-09-28 06:05:00 UTC / 1506578700 --> <row><v>1.2538400000e+06</v></row>" 1506578712:1256000
" <!-- 2017-09-28 06:10:00 UTC / 1506579000 --> <row><v>1.2310400000e+06</v></row>" 1506579012:1230000
" <!-- 2017-09-28 06:15:00 UTC / 1506579300 --> <row><v>1.2415200000e+06</v></row>" 1506579312:1242000
" <!-- 2017-09-28 06:20:00 UTC / 1506579600 --> <row><v>1.2304800000e+06</v></row>" 1506579612:1230000
" <!-- 2017-09-28 06:25:00 UTC / 1506579900 --> <row><v>1.2357600000e+06</v></row>" 1506579912:1236000
" <!-- 2017-09-28 06:30:00 UTC / 1506580200 --> <row><v>1.1284800000e+06</v></row>" 1506580212:1124000
" <!-- 2017-09-28 06:35:00 UTC / 1506580500 --> <row><v>1.2238400000e+06</v></row>" 1506580512:1228000
" <!-- 2017-09-28 06:40:00 UTC / 1506580800 --> <row><v>NaN</v></row>" 1506580813:1222000
" <!-- 2017-09-28 06:45:00 UTC / 1506581100 --> <row><v>1.2400000000e+06</v></row>" 1506581112:1240000
" <!-- 2017-09-28 06:50:00 UTC / 1506581400 --> <row><v>1.2284800000e+06</v></row>" 1506581412:1228000
" <!-- 2017-09-28 06:55:00 UTC / 1506581700 --> <row><v>8.9392000000e+05</v></row>" 1506581712:880000
" <!-- 2017-09-28 07:00:00 UTC / 1506582000 --> <row><v>NaN</v></row>" 1506582014:1000000
" <!-- 2017-09-28 07:05:00 UTC / 1506582300 --> <row><v>NaN</v></row>" 1506582315:738000
" <!-- 2017-09-28 07:10:00 UTC / 1506582600 --> <row><v>1.1760000000e+06</v></row>" 1506582613:1176000
" <!-- 2017-09-28 07:15:00 UTC / 1506582900 --> <row><v>1.1874800000e+06</v></row>" 1506582912:1188000
" <!-- 2017-09-28 07:20:00 UTC / 1506583200 --> <row><v>1.2033600000e+06</v></row>" 1506583212:1204000
" <!-- 2017-09-28 07:25:00 UTC / 1506583500 --> <row><v>1.2097600000e+06</v></row>" 1506583512:1210000
" <!-- 2017-09-28 07:30:00 UTC / 1506583800 --> <row><v>1.0717600000e+06</v></row>" 1506583811:1066000
" <!-- 2017-09-28 07:35:00 UTC / 1506584100 --> <row><v>NaN</v></row>" 1506584112:1222000
" <!-- 2017-09-28 07:40:00 UTC / 1506584400 --> <row><v>1.1760000000e+06</v></row>" 1506584412:1176000
" <!-- 2017-09-28 07:45:00 UTC / 1506584700 --> <row><v>1.2048000000e+06</v></row>" 1506584712:1206000
" <!-- 2017-09-28 07:50:00 UTC / 1506585000 --> <row><v>1.0255200000e+06</v></row>" 1506585012:1018000
" <!-- 2017-09-28 07:55:00 UTC / 1506585300 --> <row><v>1.2004000000e+06</v></row>" 1506585312:1208000
" <!-- 2017-09-28 08:00:00 UTC / 1506585600 --> <row><v>1.1676800000e+06</v></row>" 1506585612:1166000
" <!-- 2017-09-28 08:05:00 UTC / 1506585900 --> <row><v>1.2024800000e+06</v></row>" 1506585912:1204000
" <!-- 2017-09-28 08:10:00 UTC / 1506586200 --> <row><v>1.2116800000e+06</v></row>" 1506586212:1212000
" <!-- 2017-09-28 08:15:00 UTC / 1506586500 --> <row><v>NaN</v></row>" 1506586513:886000
" <!-- 2017-09-28 08:20:00 UTC / 1506586800 --> <row><v>1.1940000000e+06</v></row>" 1506586812:1194000
" <!-- 2017-09-28 08:25:00 UTC / 1506587100 --> <row><v>1.1959200000e+06</v></row>" 1506587112:1196000
" <!-- 2017-09-28 08:30:00 UTC / 1506587400 --> <row><v>NaN</v></row>" 1506587413:1206000
" <!-- 2017-09-28 08:35:00 UTC / 1506587700 --> <row><v>1.1440000000e+06</v></row>" 1506587712:1144000
" <!-- 2017-09-28 08:40:00 UTC / 1506588000 --> <row><v>NaN</v></row>" 1506588013:668000
" <!-- 2017-09-28 08:45:00 UTC / 1506588300 --> <row><v>1.2080000000e+06</v></row>" 1506588312:1208000
" <!-- 2017-09-28 08:50:00 UTC / 1506588600 --> <row><v>NaN</v></row>" 1506588613:1156000
" <!-- 2017-09-28 08:55:00 UTC / 1506588900 --> <row><v>1.2080000000e+06</v></row>" 1506588912:1208000
" <!-- 2017-09-28 09:00:00 UTC / 1506589200 --> <row><v>1.1945600000e+06</v></row>" 1506589212:1194000
" <!-- 2017-09-28 09:05:00 UTC / 1506589500 --> <row><v>1.1786400000e+06</v></row>" 1506589512:1178000
" <!-- 2017-09-28 09:10:00 UTC / 1506589800 --> <row><v>1.1396000000e+06</v></row>" 1506589811:1138000
" <!-- 2017-09-28 09:15:00 UTC / 1506590100 --> <row><v>NaN</v></row>" 1506590113:1006000
" <!-- 2017-09-28 09:20:00 UTC / 1506590400 --> <row><v>1.1780000000e+06</v></row>" 1506590412:1178000
" <!-- 2017-09-28 09:25:00 UTC / 1506590700 --> <row><v>1.1799200000e+06</v></row>" 1506590712:1180000
" <!-- 2017-09-28 09:30:00 UTC / 1506591000 --> <row><v>1.1953600000e+06</v></row>" 1506591012:1196000
" <!-- 2017-09-28 09:35:00 UTC / 1506591300 --> <row><v>1.1806400000e+06</v></row>" 1506591312:1180000
" <!-- 2017-09-28 09:40:00 UTC / 1506591600 --> <row><v>1.1588800000e+06</v></row>" 1506591612:1158000
" <!-- 2017-09-28 09:45:00 UTC / 1506591900 --> <row><v>1.2002400000e+06</v></row>" 1506591912:1202000
" <!-- 2017-09-28 09:50:00 UTC / 1506592200 --> <row><v>1.0656800000e+06</v></row>" 1506592212:1060000
" <!-- 2017-09-28 09:55:00 UTC / 1506592500 --> <row><v>1.2078400000e+06</v></row>" 1506592512:1214000
" <!-- 2017-09-28 10:00:00 UTC / 1506592800 --> <row><v>1.1640800000e+06</v></row>" 1506592812:1162000
" <!-- 2017-09-28 10:05:00 UTC / 1506593100 --> <row><v>1.1754400000e+06</v></row>" 1506593112:1176000
We can see the occasions where the data seems not to be accepted are almost always when the time it arrives is somewhat outside the trend.
How can we go about widening the acceptance criteria so that all of these datapoints are accepted?
RRD info for the RRD in question is shown below:
root@ra:/var/www/genie/public_html# rrdtool info /an/data/SI1.rrd
filename = "/an/data/SI1.rrd"
rrd_version = "0003"
step = 300
last_update = 1506594312
header_size = 1000
ds[probe1-temp].index = 0
ds[probe1-temp].type = "GAUGE"
ds[probe1-temp].minimal_heartbeat = 300
ds[probe1-temp].min = 0.0000000000e+00
ds[probe1-temp].max = 5.0000000000e+06
ds[probe1-temp].last_ds = "1226000"
ds[probe1-temp].value = NaN
ds[probe1-temp].unknown_sec = 12
rra[0].cf = "MIN"
rra[0].rows = 1440
rra[0].cur_row = 238
rra[0].pdp_per_row = 12
rra[0].xff = 5.0000000000e-01
rra[0].cdp_prep[0].value = 1.1754400000e+06
rra[0].cdp_prep[0].unknown_datapoints = 2
rra[1].cf = "MAX"
rra[1].rows = 1440
rra[1].cur_row = 1220
rra[1].pdp_per_row = 12
rra[1].xff = 5.0000000000e-01
rra[1].cdp_prep[0].value = 1.2140000000e+06
rra[1].cdp_prep[0].unknown_datapoints = 2
rra[2].cf = "AVERAGE"
rra[2].rows = 1440
rra[2].cur_row = 1205
rra[2].pdp_per_row = 1
rra[2].xff = 5.0000000000e-01
rra[2].cdp_prep[0].value = NaN
rra[2].cdp_prep[0].unknown_datapoints = 0
root@ra:#
You have set the DS heartbeat to 300, but also the step to 300.
This means that, if your data arrive 300s or more apart, then they will be stored as NaN, which is what you are seeing. From the stats you give, you can see that on the NaN rows, the actual time interval is 301 or 302 sec, which is >300 and so results in a NaN as it exceeds the heartbeat time.
You should normally set the heartbeat to twice the expected data interval, IE twice the step, so as to handle this case.
Try setting the heartbeat to 600; this should solve the problem.