I have a simple watchdog mechanism implemented as follows:
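/* Needs <errno.h>, <fcntl.h>, <stdbool.h>, <stdio.h>, <string.h>, <time.h>, <unistd.h> */
time_t timer = time(NULL);
int n = 0;
while (true) {
    /* Send a reset command roughly every 10 seconds. */
    if (time(NULL) >= timer + 10) {
        timer = time(NULL);
        char szData[64];
        memset(szData, 0, sizeof szData);
        /* snprintf cannot overflow szData, unlike sprintf. */
        snprintf(szData, sizeof szData, "s|%d|%s", tid, function_name);
        if ((n = strlen(szData)) > 0) {
            int t = 0;
            int fp = 0;
            /* Retry up to 5 times while access is denied (the errno constant is EACCES). */
            while ((fp = open("/proc/counters", O_WRONLY | O_EXCL)) == -1 && errno == EACCES) {
                if (++t > 4) {
                    break;
                }
                usleep(260000);
            }
            if (fp == -1) {
                return 1;
            }
            /* Note: the result of write() is not checked here. */
            write(fp, szData, n);
            close(fp);
        }
    }
}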
With this code, each counter should never exceed 10, but my problem is that sometimes some of them reach 20, then 30, and so on. On the kernel module side, I can see that the reset command does not arrive in time and that, on the next round, the module receives two commands within the same second, as if the first one had been delayed. Example:
Content of the /proc/counters file:
*856 7 20/600 thread_function_name
Debug output from the kernel module:
Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 114.086453] register 856
Aug 26 11:16:38 XWEB-PRO kern.info kernel: [ 124.138523] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 134.190508] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242274] register 856
Aug 26 11:16:58 XWEB-PRO kern.info kernel: [ 144.242277] register 856
Aug 26 11:17:08 XWEB-PRO kern.info kernel: [ 154.294433] register 856
Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 164.346516] register 856
Aug 26 11:17:28 XWEB-PRO kern.info kernel: [ 174.398552] register 856
Aug 26 11:17:38 XWEB-PRO kern.info kernel: [ 184.468022] register 856
Aug 26 11:17:48 XWEB-PRO kern.info kernel: [ 194.522241] register 856
As you can see, the command at 11:16:48 is missing, but there are 3 at 11:16:58. Then the one at 11:17:18 is missing, but there are 2 at 11:17:28. I have already tried fflush, fsync and sync, with no luck. Can anybody point me in the right direction? Thank you.
I finally spotted the problem: it looks like it was a race condition. There are 10 threads writing to the /proc file at the same time, so some write operations were probably missed by the kernel module (indeed, the reset always happened at a multiple of 10 seconds). I wrapped the open/write/close sequence in user space in a mutex, and the problem seems to be gone.
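A minimal sketch of that kind of fix might look like the following. It assumes all writers are threads of the same process, so a single static pthread mutex is enough to serialize them. The names `send_reset` and `counters_lock` are hypothetical, and the path is passed as a parameter only so the function can be tried against an ordinary file. The retry loop from the question is left out for brevity.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* One lock shared by every watchdog thread in the process (hypothetical name). */
static pthread_mutex_t counters_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical helper: formats "s|<tid>|<name>" and sends it to `path`,
 * holding the lock for the whole open/write/close sequence so that only
 * one thread at a time talks to the file.
 * Returns 0 on success, -1 on any failure.
 */
int send_reset(const char *path, int tid, const char *name)
{
    char buf[64];
    int n;
    int fd;
    int rc = -1;

    assert(path != NULL && name != NULL);

    n = snprintf(buf, sizeof buf, "s|%d|%s", tid, name);
    if (n < 0 || (size_t)n >= sizeof buf) {
        return -1;              /* formatting failed or message too long */
    }

    pthread_mutex_lock(&counters_lock);
    fd = open(path, O_WRONLY);
    if (fd != -1) {
        /* Treat a short write as a failure: the command must go out whole. */
        if (write(fd, buf, (size_t)n) == n) {
            rc = 0;
        }
        close(fd);
    }
    pthread_mutex_unlock(&counters_lock);

    return rc;
}
```

Each thread would then call something like `send_reset("/proc/counters", tid, function_name)` from its 10-second check instead of opening the file itself. Note that a pthread mutex only serializes threads within one process; writers in separate processes would need a different mechanism, or locking inside the kernel module.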