linux database time-series cnosdb

The node is down, but there are no exceptions in the logs


I created a single-node container. After running for a few days, the node went down, but I found nothing wrong in either the cnosdb logs or the system logs. What happened?

I checked the cnosdb log. The last entries are stamped 2024-01-20T20:32:53Z:

2024-01-20T20:32:53.963266264Z INFO tskv::compaction::picker: Picker: Calculate level scores: [ { Level-1: 1.6 }, { Level-4: 1.6 }, { Level-3: 0.4 } ]

2024-01-20T20:32:53.963393433Z INFO tskv::compaction::picker: Picker: picked level: 1 to 2

2024-01-20T20:32:53.965405243Z INFO tskv::compaction::picker: Picker: picked L1 files(2) does not reach trigger(4), return None

2024-01-20T20:32:53.965539938Z INFO tskv::compaction::job: Starting compaction on ts_family 3

2024-01-20T20:32:53.965603168Z INFO tskv::compaction::picker: Picker: picked no level

2024-01-20T20:32:53.966003610Z INFO tskv::compaction::job: Compacting on vnode(job start): {12: true, 6: true, 3: true} costs 0 sec

I also checked the system log with dmesg -T:

[Sat Jan 20 10:05:16 2024] [36395] 89 36395 22974 268 48 0 0 pickup

[Sat Jan 20 10:05:16 2024] Out of memory: Kill process 32449 (tokio-runtime-w) score 671 or sacrifice child

[Sat Jan 20 10:05:16 2024] Killed process 32449 (tokio-runtime-w), UID 0, total-vm:323738056kB, anon-rss:179339976kB, file-rss:0kB, shmem-rss:0kB

[Sat Jan 20 10:06:14 2024] docker0: port 1(veth798dd8b) entered disabled state

[Sat Jan 20 10:06:14 2024] docker0: port 1(veth798dd8b) entered disabled state

The timestamp of the latest OOM record does not match the time of the last cnosdb log entry at all. So why did the cnosdb service go down?
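To make the size of the killed process easier to read, the "Killed process" record can be decoded with a few lines of Python. This is only a minimal sketch: the regular expression and the helper name `parse_oom_kill` are illustrative assumptions, written against the record format shown above, and the kernel's `kB` figures are treated as units of 1024 bytes.

```python
import re

# Hypothetical pattern for the "Killed process" record format shown above.
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\).*?"
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

def parse_oom_kill(line):
    """Return pid, name and memory sizes in GiB, or None if the line does not match."""
    m = KILL_RE.search(line)
    if m is None:
        return None
    kib_per_gib = 1024 * 1024  # the kernel reports sizes in units of 1024 bytes
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "total_vm_gib": round(int(m["total_vm"]) / kib_per_gib, 2),
        "anon_rss_gib": round(int(m["anon_rss"]) / kib_per_gib, 2),
    }

# The record from the dmesg output above, without its timestamp prefix.
line = ("Killed process 32449 (tokio-runtime-w), UID 0, total-vm:323738056kB, "
        "anon-rss:179339976kB, file-rss:0kB, shmem-rss:0kB")
print(parse_oom_kill(line))
```

For the record above this reports roughly 171 GiB of anonymous resident memory for the killed `tokio-runtime-w` thread.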


Solution

  • Further testing, in which we triggered an OOM on purpose, showed that the timestamps the system log records for OOM events can deviate noticeably from the actual system time. Applying that offset to the time of the last line printed by cnosdb, we found the matching system log entry, which records the cnosdb process being killed for running out of memory. The test went as follows:

    1. Last system log timestamp before triggering the OOM
    dmesg -T
    ...
    [Wed Jan 24 20:53:41 2024] docker0: port 7(xxx) entered forwarding state
    
    2. Current system time
    [root@xxx ~]# date
    Mon Jan 29 10:30:39 CST 2024
    
    3. Trigger the OOM
    [root@cicd_ujv23 ~]# stress --vm 10 --vm-bytes 25G --vm-keep
    stress: info: [38900] dispatching hogs: 0 cpu, 0 io, 10 vm, 0 hdd
    stress: FAIL: [38900] (415) <-- worker 38904 got signal 9
    stress: WARN: [38900] (417) now reaping child worker processes
    stress: FAIL: [38900] (415) <-- worker 38910 got signal 9
    stress: WARN: [38900] (417) now reaping child worker processes
    stress: FAIL: [38900] (451) failed run completed in 40s
    
    4. Check the OOM log
    dmesg -T
    [Sun Jan 28 11:59:09 2024] Out of memory: Kill process 38910 (stress) score 92 or sacrifice child
    [Sun Jan 28 11:59:09 2024] Killed process 38910 (stress), UID 0, total-vm:26221716kB, anon-rss:25472716kB, file-rss:0kB, shmem-rss:0kB

    The OOM record is stamped Sun Jan 28 11:59:09, even though the system clock read Mon Jan 29 10:30:39 shortly before the test, a lag of roughly 22.5 hours.
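
The offset in the test above can be worked out mechanically. The sketch below is only an illustration, not part of the original investigation: the helper names are hypothetical, the `date` output is used with its "CST" suffix removed so that both timestamps are read in the same local zone, and the `date` reading taken shortly before the test stands in for the real time of the OOM, so the result is approximate. For background, the dmesg man page itself warns that `-T` timestamps can be inaccurate because the kernel's time source is not updated across suspend/resume.

```python
from datetime import datetime, timedelta

# Timestamp layout printed by `dmesg -T` in an English locale (no time zone).
DMESG_FMT = "%a %b %d %H:%M:%S %Y"

def parse_stamp(text):
    """Parse a timestamp such as 'Sun Jan 28 11:59:09 2024'."""
    return datetime.strptime(text, DMESG_FMT)

def dmesg_offset(wall_clock, dmesg_stamp):
    """How far a dmesg -T timestamp lags behind the wall clock."""
    return parse_stamp(wall_clock) - parse_stamp(dmesg_stamp)

def to_wall_clock(dmesg_stamp, offset):
    """Shift a dmesg -T timestamp by a previously measured offset."""
    return parse_stamp(dmesg_stamp) + offset

# Values from the test above: `date` output (zone suffix dropped) and the OOM record.
offset = dmesg_offset("Mon Jan 29 10:30:39 2024", "Sun Jan 28 11:59:09 2024")
print(offset)  # 22:31:30
print(to_wall_clock("Sun Jan 28 11:59:09 2024", offset))  # 2024-01-29 10:30:39
```

The offset is not necessarily constant over the lifetime of a host, so it is best measured close to the time of the incident being investigated.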