ubuntu, kubernetes, cassandra, cgroups

Why OutOfMem error for Cassandra pod on Kubernetes despite sufficient RAM in worker node?


I am running a Cassandra pod on a Kubernetes worker node named kworker4, which has 16 GB of RAM; no other pods are running on that node. I am using the bitnami/cassandra Helm chart.

The memory and CPU usage across the Kubernetes cluster looks like this:

NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
kmaster    225m         2%     1672Mi          10%
kworker1   67m          0%     2499Mi          16%
kworker2   28m          0%     1834Mi          11%
kworker3   78m          0%     1487Mi          9%
kworker4   359m         4%     1988Mi          12%
kworker5   76m          0%     1244Mi          7%

When I set maxHeapSize: "2048M" in the values.yaml of the bitnami/cassandra Helm chart, the cassandra-0 pod starts up fine.

But when I set maxHeapSize: "4096M" in the same values.yaml, the cassandra-0 pod does not start, even though kworker4 has 16 GB of RAM and top shows very low memory utilization on the node while the cassandra-0 pod is not running.
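
For reference, this is roughly the heap setting in my values.yaml (depending on the chart version the key may need to sit under a jvm block; helm show values bitnami/cassandra shows the exact structure):

jvm:
  maxHeapSize: "2048M"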

What is the issue?

I checked the Ubuntu syslog on kworker4 and found the following:

Jul 10 13:44:47 kworker4 systemd[1]: Started Session 66 of User kadmin.
Jul 10 13:49:37 kworker4 NetworkManager[938]: <info>  [1720619377.9003] dhcp4 (wlp5s0): state changed new lease, address=10.227.25.233
Jul 10 13:52:17 kworker4 kubelet[2861]: I0710 13:52:17.334274    2861 topology_manager.go:215] "Topology Admit Handler" podUID="45d3f529-6498-42a7-a26c-b255967b2533" podNamespace="default" podName="cassandra-0"
Jul 10 13:52:17 kworker4 kubelet[2861]: E0710 13:52:17.334380    2861 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="f2e22176-28e4-4a8f-87c1-2806a522832e" containerName="cassandra"
Jul 10 13:52:17 kworker4 kubelet[2861]: I0710 13:52:17.334422    2861 memory_manager.go:354] "RemoveStaleState removing state" podUID="f2e22176-28e4-4a8f-87c1-2806a522832e" containerName="cassandra"
Jul 10 13:52:17 kworker4 systemd[1]: Created slice libcontainer container kubepods-burstable-pod45d3f529_6498_42a7_a26c_b255967b2533.slice.
Jul 10 13:52:17 kworker4 kubelet[2861]: I0710 13:52:17.519739    2861 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"empty-dir\" (UniqueName: \"kubernetes.io/empty-dir/45d3f529-6498-42a7-a26c-b255967b2533-empty-dir\") pod \"cassandra-0\" (UID: \"45d3f529-6498-42a7-a26c-b255967b2533\") " pod="default/cassandra-0"
Jul 10 13:52:17 kworker4 kubelet[2861]: I0710 13:52:17.519848    2861 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume \"data-cassandra-0\" (UniqueName: \"kubernetes.io/local-volume/data-cassandra-0\") pod \"cassandra-0\" (UID: \"45d3f529-6498-42a7-a26c-b255967b2533\") " pod="default/cassandra-0"
Jul 10 13:52:17 kworker4 kubelet[2861]: I0710 13:52:17.621078    2861 operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"data-cassandra-0\" (UniqueName: \"kubernetes.io/local-volume/data-cassandra-0\") pod \"cassandra-0\" (UID: \"45d3f529-6498-42a7-a26c-b255967b2533\") device mount path \"/mnt/local-persistent-volumes/data-cassandra-0\"" pod="default/cassandra-0"
Jul 10 13:52:17 kworker4 containerd[1054]: time="2024-07-10T13:52:17.946171772Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:cassandra-0,Uid:45d3f529-6498-42a7-a26c-b255967b2533,Namespace:default,Attempt:0,}"
Jul 10 13:52:18 kworker4 networkd-dispatcher[957]: WARNING:Unknown index 39 seen, reloading interface list
Jul 10 13:52:18 kworker4 kernel: [111165.055341] weave: port 2(vethwepl9deef76) entered blocking state
Jul 10 13:52:18 kworker4 kernel: [111165.055350] weave: port 2(vethwepl9deef76) entered disabled state
Jul 10 13:52:18 kworker4 kernel: [111165.055607] device vethwepl9deef76 entered promiscuous mode
Jul 10 13:52:18 kworker4 NetworkManager[938]: <info>  [1720619538.4870] manager: (vethwepg9deef76): new Veth device (/org/freedesktop/NetworkManager/Devices/37)
Jul 10 13:52:18 kworker4 systemd-udevd[293361]: Using default interface naming scheme 'v249'.
Jul 10 13:52:18 kworker4 NetworkManager[938]: <info>  [1720619538.5216] manager: (vethwepl9deef76): new Veth device (/org/freedesktop/NetworkManager/Devices/38)
Jul 10 13:52:18 kworker4 networkd-dispatcher[957]: ERROR:Unknown interface index 39 seen even after reload
Jul 10 13:52:18 kworker4 networkd-dispatcher[957]: WARNING:Unknown index 39 seen, reloading interface list
Jul 10 13:52:18 kworker4 kernel: [111165.097012] eth0: renamed from vethwepg9deef76
Jul 10 13:52:18 kworker4 systemd-networkd[921]: vethwepl9deef76: Link UP
Jul 10 13:52:18 kworker4 networkd-dispatcher[957]: ERROR:Unknown interface index 39 seen even after reload
Jul 10 13:52:18 kworker4 systemd-networkd[921]: vethwepl9deef76: Gained carrier
Jul 10 13:52:18 kworker4 NetworkManager[938]: <info>  [1720619538.5638] device (vethwepl9deef76): carrier: link connected
Jul 10 13:52:18 kworker4 kernel: [111165.132876] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jul 10 13:52:18 kworker4 kernel: [111165.132910] IPv6: ADDRCONF(NETDEV_CHANGE): vethwepl9deef76: link becomes ready
Jul 10 13:52:18 kworker4 kernel: [111165.132944] weave: port 2(vethwepl9deef76) entered blocking state
Jul 10 13:52:18 kworker4 kernel: [111165.132947] weave: port 2(vethwepl9deef76) entered forwarding state
Jul 10 13:52:18 kworker4 containerd[1054]: time="2024-07-10T19:22:18.687796519+05:30" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Jul 10 13:52:18 kworker4 containerd[1054]: time="2024-07-10T19:22:18.687933851+05:30" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Jul 10 13:52:18 kworker4 containerd[1054]: time="2024-07-10T19:22:18.687970510+05:30" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jul 10 13:52:18 kworker4 containerd[1054]: time="2024-07-10T19:22:18.688176283+05:30" level=info msg="loading plugin \"io.containerd.ttrpc.v1.pause\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jul 10 13:52:18 kworker4 systemd[1]: Started libcontainer container 9deef764a0e07a31d7812c23efa211c36d9a74b470c5c07b9b6e0586dade5750.
Jul 10 13:52:18 kworker4 containerd[1054]: time="2024-07-10T13:52:18.770045331Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:cassandra-0,Uid:45d3f529-6498-42a7-a26c-b255967b2533,Namespace:default,Attempt:0,} returns sandbox id \"9deef764a0e07a31d7812c23efa211c36d9a74b470c5c07b9b6e0586dade5750\""
Jul 10 13:52:18 kworker4 containerd[1054]: time="2024-07-10T13:52:18.790190901Z" level=info msg="CreateContainer within sandbox \"9deef764a0e07a31d7812c23efa211c36d9a74b470c5c07b9b6e0586dade5750\" for container &ContainerMetadata{Name:cassandra,Attempt:0,}"
Jul 10 13:52:19 kworker4 containerd[1054]: time="2024-07-10T13:52:19.007961508Z" level=info msg="CreateContainer within sandbox \"9deef764a0e07a31d7812c23efa211c36d9a74b470c5c07b9b6e0586dade5750\" for &ContainerMetadata{Name:cassandra,Attempt:0,} returns container id \"2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752\""
Jul 10 13:52:19 kworker4 containerd[1054]: time="2024-07-10T13:52:19.008715302Z" level=info msg="StartContainer for \"2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752\""
Jul 10 13:52:19 kworker4 systemd[1]: Started libcontainer container 2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.
Jul 10 13:52:19 kworker4 containerd[1054]: time="2024-07-10T13:52:19.141433055Z" level=info msg="StartContainer for \"2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752\" returns successfully"
Jul 10 13:52:19 kworker4 avahi-daemon[935]: Joining mDNS multicast group on interface vethwepl9deef76.IPv6 with address fe80::5cae:7cff:fe8b:e880.
Jul 10 13:52:19 kworker4 avahi-daemon[935]: New relevant interface vethwepl9deef76.IPv6 for mDNS.
Jul 10 13:52:19 kworker4 systemd-networkd[921]: vethwepl9deef76: Gained IPv6LL
Jul 10 13:52:19 kworker4 avahi-daemon[935]: Registering new address record for fe80::5cae:7cff:fe8b:e880 on vethwepl9deef76.*.
Jul 10 13:52:30 kworker4 systemd[1]: cri-containerd-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.scope: A process of this unit has been killed by the OOM killer.
Jul 10 13:52:30 kworker4 kernel: [111176.800862] java invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=869
Jul 10 13:52:30 kworker4 kernel: [111176.800866] CPU: 3 PID: 293719 Comm: java Not tainted 5.15.0-113-generic #123-Ubuntu
Jul 10 13:52:30 kworker4 kernel: [111176.800868] Hardware name: LENOVO 20Q6S2WR01/20Q6S2WR01, BIOS R0ZET47W (1.25 ) 01/27/2021
Jul 10 13:52:30 kworker4 kernel: [111176.800869] Call Trace:
Jul 10 13:52:30 kworker4 kernel: [111176.800870]  <TASK>
Jul 10 13:52:30 kworker4 kernel: [111176.800872]  show_stack+0x52/0x5c
Jul 10 13:52:30 kworker4 kernel: [111176.800875]  dump_stack_lvl+0x4a/0x63
Jul 10 13:52:30 kworker4 kernel: [111176.800878]  dump_stack+0x10/0x16
Jul 10 13:52:30 kworker4 kernel: [111176.800878]  dump_header+0x53/0x228
Jul 10 13:52:30 kworker4 kernel: [111176.800881]  oom_kill_process.cold+0xb/0x10
Jul 10 13:52:30 kworker4 kernel: [111176.800882]  out_of_memory+0x106/0x2e0
Jul 10 13:52:30 kworker4 kernel: [111176.800885]  mem_cgroup_out_of_memory+0x13f/0x160
Jul 10 13:52:30 kworker4 kernel: [111176.800888]  try_charge_memcg+0x687/0x740
Jul 10 13:52:30 kworker4 kernel: [111176.800889]  charge_memcg+0x45/0xb0
Jul 10 13:52:30 kworker4 kernel: [111176.800890]  __mem_cgroup_charge+0x2d/0x90
Jul 10 13:52:30 kworker4 kernel: [111176.800892]  do_anonymous_page+0x114/0x3c0
Jul 10 13:52:30 kworker4 kernel: [111176.800893]  handle_pte_fault+0x20a/0x240
Jul 10 13:52:30 kworker4 kernel: [111176.800895]  __handle_mm_fault+0x405/0x6f0
Jul 10 13:52:30 kworker4 kernel: [111176.800897]  handle_mm_fault+0xd8/0x2c0
Jul 10 13:52:30 kworker4 kernel: [111176.800898]  do_user_addr_fault+0x1c9/0x670
Jul 10 13:52:30 kworker4 kernel: [111176.800901]  exc_page_fault+0x77/0x170
Jul 10 13:52:30 kworker4 kernel: [111176.800902]  asm_exc_page_fault+0x27/0x30
Jul 10 13:52:30 kworker4 kernel: [111176.800905] RIP: 0033:0x7ff983e72220
Jul 10 13:52:30 kworker4 kernel: [111176.800907] Code: 03 48 8d 05 96 57 8f 00 8b 13 89 10 e9 6e ff ff ff 90 0f 1f 84 00 00 00 00 00 55 48 39 f7 48 89 e5 73 12 0f 1f 80 00 00 00 00 <c6> 07 00 48 01 d7 48 39 fe 77 f5 5d c3 90 66 90 55 45 0f b6 c9 48
Jul 10 13:52:30 kworker4 kernel: [111176.800908] RSP: 002b:00007ff9831fdc00 EFLAGS: 00010206
Jul 10 13:52:30 kworker4 kernel: [111176.800909] RAX: 0000000000001000 RBX: 00007ff982c8f638 RCX: 00007ff9853208a3
Jul 10 13:52:30 kworker4 kernel: [111176.800911] RDX: 0000000000001000 RSI: 0000000800000000 RDI: 00000007bee67000
Jul 10 13:52:30 kworker4 kernel: [111176.800911] RBP: 00007ff9831fdc00 R08: 00000000ffffffff R09: 00000000f0000000
Jul 10 13:52:30 kworker4 kernel: [111176.800912] R10: 0000000000000032 R11: 0000000000000246 R12: 0000000710000000
Jul 10 13:52:30 kworker4 kernel: [111176.800913] R13: 00000000f0000000 R14: 0000000000000000 R15: 0000000800000000
Jul 10 13:52:30 kworker4 kernel: [111176.800915]  </TASK>
Jul 10 13:52:30 kworker4 kernel: [111176.800916] memory: usage 3145728kB, limit 3145728kB, failcnt 56
Jul 10 13:52:30 kworker4 kernel: [111176.800917] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Jul 10 13:52:30 kworker4 kernel: [111176.800918] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod45d3f529_6498_42a7_a26c_b255967b2533.slice:
Jul 10 13:52:30 kworker4 kernel: [111176.800926] anon 3213541376
Jul 10 13:52:30 kworker4 kernel: [111176.800926] file 90112
Jul 10 13:52:30 kworker4 kernel: [111176.800926] kernel_stack 65536
Jul 10 13:52:30 kworker4 kernel: [111176.800926] pagetables 6504448
Jul 10 13:52:30 kworker4 kernel: [111176.800926] percpu 47296
Jul 10 13:52:30 kworker4 kernel: [111176.800926] sock 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] shmem 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] file_mapped 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] file_dirty 90112
Jul 10 13:52:30 kworker4 kernel: [111176.800926] file_writeback 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] swapcached 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] anon_thp 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] file_thp 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] shmem_thp 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] inactive_anon 3213463552
Jul 10 13:52:30 kworker4 kernel: [111176.800926] active_anon 12288
Jul 10 13:52:30 kworker4 kernel: [111176.800926] inactive_file 49152
Jul 10 13:52:30 kworker4 kernel: [111176.800926] active_file 40960
Jul 10 13:52:30 kworker4 kernel: [111176.800926] unevictable 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] slab_reclaimable 736952
Jul 10 13:52:30 kworker4 kernel: [111176.800926] slab_unreclaimable 146752
Jul 10 13:52:30 kworker4 kernel: [111176.800926] slab 883704
Jul 10 13:52:30 kworker4 kernel: [111176.800926] workingset_refault_anon 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] workingset_refault_file 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] workingset_activate_anon 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] workingset_activate_file 0
Jul 10 13:52:30 kworker4 kernel: [111176.800926] workingset_restore_anon 0
Jul 10 13:52:30 kworker4 kernel: [111176.800927] Tasks state (memory values in pages):
Jul 10 13:52:30 kworker4 kernel: [111176.800928] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul 10 13:52:30 kworker4 kernel: [111176.800928] [ 293427] 65535 293427      249        1    28672        0          -998 pause
Jul 10 13:52:30 kworker4 kernel: [111176.800931] [ 293470]  1001 293470  1164985   787981  6496256        0           869 java
Jul 10 13:52:30 kworker4 kernel: [111176.800933] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod45d3f529_6498_42a7_a26c_b255967b2533.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod45d3f529_6498_42a7_a26c_b255967b2533.slice/cri-containerd-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.scope,task=java,pid=293470,uid=1001
Jul 10 13:52:30 kworker4 kernel: [111176.800941] Memory cgroup out of memory: Killed process 293470 (java) total-vm:4659940kB, anon-rss:3138004kB, file-rss:13920kB, shmem-rss:0kB, UID:1001 pgtables:6344kB oom_score_adj:869
Jul 10 13:52:30 kworker4 kernel: [111176.800991] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod45d3f529_6498_42a7_a26c_b255967b2533.slice/cri-containerd-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.scope are going to be killed due to memory.oom.group set
Jul 10 13:52:30 kworker4 kernel: [111176.800995] Memory cgroup out of memory: Killed process 293719 (java) total-vm:4659940kB, anon-rss:3138176kB, file-rss:14392kB, shmem-rss:0kB, UID:1001 pgtables:6344kB oom_score_adj:869
Jul 10 13:52:30 kworker4 containerd[1054]: time="2024-07-10T13:52:30.229521112Z" level=info msg="TaskOOM event container_id:\"2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752\""
Jul 10 13:52:30 kworker4 systemd[1]: cri-containerd-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.scope: Deactivated successfully.
Jul 10 13:52:30 kworker4 systemd[1]: cri-containerd-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752.scope: Consumed 1.333s CPU time.
Jul 10 13:52:30 kworker4 systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752-rootfs.mount: Deactivated successfully.
Jul 10 13:52:30 kworker4 containerd[1054]: time="2024-07-10T13:52:30.859741550Z" level=info msg="shim disconnected" id=2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752 namespace=k8s.io
Jul 10 13:52:30 kworker4 containerd[1054]: time="2024-07-10T13:52:30.859867500Z" level=warning msg="cleaning up after shim disconnected" id=2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752 namespace=k8s.io
Jul 10 13:52:30 kworker4 containerd[1054]: time="2024-07-10T13:52:30.859893980Z" level=info msg="cleaning up dead shim" namespace=k8s.io
Jul 10 13:52:31 kworker4 kubelet[2861]: I0710 13:52:31.580868    2861 scope.go:117] "RemoveContainer" containerID="2cb8ea27c355aae5a717ce8c7ffab53f627ac0ce73f7349a0a0c009c6bb80752"

Solution

  • The following events from the kubectl describe po cassandra-0 command show that the pod is able to start, but the container keeps failing and being restarted:

    Normal Started 15s (x2 over 28s) kubelet Started container cassandra 
    Warning BackOff 3s kubelet Back-off restarting failed container cassandra in pod cassandra-0_default
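
    You can also confirm from the Kubernetes side that the crash is an OOM kill; the container's last termination state should typically report OOMKilled (container index 0 is assumed here to be the cassandra container):

    kubectl get pod cassandra-0 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'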
    

    This error is caused by a mismatch between maxHeapSize and the memory resources allocated to the pod. The kernel log shows the container's cgroup limit is 3145728 kB (3 GiB) and usage hit that limit, so a 4096M JVM heap plus Cassandra's off-heap memory cannot fit inside it; the OOM killer then terminates the java process, no matter how much free RAM the node itself has.
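
    You can check the requests and limits the chart actually applied with the following (this prints the resources of the first container, assumed here to be cassandra):

    kubectl get pod cassandra-0 -o jsonpath='{.spec.containers[0].resources}'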

    Please edit your Helm values file and allocate the following resources, so that the memory limit comfortably exceeds the JVM heap plus Cassandra's off-heap usage:

    resources:
      requests:
        cpu: 2
        memory: 5Gi
      limits:
        cpu: 3
        memory: 6Gi
    

    This should provide enough memory for the Cassandra pod to start with maxHeapSize: "4096M".
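
    Then roll out the change with something like the following, using your actual release name (cassandra is assumed here):

    helm upgrade cassandra bitnami/cassandra -f values.yaml

    The StatefulSet will recreate cassandra-0 with the new limits; you can watch it come back up with kubectl get pod cassandra-0 -w.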