Several days ago I faced a problem: my nodes were constantly rebooting.
My stack:
A k8s cluster with 1 master and 2 workers, built with kubeadm (v1.17.1-00)
Ubuntu 18.04 x86_64 4.15.0-74-generic
Flannel CNI plugin (v0.11.0)
Rook (v1.2) CephFS for storage. Ceph was deployed in the same cluster where my application lives.
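In case it helps, these are standard commands to confirm which versions a cluster and its nodes are running; nothing here is specific to my setup:

# Client and server Kubernetes versions
kubectl version --short
# Kernel version and container runtime per node (KERNEL-VERSION column)
kubectl get nodes -o wide
# Kernel on the node itself (run over ssh), e.g. 4.15.0-74-generic
uname -r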
I was able to get the Ceph cluster running, but when I tried to deploy my application, which used the Rook volumes, my pods suddenly started to die.
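For context, the application consumed the Rook storage through a regular PersistentVolumeClaim. A minimal sketch of such a claim, assuming the rook-cephfs StorageClass name from the Rook CephFS examples (the claim name and size here are made up):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                  # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany               # CephFS supports shared read-write access
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-cephfs   # assumption: the StorageClass from the Rook CephFS example
EOF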
This is the message I got when I ran kubectl describe pod <name>:
Pod sandbox changed, it will be killed and re-created
In the k8s events I got:
<Node name> has been rebooted
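You can pull the same events from the cluster with a standard kubectl call; sorting is optional, it just makes the reboot entries easier to spot:

# List events across all namespaces, oldest first, to find the "has been rebooted" entries
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp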
After some time the node comes back to life, but it dies again within 2-3 minutes.
I tried to drain the node and rejoin it to the cluster, but after that another node started hitting the same error.
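For reference, draining and rejoining a worker looked roughly like this; <node-name>, <master-ip>, <token> and <hash> are placeholders, and a fresh join command can be printed on the master with kubeadm token create --print-join-command:

# On the master: evict workloads from the broken node and remove it
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
kubectl delete node <node-name>
# On the worker: reset and rejoin the cluster
sudo kubeadm reset
sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>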
I looked into the system error logs on the failed node with journalctl -p 3 and found that they were flooded with this message:
kernel: cache_from_obj: Wrong slab cache. inode_cache but object is from ceph_inode_info
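To check whether a node is hitting the same bug, grepping the kernel log for that message is enough:

# Kernel messages from the systemd journal
journalctl -k | grep -i cache_from_obj
# Or straight from the kernel ring buffer
dmesg | grep -i ceph_inode_info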
After googling this problem, I found this issue: https://github.com/coreos/bugs/issues/2616
It turned out that cephfs just doesn't work with some versions of the Linux kernel, and none of the fixes I tried at first helped.
The solution: upgrade your kernel. I finally got it working on Ubuntu 18.04 x86_64 with 5.0.0-38-generic.
The GitHub issue that helped me: https://github.com/coreos/bugs/issues/2616
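On Ubuntu 18.04 the 5.0 kernels ship through the HWE stack, so the upgrade is roughly the following (I'm assuming the stock linux-generic-hwe-18.04 metapackage; verify with uname -r after the reboot):

# Install the hardware-enablement kernel and reboot into it
sudo apt-get update
sudo apt-get install --install-recommends linux-generic-hwe-18.04
sudo reboot
# After the reboot, confirm the kernel, e.g. 5.0.0-38-generic
uname -r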
This is indeed a tricky issue; I struggled to find a solution and spent a lot of time trying to understand what was happening. I hope this information helps someone, because there is not much about it on Google.