kubernetes, prometheus, kubernetes-pod, cadvisor

container_memory_rss relation with node memory used


I'm trying to make sense of container_memory_rss or container_memory_working_set_bytes with respect to node_memory_used, i.e. (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes).

Here is what I mean:

PROMQL 1:

sum(container_memory_rss) by (instance) / 1024 / 1024 / 1024

{instance="172.19.51.8:10250"}        7.537441253662109

PROMQL 2:

sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) by (instance) / 1024 / 1024 / 1024

{instance="172.19.51.8:9100"}         2.2688369750976562

PROMQL 3:

sum(container_memory_working_set_bytes) by (instance) / 1024 / 1024 / 1024

{instance="172.19.51.8:10250"}        9.285114288330078

PROMQL 4:

sum(node_memory_MemAvailable_bytes) by (instance) / 1024 / 1024 / 1024

{instance="172.19.51.8:9100"}         13.356605529785156

So, given that a Pod always runs on a Node, I fail to understand why container_memory_rss or container_memory_working_set_bytes is higher than node_memory_used,

i.e. the values for PROMQL 1 and PROMQL 3 are way higher than the value of PROMQL 2, the memory used by the node.

Am I wrong, or shouldn't the pod/container RSS always be <= the node's memory used (even if no default resource limit is set)?


Solution

  • tl;dr

    Use a container name filter (container!="") to exclude the cgroup totals:

    sum(container_memory_rss{container!=""}) by (instance) / 2^30
    
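    The same filter works for container_memory_working_set_bytes from PROMQL 3. A hedged example, assuming a reasonably recent kubelet/cAdvisor where the labels are named pod and container (older versions expose pod_name and container_name instead):

    sum(container_memory_working_set_bytes{container!=""}) by (instance) / 2^30
    
    # or broken down per pod rather than per node
    sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod) / 2^30
    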

    Explanation

    If you ran the first query grouping results by container name, you would have noticed that most of the usage comes from a container without a name:

    sort_desc(sum(container_memory_rss{instance="ip-192-168-104-46"}) by (name)) / 2^30
    
    {}                          3.9971389770507812
    {name="prometheus"}         0.6084518432617188
    {name="cluster-autoscaler"} 0.04230499267578125
    

    Actually, there are several entries without a name, but they all have an id:

    sort_desc(sum(container_memory_rss{instance="ip-192-168-104-46"}) by (id)) / 2^30
    
    # these do not have a container name
    {id="/"}                                1.1889266967773438
    {id="/kubepods"}                        0.900482177734375
    {id="/kubepods/burstable"}              0.6727218627929688
    {id="/system.slice/docker.service"}     0.07495498657226562
    {id="/system.slice/kubelet.service"}    0.060611724853515625
    
    # and this is an example id of a real container which has a name label
    {id="/kubepods/burstable/pod562495f9-afa6-427e-8435-016c2b500c74/e73975d90b66772e2e17ab14c473a2d058c0b9ffecc505739ee1a94032728a78"} 0.6027107238769531
    

    These are the accumulated values for each cgroup. cAdvisor takes its stats from cgroups, and if you look at them, you will find familiar entities:

    # systemd-cgls -a
    ├─kubepods
    │ ├─podc7dfcc4e-74fc-4469-ad56-c13fe5a9e7d8
    │ │ ├─61a1a58e47968e7595f3458a6ded74f9088789a865bda2be431b8c8b07da1c6e
    │ │ └─d47601e38a96076dd6e0205f57b0c365d4473cb6051eb0f0e995afb31143279b
    │ ├─podfde9b8ca-ce80-4467-ba05-03f02a14d569
    │ │ ├─9d3783df65085d54028e2303ccb2e143fecddfb85d7df4467996e82691892176
    │ │ └─47702b7977bed65ddc86de92475be8f93b50b06ae8bd99bae9710f0b6f63d8f6
    │ ├─burstable
    │ │ ├─pod9ff634a5-fd2a-42e2-be27-7e1028e96b67
    │ │ │ ├─5fa225aad10bdc1be372859697f53d5517ad28c565c6f1536501543a071cdefc
    │ │ │ └─27402fed2e4bb650a6fc41ba073f9994a3fc24782ee366fb8b93a6fd939ba4d3
    

    If you sum up all the direct children of, say, kubepods, you will get the same value that kubepods itself has. Because of these totals, sum(container_memory_rss) by (instance) reports several times the actual resource utilisation; see the sketch below.
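
    A quick way to verify the double counting, assuming a cgroupfs-style hierarchy where the ids look like /kubepods/... as in the listing above (on nodes using the systemd cgroup driver the ids are /kubepods.slice/... instead):

    # the accumulated value of the /kubepods cgroup itself
    container_memory_rss{id="/kubepods", instance="ip-192-168-104-46"} / 2^30
    
    # the sum of its direct children (QoS classes and guaranteed pods) -- roughly the same number
    sum(container_memory_rss{id=~"/kubepods/[^/]+", instance="ip-192-168-104-46"}) / 2^30
    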

    The solution is just to filter out any values without a container name. You can either do that when querying, as in the example at the top, or configure the scrape job with metric_relabel_configs in Prometheus to drop such series at scrape time.
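
    For the scrape-time approach, a minimal sketch, assuming the cAdvisor metrics come in through a job called kubernetes-cadvisor (the job name and the metric-name regex are assumptions, adjust them to your configuration):

    scrape_configs:
      - job_name: kubernetes-cadvisor
        # ... kubernetes_sd_configs, tls_config and so on ...
        metric_relabel_configs:
          # drop cAdvisor memory series with an empty container label,
          # i.e. the accumulated per-cgroup totals
          - source_labels: [__name__, container]
            regex: 'container_memory_.*;'
            action: drop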