r, memory, crash, cluster-analysis, intermittent

unpredictable memory usage by same operations with same data in R


I'm having an issue where an R function (NbClust) crashes R, but at different points on different runs with the same data. According to journalctl, the crashes are all due to memory issues. For example:

Sep 04 02:00:56 Q35 kernel: [   7608]  1000  7608 11071962 10836497 87408640        0             0 rsession
Sep 04 02:00:56 Q35 kernel: Out of memory: Kill process 7608 (rsession) score 655 or sacrifice child
Sep 04 02:00:56 Q35 kernel: Killed process 7608 (rsession) total-vm:44287848kB, anon-rss:43345988kB, file-rss:0kB, shmem-rss:0kB
Sep 04 02:00:56 Q35 kernel: oom_reaper: reaped process 7608 (rsession), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I have been testing my code to figure out which lines are causing the memory errors, and it turns out that the failing line varies, even with the same data. Aside from wanting to solve it, I am confused as to why this is an intermittent problem. If an object is too big to fit in memory, it should be a problem every time I run it given the same resources, right?

The amount of memory being used by other processes was not dramatically different between runs, and I always started from a clean environment. When I look at top, I always have memory to spare (although I am rarely looking at the exact moment of the crash). I've tried reducing the memory load by removing unneeded objects and running regular garbage collection, but this has had no discernible effect.

For example, when running NbClust, sometimes the crash occurs while running length(eigen(TT)$value); other times it happens during a call to hclust. Sometimes it doesn't crash at all and instead exits with a comparatively graceful "cannot allocate vector of size" error. Aside from any suggestions about reducing memory load, I want to know why I am running out of memory some of the time but not always.

Edit: After changing all uses of hclust to hclust.vector, I have not had any more crashes during the hierarchical clustering steps. However, crashes still occur at varying points (often during calls to eigen()). If I could reliably predict (within a margin of error) how much memory each line of my code was going to use, that would be great.
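For what it's worth, R's own garbage collector can give a rough per-step memory estimate. A minimal sketch (the matrix here is a stand-in for your data; the real peak may be higher because compiled code such as the LAPACK routines behind eigen() allocates workspace outside R's heap):

```r
## Reset R's "max used" counters, run one step, then inspect the peak.
m <- matrix(rnorm(1000 * 1000), nrow = 1000)  # placeholder data

gc(reset = TRUE)                               # zero the "max used" columns
ev <- eigen(crossprod(m), symmetric = TRUE, only.values = TRUE)
print(gc())                                    # "max used" shows the peak since the reset
```

This only bounds allocations R itself tracks, so treat it as a lower bound on what each line really needs.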


Solution

  • Modern memory management is not nearly as deterministic as you seem to think it is.

    If you want more reproducible results, get rid of any garbage collection and any parallelism (in particular, garbage collection running in parallel with your program!), and make sure the process's memory is capped at a value well below your free system memory.
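    One way to impose such a cap, sketched here with an arbitrary 30 GB figure (pick something below your free RAM), is to limit the shell's virtual address space before launching R. Allocation failures then surface as R-level "cannot allocate vector" errors instead of a kernel OOM kill:

    ```shell
    # Cap virtual memory for this shell and its children (value in kB).
    ulimit -v 30000000
    ulimit -v            # verify the limit took effect
    # R --no-save        # then start R / rsession from this shell
    ```

    Note the limit cannot be raised again within the same unprivileged shell, so start a fresh shell per experiment.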

    The kernel OOM killer is a measure of last resort: it fires when the kernel has overcommitted memory (you may want to read up on what that means), has completely run out of swap, and cannot fulfill its promises.

    The kernel can hand out memory that doesn't need to physically exist until it is first accessed. Hence the OOM killer can strike not at allocation time, but later, when a page is actually touched.
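    You can inspect (and, if you prefer up-front failures, change) the kernel's overcommit policy; this is a Linux-specific sketch:

    ```shell
    # 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
    cat /proc/sys/vm/overcommit_memory
    # Under strict accounting, oversized allocations fail immediately instead of
    # triggering the OOM killer later:
    #   sudo sysctl vm.overcommit_memory=2
    ```

    With mode 2, R would see "cannot allocate vector of size" at allocation time rather than being killed mid-computation.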