reboot suspend power-off

NVIDIA DevBox with Ubuntu 16.04 and 4.4.0-137-generic kernel randomly reboots and automatically shuts down overnight


I've recently started using an NVIDIA DevBox that has an ASUS BIOS, with the kernel and Ubuntu versions mentioned above. For some reason the machine can't be left on overnight the way other laptops and desktops usually can: normally you leave the machine on, it locks itself after a couple of minutes and/or goes into sleep mode, and the next day, once you move the mouse or press a key, it wakes up with all your programs running just as you left them the previous day.

For some strange reason, this hasn't been happening with this machine. There was a previous user who hasn't touched the machine in about a year, so it is possible that he/she changed some power-saving configuration, but everything looks fine when I check the power options (I have suspend set to 1 hour and lock set to 1 hour). The odd thing I've noticed is that if I come back after lunch and the machine is locked/suspended, it resumes the session without any problems, but if I leave it overnight, I arrive the next day to find that the machine has turned itself off. The building is locked, so nobody could have physically pressed the power button overnight. I've also checked the other user's shell history (we both have admin privileges, and he doesn't use the computer) for remotely issued shutdown commands, and nothing shows up there either.

I've read in a couple of places that it could be a heating issue caused by a poor or broken power supply, but how can I check whether this is the case? I have the psensor app, but it only seems to show temperatures in real time without saving them to a file where I could later check what the temperature of any of the graphics cards (there are 4) or of the motherboard was.
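One way to get a persistent temperature record would be a small script that appends timestamped readings to a file, so that the last samples before a shutdown can be inspected the next morning. The following is only a minimal sketch: the function name log_temps and the default log path are made up for illustration, and it assumes nvidia-smi (shipped with the NVIDIA driver) and sensors (from the lm-sensors package) are available, falling back to a note in the log when they are not.

```shell
#!/bin/sh
# Append one timestamped temperature sample to a log file.
# log_temps and the default path are illustrative names, not part of any tool.
log_temps() {
    logfile="${1:-$HOME/temps.log}"
    now="$(date '+%Y-%m-%d %H:%M:%S')"

    # Marker line so every sample is visible even if no tool reports anything.
    echo "$now --- sample ---" >> "$logfile"

    # GPU temperatures: one "index, temperature" line per card.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader 2>/dev/null |
            while IFS= read -r line; do
                echo "$now gpu $line" >> "$logfile"
            done
    else
        echo "$now gpu nvidia-smi not found" >> "$logfile"
    fi

    # CPU / motherboard readings as reported by lm-sensors.
    if command -v sensors >/dev/null 2>&1; then
        sensors 2>/dev/null | grep -v '^$' |
            while IFS= read -r line; do
                echo "$now sensors $line" >> "$logfile"
            done
    else
        echo "$now sensors command not found" >> "$logfile"
    fi

    # Flush to disk so the last sample survives an abrupt power loss.
    sync
}

# Example: take a sample every 60 seconds until interrupted.
# while true; do log_temps "$HOME/temps.log"; sleep 60; done
```

If the machine is off in the morning, the final entries in the log show how hot the cards and the board were shortly before it went down. Steadily climbing values would point towards heat, while normal values followed by an abrupt end of the log would fit a power problem better.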

What is another way to diagnose the automatic shutdown of the machine? How can I know whether it's a heating issue or a faulty power supply? Or potentially a kernel issue? The machine has no demanding programs installed for now (it's almost new) except for the NVIDIA drivers, which I'm quite experienced with installing, so maybe I should consider a fresh Ubuntu install? Though that is pretty much pointless if there is a hardware issue.
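A first step that needs no extra tools is to check whether the system logs recorded an orderly shutdown or simply stop. The sketch below makes some assumptions: the function names boot_history and scan_logs are made up, and the keyword list is only a guess at messages worth looking for (thermal trips, power-key events, suspend attempts), not an exhaustive set.

```shell
#!/bin/sh
# Show recent shutdown/reboot records from wtmp. A "reboot" entry that is
# not preceded by a "shutdown" entry suggests the machine lost power or
# crashed instead of shutting down cleanly.
boot_history() {
    if command -v last >/dev/null 2>&1; then
        last -x shutdown reboot 2>/dev/null | head -n 20
    fi
    return 0
}

# Search the given log files for lines hinting at why the machine went down.
# The pattern is an illustrative guess; extend it as needed.
scan_logs() {
    pattern='critical temperature|temperature above threshold|power key pressed|PM: suspend|shutting down'
    for f in "$@"; do
        [ -r "$f" ] || continue          # skip missing or unreadable files
        grep -i -E -H "$pattern" "$f"
    done
    return 0
}

# Example usage on Ubuntu 16.04 (reading these files may require sudo):
# boot_history
# scan_logs /var/log/syslog /var/log/syslog.1 /var/log/kern.log
```

If the last lines logged before the gap look completely normal and there is no shutdown record, a power cut or hardware fault is more likely than a software-initiated power-off. A logged thermal message or a suspend attempt just before the gap would point in the other direction.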

Other details:

The NVIDIA drivers are correctly installed. At one point the driver broke and the machine behaved badly after I forced the following command: it stayed on for 2 consecutive days (which should be a breeze for these machines), but then, after 2 consecutive random reboots in the middle of the night, it had a hard time staying on for more than 5 minutes:

$ unset autologoff

I later had to reinstall the drivers correctly (and set the autolog option back on), and the system went back to its current state, where it "needs" to shut itself off if it's not doing anything for more than 24 hours (not doing anything meaning that it is not receiving human input, although background processes may still be running).

I added pci=noaer to the kernel boot parameters after finding out that the machine was giving me this error: https://askubuntu.com/questions/771899/pcie-bus-error-severity-corrected

Output of:

$ cat /proc/cmdline

is

BOOT_IMAGE=/boot/vmlinuz-4.4.0-137-generic.efi.signed root=UUID=569dd2ad-c5a6-4ae4-a167-f849b8f6ae9e ro quiet splash pci=noaer vt.handoff=7

Solution

  • The problem was fixed by upgrading the system to Ubuntu 18.04. The root cause of the bug was never found, but I suspect it had to do with the kernel not being a good match for the graphics cards, the BIOS and the Ubuntu 16.04 release.