error-handling vulkan

Should I expect "device lost" conditions as normal under Vulkan?


I'm asking because I wonder how robust I should make my programs against device losses.

Should I only expect devices to be lost in the case of, say, hardware errors, driver bugs, improper API usage or non-terminating shader programs; or should I also expect device loss in such cases as, say, suspending and resuming my laptop, minimizing the application window, or just randomly because the implementation felt like it?


Solution

  • It's unfortunately going to vary by GPU, driver, and OS, which leads to the somewhat vague spec wording that krOoze quoted:

    A logical device may become lost because of hardware errors, execution timeouts, power management events and/or platform-specific events.

    For reference, there is nothing in the Android OS itself that would require a device loss; for example, it doesn't force a device-lost error when an app goes into the background or the screen is turned off.

    But some driver/hardware combinations will likely report a device-lost error after a GPU exception (or reset), unless the driver can guarantee that nothing from your VkDevice could have been affected. That's a surprisingly difficult guarantee to make. For example, even if your queues weren't running when the problem occurred, some of your data might still have been in dirty cache lines; if the reset invalidates those lines instead of writing them back to memory, your data will be corrupted. An exception/reset can be caused by hardware or driver bugs, or by any app on the system hitting a watchdog timeout (an infinite loop in a shader is the easy example, but a shader that makes progress and simply takes too long can trigger it too).

    In practice, these should be fairly rare events, and I believe (without data) that these days they're primarily caused by hotplug (rare) or misbehaving hardware/driver/app events rather than by more routine things like device sleep.

    Since testing your recovery code is going to be difficult, and it will therefore likely be buggy, my recommendation would be to do something heavy-handed but simple, like saving application state and either restarting your app automatically or quitting and asking the user to restart. Depending on what you're building, it might be reasonable to do something more sophisticated, like tearing down, restarting, and restoring your renderer system without taking down the rest of your app.
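
The recovery strategy above could be sketched roughly as follows, in C. This is only an illustration of the decision logic, not real Vulkan code: `FrameResult`, `RecoveryAction`, `RecoveryPolicy` and `decide_recovery` are hypothetical names invented for this example. In a real application, `FRAME_DEVICE_LOST` would be set when a call such as `vkQueueSubmit` or `vkWaitForFences` returns `VK_ERROR_DEVICE_LOST`. Treating every other error as fatal is also just an assumption made to keep the sketch short.

```c
/* Hypothetical stand-in for the outcome of rendering one frame.
 * In a real app this would be derived from the VkResult of calls
 * such as vkQueueSubmit or vkWaitForFences. */
typedef enum {
    FRAME_OK,
    FRAME_DEVICE_LOST,   /* would correspond to VK_ERROR_DEVICE_LOST */
    FRAME_OTHER_ERROR
} FrameResult;

/* What the application should do next (hypothetical names). */
typedef enum {
    ACTION_CONTINUE,          /* keep rendering */
    ACTION_REBUILD_RENDERER,  /* tear down and recreate the renderer */
    ACTION_SAVE_AND_RESTART   /* save state, then restart or quit */
} RecoveryAction;

typedef struct {
    int rebuilds_attempted;  /* consecutive rebuilds since the last good frame */
    int max_rebuilds;        /* 0 selects the simple "always restart" policy */
} RecoveryPolicy;

/* Decide how to react to the result of the last frame. */
RecoveryAction decide_recovery(RecoveryPolicy *policy, FrameResult result)
{
    switch (result) {
    case FRAME_OK:
        /* A good frame means any earlier rebuild worked; reset the counter. */
        policy->rebuilds_attempted = 0;
        return ACTION_CONTINUE;
    case FRAME_DEVICE_LOST:
        /* Try the more sophisticated path a bounded number of times. */
        if (policy->rebuilds_attempted < policy->max_rebuilds) {
            policy->rebuilds_attempted++;
            return ACTION_REBUILD_RENDERER;
        }
        /* Otherwise fall back to the heavy-handed but simple option. */
        return ACTION_SAVE_AND_RESTART;
    default:
        /* Assumption for this sketch: other errors are treated as fatal. */
        return ACTION_SAVE_AND_RESTART;
    }
}
```

Setting `max_rebuilds` to 0 gives the simple save-and-restart behaviour. A small positive value allows a few renderer rebuilds before giving up, which avoids looping forever if the device keeps getting lost.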