
Vulkan command execution seems to finish only when the corresponding fence object is checked


In my application, the Vulkan validation layers complained that fences and semaphores could not be destroyed because they were still in use.
The message was "somewhat weird". For example, it claimed:

Validation Error: [ VUID-vkDestroyFence-fence-01120 ] | MessageID = 0x5d296248 | vkDestroyFence(): fence can't be called on VkFence 0xcb3ee80000000007[] that is currently in use by VkFence 0xcb3ee80000000007[]

(basically telling me the fence was "in use by itself"). I assume this is just an odd way of formatting the message and has no "deeper" meaning. Essentially, the fence is simply still in use.
The exact same message was reported for "all" of my fences and binary semaphores (one fence and two semaphores; a single queue was used for "everything").

After some debugging I figured out that this only happened if the cleanup neither checked the fence for the submitted commands nor waited for the queue to become idle.
Either waiting for the fence or simply waiting for the queue to become idle was sufficient.

But this could not be a "pure timing issue" of destroying the objects while the issued commands were still running: both the command submission and the destruction were triggered manually, with a delay far beyond anything that matters to a GPU.
The submitted commands had to be "long done" when the destruction occurred (I am not "that fast"...😂).

I then found out that simply checking the status of the fence via vkGetFenceStatus() already fixed the problem, i.e. calling vkGetFenceStatus() before the destruction made the validation layer stay silent.

This seems to indicate some kind of "lazy" evaluation on the GPU: the submitted commands are not considered finished as long as the CPU "does not ask", either directly by checking the fence or indirectly by waiting for the queue to become idle.

I feel a little reminded of Schrödinger's cat, which only dies once you take a look...

Is this behavior to be expected?
I could not find anything about it in the official documentation?!


Solution

  • Validation layers are not "the GPU"; they're external code which you use to ensure that you are using the Vulkan API correctly.

    The API says that you cannot destroy a fence while it is in use. This means that, in order for your destruction call to be valid, your code must know that the fence is no longer in use.

    As such, the validation layer is programmed so that you must do something to indicate that your code knows the fence is no longer in use. And doing other stuff for some period of time isn't it. You believe that you've spent enough CPU time for the GPU to be finished with that batch. But you don't know that it's finished.

    And that's what the validation layer cares about. It wants your code to be certainly correct, not probably correct.

    As such, the validation layer requires that you do something with the Vulkan API that synchronizes with the completion of the batch containing the fence. Checking the fence's status is one of those things, and it's generally the one you should use.
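To make the point above concrete, here is a toy model in C++ of the bookkeeping the answer describes. It is a simplified sketch, assuming a single queue and a single batch as in the question. `ToyLayerTracker` and all of its member functions are hypothetical names that merely stand in for the real calls (vkQueueSubmit, vkGetFenceStatus, vkWaitForFences/vkQueueWaitIdle, vkDestroyFence/vkDestroySemaphore). It does not reproduce the actual validation layer code. The key property is that the GPU finishing the work is invisible to the tracker until the application performs a call that synchronizes with it.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model, NOT the real Khronos validation layer implementation.
// Objects submitted with a batch stay "in use" for the tracker until the
// application observes completion of that batch. Elapsed time plays no role.
class ToyLayerTracker {
public:
    // Stands in for vkQueueSubmit: the batch's fence and semaphores become in flight.
    void queue_submit(const std::string& fence,
                      const std::vector<std::string>& semaphores) {
        in_flight_.insert(fence);
        in_flight_.insert(semaphores.begin(), semaphores.end());
        batch_done_on_gpu_ = false;
    }

    // The GPU actually finishing the batch. The tracker cannot observe this
    // by itself, so nothing is retired here.
    void gpu_completes_batch() { batch_done_on_gpu_ = true; }

    // Stands in for vkGetFenceStatus: if the batch is done, the application has
    // now observed completion, so every object of the batch is retired.
    bool get_fence_status() {
        if (batch_done_on_gpu_) {
            in_flight_.clear();
        }
        return batch_done_on_gpu_;
    }

    // Stands in for vkWaitForFences / vkQueueWaitIdle: after the (modelled)
    // blocking wait the batch is finished, and completion has been observed.
    void wait_for_completion() {
        batch_done_on_gpu_ = true;
        in_flight_.clear();
    }

    // Stands in for vkDestroyFence / vkDestroySemaphore: returns false where a
    // real layer would report an in-use error such as
    // VUID-vkDestroyFence-fence-01120.
    bool destroy(const std::string& object) const {
        return in_flight_.count(object) == 0;
    }

private:
    std::set<std::string> in_flight_;  // objects the tracker still considers in use
    bool batch_done_on_gpu_ = false;   // ground truth the tracker cannot see directly
};
```

In this model, submitting a batch, calling `gpu_completes_batch()` and then `destroy("fence")` still fails, just as in the question, because nothing told the tracker that the batch had finished. Calling `get_fence_status()` or `wait_for_completion()` first makes the destruction valid, and it also frees the semaphores belonging to the same batch. In real code, the corresponding step is to wait on (or successfully query) the fence of the last submission before destroying the synchronization objects.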