[SOLVED] Fastest way to get Storage Buffer to host from compute shader in Vulkan

I have a large Storage Buffer of ~4.6MB that I pass through a compute shader and then read back on the host at the end of the render loop. Could someone suggest the fastest way of going about this? Without the host read the app runs at about 3000 FPS on my machine; with it, about 800 FPS.

Whilst trying to improve performance I ended up with 2 Storage Buffers: the 1st is both host visible and host cached and is used for both reads and writes from the host; the 2nd stores the output of the compute shader and is then copied into the 1st buffer.

At the moment my input Storage Buffer, which receives data from the host, has usage:

VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT

And properties:

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT

And my second buffer, the compute shader output buffer, has usage:

VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT

And properties:

VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT

As I use the same Storage Buffer in the fragment shader, I ended up with these stage flags for the Descriptor Sets:

VK_SHADER_STAGE_FRAGMENT_BIT | VK_SHADER_STAGE_COMPUTE_BIT

For synchronization I am using a compute-to-compute barrier and a compute-to-graphics barrier; the rest follows the docs.

Solution

"The performance of the app without the host read is about 3000 FPS on my machine and 800 FPS with it."

This is a prime example of FPS being a misleading performance metric. Raw frame time is better, because it makes the absolute time difference much clearer.

At 3000 FPS, the compute-only frame really takes about 0.33 ms (1/3000 s), while at 800 FPS the compute+transfer frame takes 1.25 ms (1/800 s). This means the difference, the cost of the transfer, is only about 0.92 ms.

That's hardly unreasonable for moving 4.6 MiB of data. If you work out the throughput, 4.6 MiB in 0.92 ms is about 5 GiB per second, which is quite reasonable (depending on hardware) and not something you should expect to be able to "optimize" much further.
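To make the arithmetic above easy to redo with your own numbers, here is a minimal sketch in C. The function names are made up for illustration and are not part of any API; the only inputs are the two FPS figures and the buffer size from the question.

```c
#include <stdio.h>

/* Convert a frame rate (frames per second) to a frame time in milliseconds. */
double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra milliseconds per frame caused by a feature, given the FPS
 * measured without it and with it. */
double added_cost_ms(double fps_without, double fps_with)
{
    return frame_time_ms(fps_with) - frame_time_ms(fps_without);
}

/* Throughput in MiB per second for moving `mib` MiB in `ms` milliseconds. */
double throughput_mib_per_s(double mib, double ms)
{
    return mib / (ms / 1000.0);
}

/* Print the breakdown for one pair of measurements. */
void print_transfer_report(double fps_without, double fps_with, double mib)
{
    double cost = added_cost_ms(fps_without, fps_with);
    printf("transfer cost: %.3f ms, throughput: %.0f MiB/s\n",
           cost, throughput_mib_per_s(mib, cost));
}
```

Calling print_transfer_report(3000.0, 800.0, 4.6) reproduces the figures from the answer: roughly 0.92 ms for the transfer, which works out to roughly 5000 MiB/s.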

The only trick really left that might buy you anything is to use a dedicated transfer queue for the copies, if you're not already doing so.
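If you want to try the dedicated transfer queue, the first step is finding a queue family that supports transfer but neither graphics nor compute. Below is a rough sketch of that selection logic in plain C. To keep it self-contained it does not include the Vulkan headers: the three constants are local stand-ins that mirror the values of VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and VK_QUEUE_TRANSFER_BIT, and the function name is hypothetical. In a real application the flags would come from the queueFlags member of each VkQueueFamilyProperties returned by vkGetPhysicalDeviceQueueFamilyProperties.

```c
#include <stdint.h>

/* Local stand-ins mirroring the VkQueueFlagBits values from the Vulkan
 * headers (assumption for this sketch: real code would use the VK_ names). */
enum {
    QUEUE_GRAPHICS_BIT = 0x1,
    QUEUE_COMPUTE_BIT  = 0x2,
    QUEUE_TRANSFER_BIT = 0x4
};

/* Return the index of the first queue family that supports transfer but
 * neither graphics nor compute, or -1 if the device exposes none.
 * `queue_flags[i]` holds the capability flags of family i. */
int find_dedicated_transfer_family(const uint32_t *queue_flags, uint32_t count)
{
    for (uint32_t i = 0; i < count; ++i) {
        uint32_t flags = queue_flags[i];
        int has_transfer = (flags & QUEUE_TRANSFER_BIT) != 0;
        int has_gfx_or_compute =
            (flags & (QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT)) != 0;
        if (has_transfer && !has_gfx_or_compute)
            return (int)i;
    }
    return -1;
}
```

Not every device has such a family, so the -1 case needs a fallback to the queue you already use. Also keep in mind that once the copy runs on a different queue family, the buffers either need VK_SHARING_MODE_CONCURRENT or explicit queue family ownership transfers, and the two queues have to be synchronized with a semaphore. Whether the result is actually faster depends on the hardware.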