Evaluating Mesh Shading: CPU Costs vs. GPU Gains (Vulkan API)

I recently switched a Vulkan renderer from the traditional vertex pipeline to mesh shading. On large meshes, I've experienced a significant performance drop — particularly in processing time. For example, whereas I could previously send a 20M triangle mesh to the GPU in just a couple of seconds (including loading from disk), it now takes around 15 seconds, with 13–14 seconds spent solely on processing the mesh into meshlets.

I'm not even doing anything fancy for meshlet construction yet — just looping over faces, triangulating them, and grouping them into meshlets with a maximum of 32 triangles each. Rendering performance is roughly on par with the old pipeline at the moment. That’s expected since I haven’t implemented vertex deduplication or any culling yet, which I hope will reduce render times and eventually make mesh shading outperform the traditional vertex pipeline.
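For reference, the grouping described above might look roughly like the sketch below. It assumes the faces are already triangulated into a flat index list, and it adds per-meshlet vertex deduplication, which the renderer in this post does not do yet. The `Meshlet` and `MeshletData` layouts, the `buildMeshlets` name, and the 64-vertex limit are all hypothetical choices for illustration, not the actual code from the renderer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical layout: each meshlet references a slice of two shared arrays.
struct Meshlet {
    uint32_t vertexOffset;   // first entry in MeshletData::meshletVertices
    uint32_t vertexCount;
    uint32_t triangleOffset; // first triangle in MeshletData::meshletTriangles
    uint32_t triangleCount;
};

struct MeshletData {
    std::vector<Meshlet>  meshlets;
    std::vector<uint32_t> meshletVertices;  // global vertex indices
    std::vector<uint8_t>  meshletTriangles; // meshlet-local indices, 3 per triangle
};

// Greedy grouping in input order: start a new meshlet whenever the triangle
// or vertex budget of the current one would be exceeded.
MeshletData buildMeshlets(const std::vector<uint32_t>& indices,
                          uint32_t maxTriangles = 32,
                          uint32_t maxVertices = 64) // assumed limit
{
    assert(indices.size() % 3 == 0);
    assert(maxTriangles >= 1 && maxVertices >= 3 && maxVertices <= 256);

    MeshletData out;
    std::unordered_map<uint32_t, uint8_t> local; // global -> local vertex index
    Meshlet current{0, 0, 0, 0};

    auto flush = [&] {
        if (current.triangleCount == 0) return;
        out.meshlets.push_back(current);
        current.vertexOffset   = static_cast<uint32_t>(out.meshletVertices.size());
        current.triangleOffset = static_cast<uint32_t>(out.meshletTriangles.size() / 3);
        current.vertexCount    = 0;
        current.triangleCount  = 0;
        local.clear();
    };

    for (std::size_t i = 0; i < indices.size(); i += 3) {
        // Count how many vertices of this triangle are new to the meshlet.
        uint32_t newVertices = 0;
        for (std::size_t k = 0; k < 3; ++k) {
            const uint32_t v = indices[i + k];
            bool repeatedInTriangle = false;
            for (std::size_t j = 0; j < k; ++j)
                if (indices[i + j] == v) repeatedInTriangle = true;
            if (!repeatedInTriangle && local.find(v) == local.end()) ++newVertices;
        }

        if (current.triangleCount == maxTriangles ||
            current.vertexCount + newVertices > maxVertices) {
            flush();
        }

        // Append the triangle, assigning local indices on first use.
        for (std::size_t k = 0; k < 3; ++k) {
            const uint32_t v = indices[i + k];
            auto it = local.find(v);
            if (it == local.end()) {
                it = local.emplace(v, static_cast<uint8_t>(current.vertexCount)).first;
                out.meshletVertices.push_back(v);
                ++current.vertexCount;
            }
            out.meshletTriangles.push_back(it->second);
        }
        ++current.triangleCount;
    }
    flush();
    return out;
}
```

Because triangles are taken in input order, the quality of the resulting meshlets depends entirely on how well the source index buffer is already ordered; this is the main thing a dedicated tool such as meshoptimizer improves on.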

That said, I’m definitely paying the price on the CPU side due to the heavy preprocessing required to convert meshes into meshlets. It would get even worse if I were to implement more advanced optimization techniques like those found in meshoptimizer, which I believe is generally recommended as an offline preprocessing tool. Unfortunately, I can’t preprocess data offline in my case.

So I’m genuinely interested in hearing how other developers are using mesh shading — and how practical they've found it to be in production, considering the cost of converting meshes into meshlets. There’s certainly room for improvement, especially by multithreading the preprocessing on the CPU, but that’s significant work. And even then, I’m not yet convinced it will match the speed of the old vertex pipeline’s preprocessing.
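As a rough idea of what the multithreaded version could look like: with fixed 32-triangle grouping and no state shared between meshlets, the triangle list can be cut into chunks aligned to the meshlet size, each chunk processed on its own thread, and the per-thread results concatenated in order. The sketch below only records triangle ranges as a stand-in for the real per-meshlet work; `TriangleRange`, `buildRanges`, and `buildRangesParallel` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for a meshlet: a contiguous run of triangles in the source mesh.
struct TriangleRange {
    uint32_t firstTriangle;
    uint32_t triangleCount;
};

// Serial worker: split triangles [begin, end) into runs of at most maxTriangles.
void buildRanges(uint32_t begin, uint32_t end, uint32_t maxTriangles,
                 std::vector<TriangleRange>& out)
{
    for (uint32_t t = begin; t < end; t += maxTriangles)
        out.push_back({t, std::min(maxTriangles, end - t)});
}

// Parallel driver: every thread gets a chunk whose start is a multiple of
// maxTriangles, so the merged output matches the serial result exactly.
std::vector<TriangleRange> buildRangesParallel(
    uint32_t triangleCount,
    uint32_t maxTriangles = 32,
    unsigned threadCount = std::thread::hardware_concurrency())
{
    assert(maxTriangles > 0);
    if (threadCount == 0) threadCount = 1; // hardware_concurrency() may return 0

    const uint32_t meshletCount = (triangleCount + maxTriangles - 1) / maxTriangles;
    const uint32_t meshletsPerThread = (meshletCount + threadCount - 1) / threadCount;

    // One output vector per thread, so no locking is needed while building.
    std::vector<std::vector<TriangleRange>> partial(threadCount);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < threadCount; ++i) {
        const uint64_t chunk = uint64_t(meshletsPerThread) * maxTriangles;
        const uint64_t begin = std::min<uint64_t>(uint64_t(i) * chunk, triangleCount);
        const uint64_t end   = std::min<uint64_t>(begin + chunk, triangleCount);
        if (begin == end) break; // no work left for the remaining threads

        workers.emplace_back([&partial, i, begin, end, maxTriangles] {
            buildRanges(static_cast<uint32_t>(begin), static_cast<uint32_t>(end),
                        maxTriangles, partial[i]);
        });
    }
    for (std::thread& w : workers) w.join();

    // Concatenate in thread order to preserve the original triangle order.
    std::vector<TriangleRange> result;
    result.reserve(meshletCount);
    for (const auto& p : partial) result.insert(result.end(), p.begin(), p.end());
    return result;
}
```

This clean split only holds while meshlet boundaries are predictable from the triangle count alone. Once a vertex budget or deduplication decides where a meshlet ends, chunks would have to be built independently and then merged with their offsets fixed up.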

How do you all approach or use mesh shading in practice?

Solution

Mesh shading is only an effective technique because you can impose useful structure on your data. Creating good input structure (efficient meshlets) allows you to apply higher-level optimizations (efficient meshlet culling) that a traditional vertex pipeline cannot benefit from.
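To make "structure" concrete, here is a minimal sketch of one common form of it: a bounding sphere stored per meshlet, plus a sphere-versus-plane test that lets a whole meshlet be rejected at once. In a mesh shading pipeline this kind of test typically runs on the GPU in a task shader; it is written as CPU-side C++ here purely for illustration. The `Vec3`, `Sphere`, and `Plane` types and both function names are hypothetical, and the sphere is a simple box-centered bound rather than a minimal one.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3   { float x, y, z; };
struct Sphere { Vec3 center; float radius; };
// Plane in the form dot(normal, p) + d = 0, with a unit-length normal.
// Points with dot(normal, p) + d >= 0 count as inside.
struct Plane  { Vec3 normal; float d; };

// Conservative bound: center of the axis-aligned box around the points,
// radius equal to the distance to the farthest point. Not the tightest sphere.
Sphere boundingSphere(const std::vector<Vec3>& points)
{
    assert(!points.empty());
    Vec3 lo = points[0], hi = points[0];
    for (const Vec3& p : points) {
        lo.x = std::min(lo.x, p.x); hi.x = std::max(hi.x, p.x);
        lo.y = std::min(lo.y, p.y); hi.y = std::max(hi.y, p.y);
        lo.z = std::min(lo.z, p.z); hi.z = std::max(hi.z, p.z);
    }
    const Vec3 c{(lo.x + hi.x) * 0.5f, (lo.y + hi.y) * 0.5f, (lo.z + hi.z) * 0.5f};

    float radiusSquared = 0.0f;
    for (const Vec3& p : points) {
        const float dx = p.x - c.x, dy = p.y - c.y, dz = p.z - c.z;
        radiusSquared = std::max(radiusSquared, dx * dx + dy * dy + dz * dz);
    }
    return {c, std::sqrt(radiusSquared)};
}

// True when the sphere lies entirely on the outside of the plane, so every
// triangle in the meshlet can be skipped for that frustum plane.
bool sphereOutsidePlane(const Sphere& s, const Plane& plane)
{
    const float signedDistance = plane.normal.x * s.center.x +
                                 plane.normal.y * s.center.y +
                                 plane.normal.z * s.center.z + plane.d;
    return signedDistance < -s.radius;
}
```

A meshlet is culled if its sphere is outside any one of the frustum planes. The traditional vertex pipeline has no equivalent per-cluster rejection step, which is where the potential win comes from.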

It's not a magic bullet that is guaranteed to be faster. You need to impose that structure to get the benefits, and that of course has a processing cost. As with most optimizations, there is a tradeoff between the cost of the asset processing and the benefit of reduced geometry handling on the GPU. Where possible, shift that processing offline. If you can't, you need to decide whether the GPU savings are worth the added runtime preprocessing overhead.