[SOLVED] Primitive assembly performance

I am currently experimenting with some terrain/height-map rendering. Each tile of that terrain is rendered with a VBO and an IBO. To be able to draw subtiles easily, I ordered the indices using Morton coding, and at this point some questions about primitive assembly came to mind.

Primitive assembly happens after vertex processing, but two things are unclear to me:

How does the GPU know which vertices to process? Maybe some of them are not indexed. Do they still get processed?
How does the GPU know in which order the vertices have to be processed? Maybe a triangle uses the first and the last vertex of the VBO, so would the primitive assembly stage have to wait until the whole VBO is processed?

Solution

How does the GPU know which vertices to process? Maybe some of them are not indexed. Do they still get processed?

Your index buffer and the range of vertices in your draw call determine which vertices are processed and also define the order of use during primitive assembly. Any vertex not covered in that range of vertices/indices does not need to be processed.
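To make that concrete, here is a minimal sketch that models which vertices an indexed draw would touch. It is only an illustration in plain Python, not a GPU API: the function name `referenced_vertices` and the `first`/`count` parameters are made up for this example and loosely mirror the range you pass to an indexed draw call.

```python
def referenced_vertices(indices, first=0, count=None):
    """Return the distinct vertex indices used by the index range
    [first, first + count), in order of first use.

    Hypothetical helper for illustration; it only models the idea that
    the index range, not the size of the VBO, decides what is processed.
    """
    if count is None:
        count = len(indices) - first
    seen = set()
    order = []
    for i in indices[first:first + count]:
        if i not in seen:
            seen.add(i)
            order.append(i)
    return order


# Assumed 6-vertex buffer (vertices 0..5); vertex 4 is never indexed.
indices = [0, 1, 2, 2, 1, 3, 3, 5, 0]

print(referenced_vertices(indices))                    # [0, 1, 2, 3, 5]
print(referenced_vertices(indices, first=3, count=3))  # [2, 1, 3]
```

Vertex 4 never shows up in the result, and drawing only the second triangle touches just three vertices, no matter how large the vertex buffer is.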

How does the GPU know in which order the vertices have to be processed? Maybe a triangle uses the first and the last vertex of the VBO, so would the primitive assembly stage have to wait until the whole VBO is processed?

The order vertices were processed in is not particularly important by the time you arrive at primitive assembly; there is no order-dependence at the vertex shader level (the vertices could have all been processed in parallel for all you know). All you need to know is that the results of a vertex shader are appended to a special buffer called the post-transform cache.

Primitive assembly (and a Geometry Shader, if you use one, since it consumes whole primitives) fetches its input vertices from the post-transform cache, and it does so in a specific order: the order given by your indices. Given a traditional FIFO implementation of the post-transform cache, that order dictates cache replacement, and strip-based primitives tend to maximize the cache hit rate during primitive assembly. A cache miss in the post-transform cache would cause a stall, but only for the vertices that were not in the cache, which then have to be run through the vertex shader. It is not going to stall while every vertex in your vertex buffer is unnecessarily processed.
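If you want to see the effect of index order on such a cache, a rough simulation is enough. The sketch below assumes a plain FIFO cache with an arbitrary size of 4 entries; real cache sizes and replacement policies vary by GPU and are not something this example tries to reproduce. The function name `fifo_cache_misses` is made up for illustration.

```python
from collections import deque


def fifo_cache_misses(indices, cache_size):
    """Count misses for an index stream against a FIFO cache.

    Each miss stands for one vertex shader invocation. On a hit the
    cache is left untouched (FIFO, not LRU); on a miss the vertex is
    appended and the oldest entry is evicted once the cache is full.
    """
    cache = deque()
    misses = 0
    for i in indices:
        if i not in cache:
            misses += 1
            cache.append(i)
            if len(cache) > cache_size:
                cache.popleft()
    return misses


# Six triangles over 8 vertices, listed in strip order.
strip_order = [0, 1, 2,  2, 1, 3,  2, 3, 4,  4, 3, 5,  4, 5, 6,  6, 5, 7]

# The same six triangles, but jumping back and forth along the strip.
scattered = [0, 1, 2,  4, 5, 6,  2, 1, 3,  6, 5, 7,  2, 3, 4,  4, 3, 5]

print(fifo_cache_misses(strip_order, cache_size=4))  # 8
print(fifo_cache_misses(scattered, cache_size=4))    # 12
```

In strip order every vertex is shaded exactly once (8 misses for 8 vertices). The scattered order draws the same triangles but shades four vertices a second time. Either way, a miss costs one extra vertex, not a pass over the whole buffer.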

The good news is that most modeling software these days outputs vertices in a cache-efficient order, and the cache is larger and smarter than ever, so this is not something you often have to worry about. Fifteen years ago vertex caching was a very hot topic, and everyone you talked to had their own theory about what worked best. Now it is largely a waste of time, and strip order is probably as far as you want to take it.