Does the Indexed Instancing in the D3D12Bundles sample actually improve performance?

First Question

The code snippet in the docs calls DrawIndexedInstanced in a for loop:

for (UINT i = 0; i < m_cityRowCount; i++) {
    for (UINT j = 0; j < m_cityColumnCount; j++) {
        pCommandList->DrawIndexedInstanced(numIndices, 1, 0, 0, 0);
    }
}

But the API

void DrawIndexedInstanced(
  [in] UINT IndexCountPerInstance,
  [in] UINT InstanceCount,
  [in] UINT StartIndexLocation,
  [in] INT  BaseVertexLocation,
  [in] UINT StartInstanceLocation
);
void DrawInstanced(
  [in] UINT VertexCountPerInstance,
  [in] UINT InstanceCount,
  [in] UINT StartVertexLocation,
  [in] UINT StartInstanceLocation
);

has StartInstanceLocation and InstanceCount parameters, which I assume have the effect of offsetting by InstanceIndex*StartInstanceLocation.

So are the following equivalent?

DrawIndexedInstanced(100, 2, 0,   0, 100);
//vs
DrawIndexedInstanced(100, 1,   0, 0,   0);
DrawIndexedInstanced(100, 1, 100, 0,   0);

DrawInstanced(100, 2,   0, 100);
//vs
DrawInstanced(100, 1,   0,   0);
DrawInstanced(100, 1, 100,   0);

Second Question

How does instancing improve performance in the D3D12Bundles sample referred to by the docs? They call SetPipelineState between each instance, and the constant buffer used for g_mWorldViewProj in the vertex shader also changes for each instance. How does anything get reused?

for (UINT i = 0; i < m_cityRowCount; i++) {
    for (UINT j = 0; j < m_cityColumnCount; j++) {
        // Alternate which PSO to use; the pixel shader is different on 
        // each just as a PSO setting demonstration.
        pCommandList->SetPipelineState(usePso1 ? pPso1 : pPso2);
        usePso1 = !usePso1;

        // Set this city's CBV table and move to the next descriptor.
        pCommandList->SetGraphicsRootDescriptorTable(2, cbvSrvHandle);
        cbvSrvHandle.Offset(cbvSrvDescriptorSize);

        pCommandList->DrawIndexedInstanced(numIndices, 1, 0, 0, 0);
    }
}

Solution

The canonical sample for instancing is InstancingFX11, rather than D3D12Bundles, which is linked from the docs for DrawIndexedInstanced() but only benefits from the indexing, not the instancing.

The writer of the InstancingFX11 sample wrote some comments on how to use instancing properly:

Pay attention to the code defining the buffer in Instancing.cpp, as this basically implements 2 vertex buffers: 1 for the geometry and the other for the instance data (matrices in this case). Adding the 2nd buffer is like adding another for loop around the draw call (but a lot more efficient).

Your instancing example only discusses adding an instance ID system variable. Instancing requires a 2nd vertex buffer bound to the draw context which contains unique data such as, say, World translation matrices. You then update your signature with the definition of the 2nd buffer, also defining in your HLSL code that it will receive instancing data. Your example is a single-buffer version where you may use a constant buffer and the instance ID to look up within that. This is less efficient.

Looking up the data in the vertex shader means that the data cannot be inlined by the driver. Any precaching/setup on the GPU wavefront is wasted. For each vertex you visit, the hardware now goes and looks up the related array entry rather than it being loaded once by the hardware and passed in as an argument of your vertex shader.

The key to performance, then, is that the first vertex buffer (containing the actual vertices) stays the same for every instance, while only the second vertex buffer (containing the different World translation matrices) advances between instances.
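To make the "another for loop around the draw call" idea concrete, here is a minimal CPU-side sketch. It is not D3D code: Vertex, Instance, ShadedVertex and drawInstanced are hypothetical names made up for illustration, and a simple 2D translation stands in for the World matrix. It only models which buffer is re-read and which one advances.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two vertex buffers (not real D3D types).
struct Vertex   { float x, y; };    // stream 0: per-vertex geometry
struct Instance { float dx, dy; };  // stream 1: per-instance translation

struct ShadedVertex { float x, y; };

// Models a single instanced draw: the instance loop lives inside the draw,
// so the application issues one call instead of one call per instance.
std::vector<ShadedVertex> drawInstanced(const std::vector<Vertex>& geometry,
                                        const std::vector<Instance>& instances) {
    std::vector<ShadedVertex> out;
    out.reserve(geometry.size() * instances.size());
    for (const Instance& inst : instances) {  // stream 1 advances once per instance
        for (const Vertex& v : geometry) {    // stream 0 is re-read unchanged
            // Stand-in for the vertex shader: the instance data arrives as an
            // input next to the vertex, with no lookup by instance ID.
            out.push_back({v.x + inst.dx, v.y + inst.dy});
        }
    }
    return out;
}
```

With a 3-vertex triangle and 2 instances, this yields 6 shaded vertices: the same triangle twice, the second copy shifted by the second instance's translation.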

Re: first question

Seems like I was completely wrong about the first question. The two are not equivalent. In the D3D11_INPUT_ELEMENT_DESC structure, the description of the InstanceDataStepRate member:

The number of instances to draw using the same per-instance data before advancing in the buffer by one element. This value must be 0 for an element that contains per-vertex data (the slot class is set to D3D11_INPUT_PER_VERTEX_DATA).

means that the D3D11_INPUT_PER_VERTEX_DATA vertex buffers do not differ between instances. So the sequential draw calls in my example are not equivalent to the single instanced draw call, whoops.
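As a rough check of why the two forms differ, the sketch below models which index-buffer slot and which per-instance element each draw would read. It is a simplified model under stated assumptions, not the real input assembler: Fetch and simulateDrawIndexedInstanced are hypothetical names, BaseVertexLocation is taken to be 0 as in the question, and the per-instance element is assumed to use a step rate of 1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One fetch performed for a single vertex of a single instance (hypothetical model).
struct Fetch {
    uint32_t indexSlot;     // index-buffer slot that is read
    uint32_t instanceElem;  // per-instance element that is read (step rate 1 assumed)
    uint32_t instanceId;    // value the shader would see as SV_InstanceID
};

// Simplified model of DrawIndexedInstanced with BaseVertexLocation == 0.
std::vector<Fetch> simulateDrawIndexedInstanced(uint32_t indexCountPerInstance,
                                                uint32_t instanceCount,
                                                uint32_t startIndexLocation,
                                                uint32_t startInstanceLocation) {
    std::vector<Fetch> fetches;
    for (uint32_t inst = 0; inst < instanceCount; ++inst) {
        for (uint32_t i = 0; i < indexCountPerInstance; ++i) {
            // Every instance re-reads the same index range; only the
            // per-instance element moves, offset by StartInstanceLocation.
            fetches.push_back({startIndexLocation + i,
                               startInstanceLocation + inst,
                               inst});
        }
    }
    return fetches;
}
```

In this model, DrawIndexedInstanced(100, 2, 0, 0, 100) reads index slots 0..99 twice, with per-instance elements 100 and 101. The pair of single-instance draws reads index slots 0..99 and then 100..199, both with per-instance element 0, so different geometry and different instance data are fetched.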