shader hlsl direct3d9

How can I optimise an SM3 HLSL pixel shader by only executing complex code for some pixels?


I have a really complex HLSL shader doing tons of texture reads, using shader model 3 in Direct3D9. The complex code is only used at some pixels so I put an if-statement around that block of code. To my surprise this gives no performance gain at all. If I use clip(-1) instead I do see an enormous performance boost, so this shader is indeed the bottleneck of my program. Why doesn't the branching improve my performance without the clip(-1) line?

I found this topic: "How much performance do conditionals and unused samplers/textures add to SM2/3 pixel shaders?" It states that in shader model 3 it is possible to optimise with branching, but that the performance is that of the worst pixel in each batch of pixels. In my case the slow branch is taken mostly at the edges of the screen and the fast branch mostly at the centre of the screen. I think this means that the pixels in a batch will generally take the same branch, so I would expect a performance gain this way.

In pseudo-code the pixel shader looks like this:

float4 colour = tex2D(texture, uv);
if (colour.a < 0.5f)
{
    //I only get a performance boost if I replace this line with clip(-1);
    oColour = colour;
}
else
{
    complexSlowCodeWithTonsOfTextureReadsGoesHere;
    oColour = result;
}
oColour *= 2;

This gives me the exact same performance as when I remove the branching and always use the code in the slow else-branch. If I replace the fifth line with clip(-1) I see an enormous performance boost (and a mostly black screen) so the if-statement is actually functioning.

Am I doing something wrong here or is it not possible to optimise a shader like this in shader model 3?


Solution

  • The problem is that your if is being flattened (both branches are executed and the result of the untaken branch is discarded), because you are using gradient functions such as tex2D in one of the branches (doc). You should see the performance gain if you remove those functions from the branches or replace them with functions that take an explicit mip level or explicit gradients, such as tex2Dlod or tex2Dgrad (with the gradients computed outside the branch). The compiler can help you find the problematic lines if you add [branch] before your if. This attribute tells the compiler that you want a real dynamic branch, and compilation will fail if the branch contains gradient functions.

    As far as my experience goes, the GPU shades fragments in 2x2 blocks. It needs the neighbouring fragments to work out which mip level to use for a texture lookup. This prevents a tex2D call from being branched away, because the adjacent fragments depend on it for their own calculations. If you give the GPU that information yourself by passing the mip level, the other fragments are no longer needed, so the branch can really be skipped.
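
To illustrate the suggestion above, here is a minimal sketch of the question's shader restructured so that the branch can stay dynamic. The sampler names (maskSampler, detailSampler) and the two reads standing in for the slow path are hypothetical placeholders, not taken from the original shader. It assumes a ps_3_0 target and that explicit gradients or a fixed mip level are acceptable for the reads inside the branch.

```hlsl
// Hypothetical samplers: placeholders for whatever the real shader binds.
sampler2D maskSampler   : register(s0);
sampler2D detailSampler : register(s1);

float4 main(float2 uv : TEXCOORD0) : COLOR0
{
    // An implicit-gradient read is fine here because it is outside any branch.
    float4 colour = tex2D(maskSampler, uv);

    // Take the derivatives before the branch, where every fragment of the
    // 2x2 block still executes the same instructions.
    float2 dx = ddx(uv);
    float2 dy = ddy(uv);

    // Fast path by default: just pass the first read through.
    float4 result = colour;

    // Ask for a real dynamic branch. If a gradient function such as tex2D
    // were left inside, compilation would fail instead of silently flattening.
    [branch]
    if (colour.a >= 0.5f)
    {
        // Stand-in for the slow path, using only explicit-gradient or
        // explicit-LOD reads.
        result  = tex2Dgrad(detailSampler, uv, dx, dy);            // keeps normal mip selection
        result += tex2Dlod(detailSampler, float4(uv, 0.0f, 0.0f)); // always samples mip 0
    }

    return result * 2;
}
```

tex2Dgrad with gradients taken outside the branch should filter the same way tex2D would, whereas tex2Dlod with a fixed level gives up automatic mip selection, so which one fits depends on the texture being read.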