unity-game-engine optimization shader

How to optimize palette cycle in Unity shader?


I have a problem with a badly optimized palette-cycling function in my background shader:

Shader "Background/Earthbound"
{
// 6 things to solve
// X palette cycling
// X background scrolling
// X horizontal oscillation
// X vertical oscillation
// X interleaved oscillation
// X transparency

Properties
{
    [Toggle] _Blend("Blend?", int) = 0

    [Header(Texture A)]
    _TexA ("Texture", 2D) = "white" {}      // ensure "Repeat" wrap mode
    _PaletteA("Palette Cycle", 2D) = "white" {} // ensure "Clamp" wrap mode
    [Enum(None,0,Horizontal,1,Interleaved,2,Vertical,3)] _OscillationVariantA("Oscillation Variant", int) = 0
    _ScrollDirXA("Scroll Direction X", float) = 1
    _ScrollDirYA("Scroll Direction Y", float) = 1
    _ScrollSpeedA("Scroll Speed", float) = 0
    _OscillationSpeedA("Oscillation Speed", float) = 1
    _OscillationAmplitudeA("Oscillation Amplitude", int) = 32
    _OscillationDelayA("Oscillation Delay", int) = 1

    [Header(Texture B)]
    _TexB("Texture", 2D) = "white" {}
    _PaletteB("Palette Cycle", 2D) = "white" {}
    [Enum(None,0,Horizontal,1,Interleaved,2,Vertical,3)] _OscillationVariantB("Oscillation Variant", int) = 0
    _ScrollDirXB("Scroll Direction X", float) = 1
    _ScrollDirYB("Scroll Direction Y", float) = 1
    _ScrollSpeedB("Scroll Speed", float) = 0
    _OscillationSpeedB("Oscillation Speed", float) = 1
    _OscillationAmplitudeB("Oscillation Amplitude", int) = 32
    _OscillationDelayB("Oscillation Delay", int) = 1
}
SubShader
{
    Tags { "RenderType"="Opaque" }
    LOD 100
        ...
        ...
        ...
        // palette cycling (too expensive right now...)
        float4 paletteCycle(float4 inCol, sampler2D paletteCycle, float paletteCount)
        {
            float4 outCol = inCol;

            int paletteIndex = -1;
            for (int i = 0; i < paletteCount; i++)
            {
                if (inCol.a == tex2D(paletteCycle, float2(i / paletteCount, 0)).a) // match alpha values (greyscale)
                {
                    paletteIndex = i;
                }
            }
            if (paletteIndex >= 0)
            {
                int paletteOffset = (paletteIndex + _Time.y * 12) % paletteCount;
                outCol = tex2D(paletteCycle, float2(paletteOffset / paletteCount, 0));
            }
            return outCol;
        }
     }

I use 2 grayscale sprites for the background animation - main bg (256x256) with "Repeat" option and palette (17x1) with "Clamp" option.

How can I optimize it?

Unity Version: 2020.


Solution

  • 1st: for loops are bad unless they run only a small number of iterations. You can cap them with [unroll(max number of iterations)]. Don't pass paletteCount into the function; use it as a constant, so the compiler can clearly see a fixed-length loop and optimize it.

    2nd: there was a comment suggesting a break in the loop. I respect the desire to help, but to be honest that is not the issue here and it won't help. Fast-path optimizations of that kind generally fail on a GPU (there are exceptions, but only at workgroup scale in compute shaders). A GPU is a SIMD device, so the cost of a task is set by its slowest thread.

    3rd: texture sampling isn't fast, especially when you sample the same texture many times by hand, as in your example. You also use sampler2D, which ties a sampler to the texture. Prefer a separate SamplerState and Texture2D, because the number of samplers is limited and much lower than the number of textures (on DX11, 16 samplers versus 128 textures), so it is worth making a habit of the better way.

    4th: sample an explicit texture LOD. For a background, the LOD depends directly on your screen resolution. That means tex2D is not optimal: it has to work out the mip level itself and, with trilinear filtering, it samples the two nearest mip levels and interpolates between them. Use pointClampSampler and Texture2D.SampleLevel() instead.

    5th: reduce your texture format. Use the smallest one possible. If the texture can be just a mask that binds a specific color to one of 256 values, that would be perfect: you can pack it into an 8-bit single-channel texture, which is efficient to sample.

    6th: fewer comparisons. It won't change much, but still: if the "not found" value is -1, check != -1 rather than >= 0. I replaced all the ifs with ? : expressions, because that makes it clearer what will actually happen on the GPU.

    I put some of my thoughts into the code below; check it out, it may make things clearer. You also have a time dependency, and I'm not sure how it should work in your case. Why not increase the speed of time instead of doing iterations? And why implement a search algorithm inside the shader at all? Maybe you can pass something in from outside, such as a frame-specific value, and check only one texture?

    Sorry if I made any typos or other mistakes:

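    ...
    SamplerState pointClampSampler; // Unity inline sampler: point filter, clamp wrap
    uint paletteCount;              // better as a compile-time constant (see the 1st point)
    
    // paletteTex was called paletteCycle in the question; renamed so it does not shadow the function name
    float4 paletteCycle(float4 inCol, Texture2D paletteTex, uint lod) {
        int paletteIndex = -1;
        // 8 texture samples is already a big deal, so no more will be optimal
        // note: [unroll(8)] limits the loop to at most 8 iterations, so only the first 8 palette entries are searched
        [unroll(8)]
        for (uint i = 0; i < paletteCount; i++) {
            // (i + 0.5) / count hits the texel centre; the float math also avoids integer division
            float u = (i + 0.5) / paletteCount;
            // match alpha values (greyscale); may fail if inCol is calculated dynamically and differs a bit
            paletteIndex = (inCol.a == paletteTex.SampleLevel(pointClampSampler, float2(u, 0), lod).a) ? (int)i : paletteIndex;
        }
    
        // wrap the time-advanced index; the result is only used when a match was found
        uint paletteOffset = (uint)(paletteIndex + (int)(_Time.y * 12)) % paletteCount;
        float4 outCol = paletteTex.SampleLevel(pointClampSampler, float2((paletteOffset + 0.5) / paletteCount, 0), lod);
        return (paletteIndex != -1) ? outCol : inCol;
    }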
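
The closing suggestion above (avoid the search altogether) could look roughly like the sketch below. It is not the asker's shader and rests on an assumption that is not in the original post: the background texture would have to be re-authored so that its alpha channel stores the palette index directly as index / 255 (8-bit, uncompressed, no sRGB conversion). The names _PaletteTex, PALETTE_COUNT, paletteCycleIndexed and cycleSpeed are hypothetical; _Time is Unity's built-in time vector and is available once UnityCG.cginc is included.

```hlsl
// Sketch only: assumes alpha encodes the palette index as index / 255.
// _PaletteTex, PALETTE_COUNT and paletteCycleIndexed are illustrative names.
Texture2D _PaletteTex;              // 17x1 palette strip, as described in the question
SamplerState pointClampSampler;     // Unity inline sampler: point filter, clamp wrap

static const uint PALETTE_COUNT = 17;   // compile-time constant instead of a parameter

float4 paletteCycleIndexed(float4 inCol, float cycleSpeed)
{
    // Recover the integer palette index from the 8-bit alpha value.
    uint paletteIndex = (uint)round(inCol.a * 255.0);

    // Texels whose index lies outside the palette are returned unchanged.
    bool inPalette = paletteIndex < PALETTE_COUNT;

    // Advance by whole palette steps over time and wrap around.
    uint cycleStep = (uint)floor(_Time.y * cycleSpeed);
    uint paletteOffset = (paletteIndex + cycleStep) % PALETTE_COUNT;

    // One fetch at the texel centre of mip 0 replaces the whole search loop.
    float u = (paletteOffset + 0.5) / PALETTE_COUNT;
    float4 cycled = _PaletteTex.SampleLevel(pointClampSampler, float2(u, 0.5), 0);

    return inPalette ? cycled : inCol;
}
```

In the fragment function this would be called as paletteCycleIndexed(col, 12) to keep the original rate of 12 palette steps per second. The trade-off is that the index encoding moves into the art pipeline: the shader does one palette fetch per pixel, but the background sprite has to be exported with indices rather than arbitrary grey levels.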