c++ performance rendering alpha-blending

Drawing with alpha blending: is it normal that it's so slow, or am I doing something wrong?


I'm trying to do alpha blending to display a picture. I'm using 64-bit GCC 5.0.0 in Code::Blocks (I don't know which version, but it's from 2016).

I'm not using the GPU; everything happens on a single core of a 2.4 GHz CPU. Rendering shares that core with input handling and program logic, but those don't take much time.

The function below takes the X and Y coordinates of where to draw, and a SurfaceData describing what to draw. There is also an optional qAlpha value if we want the whole picture to be semi-transparent.

Ignore my dumb "out of bounds protection" (the first four if statements), unless you have a suggestion for it.

My question is about the speed of this function; it's the first time I've done alpha blending. My program renders at a resolution of 1360 x 768.

Without drawing anything on screen (only clearing and displaying) I get around 4000 FPS. After drawing the UI and some debug text I get around 1300 FPS. As soon as I draw the FIRST alpha-blended picture, I lose ~500 FPS.

The picture I'm using is 283 x 600 px.

When I try to draw 100 pictures (one on top of another), my FPS drops to ~20.

I would like to know if I'm doing something wrong; I was really expecting this to be a lot faster.

    void qDrawPictureAlpha( int _x, int _y, qlSurfaceData &_Data, double _qAlpha = 1.0 ){
        int w = _Data.RenderXSize,
            h = _Data.RenderYSize,
            t = w * h;

        // Crude bounds check: rejects the whole draw instead of clipping it.
        if( _x < 0 ) return;
        if( _x + w > qData.RenderXSize ) return;
        if( _y < 0 ) return;
        if( _y + h > qData.RenderYSize ) return;

        // Start offset into the target buffer (one before the first pixel;
        // see the pointer stepping at the bottom of the loop).
        int TargetP = _x + ( _y * qData.RenderXSize ) - 1;

        unsigned int SrcPixel;
        unsigned int SrcAlpha;
        unsigned int InvAlpha;
        unsigned int TgtPixel;

        unsigned int *Source = &_Data.qPixel[0];
        unsigned int *Target = &qData.qPixel[0];
        Target += TargetP;

        w--; t++;
        unsigned int Temp = w;    // pixels remaining in the current row

        for( int c = 1; c < t; c++ ){
            SrcPixel = *Source++;
            TgtPixel = *Target;
            SrcAlpha = ((SrcPixel >> 24) & 0xFF) * _qAlpha;
            InvAlpha = 255 - SrcAlpha;

            // Per channel: (Src*SrcAlpha >> 8) + (Tgt*InvAlpha >> 8),
            // using >> 8 as a cheap approximation of dividing by 255.
            *Target =
                ( ( (((( SrcPixel >> 16 ) & 0xFF ) * SrcAlpha ) >> 8 ) + (((( TgtPixel >> 16 ) & 0xFF ) * InvAlpha ) >> 8 ) ) << 16 ) |
                ( ( (((( SrcPixel >>  8 ) & 0xFF ) * SrcAlpha ) >> 8 ) + (((( TgtPixel >>  8 ) & 0xFF ) * InvAlpha ) >> 8 ) ) <<  8 ) |
                ( ((( SrcPixel & 0xFF ) * SrcAlpha ) >> 8 ) + ((( TgtPixel & 0xFF ) * InvAlpha ) >> 8 ) );

            // End of row: jump the target pointer to the start of the next row.
            if( !Temp ){ Temp = w; Target += qData.RenderXSize - w; }
            else       { Temp--; Target++; }
        }
    }

As you can see in the code, I've tried a few things; for example, instead of multiplying and dividing by 255, I just shift bits.

I also thought about precalculating all possible 65k pixel/alpha blends into an array, but that didn't give me much performance.

I just want to know whether I should start thinking about CUDA or something else on the GPU, or whether there is something I can do to the current code so that it can render at least 250 pictures with alpha blending on the CPU.


Solution

  • It seems about right, considering that you are not using any tricks, SIMD, or the GPU. 283 pixels wide times 600 pixels high times 100 images times 20 fps means your one core is calculating 339,600,000 pixels per second. I count at least 29 calculations per pixel. At 2.4 billion clock cycles per second, that is roughly 10 billion calculations per second; I'm not sure what more you could expect, really. That's assuming the compiler hasn't figured out a way to do some of it with fewer calculations.

    If you don't have optimization turned on, turn it on and it should get a lot faster. Based on the numbers I just cited, I guess you already have it turned on.

    Your main option is to do less work. Do you really have 100 images on top of each other that change 20 times per second? I don't believe that. You only need to recalculate a pixel's colour when it actually changes, you know.

    Do you really need to do ((Src * SrcAlpha) >> 8) + ((Dst * InvAlpha) >> 8)? Or is it perhaps faster to do Dst + (((Src - Dst) * SrcAlpha) >> 8), with one less multiplication and one less bit shift, and possibly a little more rounding error?

    Your other main option is to use the GPU, which is designed for this task with parallel processing and special processing units.

    There's also SIMD, which reduces the number of calculations per pixel: you should be able to process the red, green and blue of 8 pixels all at once in a single 256-bit calculation. SIMD programming is fiddly, though.