c++winapitbbpplparallel.for

Parallelizing a for loop gives no performance gain


I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel. The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.

However, when I tried parallelizing the main forwith MS's cuncurrency::parallel_for I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).

Using Intel's compiler with tbb::parallel_for gave near exact results. For comparison, I started out with this algorithm implemented in C# in which I also used parallel_for loops and there I encountered near X4 performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).

Any ideas what is preventing my code from parallelizing well?

My code:

template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
        //Translates index offset in Bayer image to channel offset in RGB image
        int offsets[4];
        //calculate offsets according to color space
        switch (ColorSpace)
        {
        case ColorSpace::BGGR:
            offsets[0] = 2;
            offsets[1] = 1;
            offsets[2] = 1;
            offsets[3] = 0;
            break;
        ...other color spaces
        }
        memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        parallel_for(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row%2)*2 + (col%2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
}

Solution

  • First of all, your algorithm is memory bandwidth bounded. That is memory load/store would outweigh any index calculations you do.

    Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.

    Increasing work amount per iteration is also useless - both PPL and TBB are smart enough, to not create thread per iteration, they would use some good partition, which would additionaly try to preserve locality. For instance, here is quote from TBB::parallel_for:

    When worker threads are available, parallel_for executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.

    What really matters is to reduce memory operations. Any superfluous traversal over input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.

    You are fully traversing input and output data. Even if you skip something in output - that doesn't mater, because memory operations are happening by 64 byte chunks at modern hardware. So, calculate size of your input and output, measure time of algorithm, divide size/time and compare result with maximal characteristics of your system (for instance, measure with benchmark).

    I have made test for Microsoft PPL, OpenMP and Native for, results are (I used 8x of your height):

    Native_For       0.21 s
    OpenMP_For       0.15 s
    Intel_TBB_For    0.15 s
    MS_PPL_For       0.15 s
    

    If remove memset then:

    Native_For       0.15 s
    OpenMP_For       0.09 s
    Intel_TBB_For    0.09 s
    MS_PPL_For       0.09 s
    

    As you can see memset (which is highly optimized) is responsoble for significant amount of execution time, which shows how your algorithm is memory bounded.

    FULL SOURCE CODE:

    #include <boost/exception/detail/type_info.hpp>
    #include <boost/mpl/for_each.hpp>
    #include <boost/mpl/vector.hpp>
    #include <boost/progress.hpp>
    #include <tbb/tbb.h>
    #include <iostream>
    #include <ostream>
    #include <vector>
    #include <string>
    #include <omp.h>
    #include <ppl.h>
    
    using namespace boost;
    using namespace std;
    
    const auto Width = 3264;
    const auto Height = 2540*8;
    
    struct MS_PPL_For
    {
        template<typename F,typename Index>
        void operator()(Index first,Index last,F f) const
        {
            concurrency::parallel_for(first,last,f);
        }
    };
    
    struct Intel_TBB_For
    {
        template<typename F,typename Index>
        void operator()(Index first,Index last,F f) const
        {
            tbb::parallel_for(first,last,f);
        }
    };
    
    struct Native_For
    {
        template<typename F,typename Index>
        void operator()(Index first,Index last,F f) const
        {
            for(; first!=last; ++first) f(first);
        }
    };
    
    struct OpenMP_For
    {
        template<typename F,typename Index>
        void operator()(Index first,Index last,F f) const
        {
            #pragma omp parallel for
            for(auto i=first; i<last; ++i) f(i);
        }
    };
    
    template<typename T>
    struct ConvertBayerToRgbImageAsIs
    {
        const T* BayerChannel;
        T* RgbChannel;
        template<typename For>
        void operator()(For for_)
        {
            cout << type_name<For>() << "\t";
            progress_timer t;
            int offsets[] = {2,1,1,0};
            //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
            for_(0, Height, [&] (int row)
            {
                for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
                {
                    auto offset = (row % 2)*2 + (col % 2); //0...3
                    auto rgbIndex = bayerIndex * 3 + offsets[offset];
                    RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
                }
            });
        }
    };
    
    int main()
    {
        vector<float> bayer(Width*Height);
        vector<float> rgb(Width*Height*3);
        ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
        for(auto i=0;i!=4;++i)
        {
            mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
            cout << string(16,'_') << endl;
        }
    }