c++image-processing optimization icc binning

How to perform fast software binning of an image in C++?

I am trying to perform fast binning of a 2D image (stored in row-major order in 1D array). The image is 12 bit, Thus, I have used the uint16_t data type.

What binning means is if done 2x2, then 2 pixels in each direction (x and y) becomes one pixel, or essentially the one super pixel contains the mean of those 4 pixels.

I have written the following code to calculate the mean of pixels and store them in superpixel while avoiding integer overflow. However, The divide by "DIVISIONFACTOR" to avoid the overflow slows the function down as the code has to perform more divisions (which are expensive compared to other arithmetic operations). The code runs significantly faster if I add values of all pixels (thus potential overflow can happen if the sum of pixels exceeds 65535) and then divide the sum for a superpixel by "DIVISIONFACTOR" only once.

To be more precise, The code runs at 0.75 ms with the overflow-avoiding code and at 0.13 ms with code that has the potential for overflow (owing to fewer division operations).

Please suggest ways to optimize further. I am using intel C++ compiler, please don't shy away from suggesting any Intel specific technique.

Thank you in advance.

Regards,

Harsh

Image Definitions: WIDTH and HEIGHT are actual image dimensions received from Camera and NX and NY are final dimensions.

#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

Code without overflow:

int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j< NY; j++) {
        for (i=0;i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k<i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l<j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * HEIGHT + k] / DIVISIONFACTOR);
                }
            }
        }
    }

    return 0;
}

Code with potential of overflow:

int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j< NY; j++) {
        for (i=0;i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k<i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l<j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += image[l * HEIGHT + k];
                }
            }
            binnedImage[j * NX + i] /= DIVISIONFACTOR;
        }
    }

    return 0;
}

Full executable code:

#include <iostream>
#include <chrono>
#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

using namespace std;

int binning(uint16_t* image, uint16_t* binnedImage);


int main() {
    chrono::high_resolution_clock::time_point t0, t1;
    chrono::duration<double> dt;

    double totalTime = 0;
    uint64_t count = 0;
    uint32_t i;
    uint16_t *image = (uint16_t*) malloc(sizeof(uint16_t) * WIDTH * HEIGHT);

    for (i=0; i < WIDTH * HEIGHT; i++) {
        image[i] = 4095;
    }

    uint16_t *binnedIimage = (uint16_t*) calloc(sizeof(uint16_t), NX * NY);

    while(1) {
        t0 = chrono::high_resolution_clock::now();

        binning(image, binnedIimage);

        t1 = chrono::high_resolution_clock::now();

        dt = chrono::duration_cast<chrono::duration<double>>(t1 - t0);

        totalTime += dt.count();

        count += 1;

        if (count % 3000 == 0) {
            cout<<"Time (ms): "<< totalTime * 1000 / count<<"\r";
            cout.flush();
        }
    }

    return 0;
}


int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j< NY; j++) {
        for (i=0;i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k<i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l<j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * HEIGHT + k]/ DIVISIONFACTOR);
                }
            }
        }
    }

    return 0;
}

Solution

If you don't need to keep the original image content, doing one iteration doing horizontal binning followed by another iteration doing a vertical binning can be provide a performance boost of > 50%. (Even if the source cannot hold sufficient large values, you could use a separately allocated array to store intermediate results; likely you can determine the maximum array size before starting image conversions, so you don't need to allocate memory during the run of the function itself.)

Source

 1  2  3 | 10 11 12 | ...
 4  5  6 | 13 14 15 | ...
 7  8  9 | 16 17 28 | ...
--------------------------- ...
...

Step 1

 6 | 33 | ...
15 | 42 | ...
24 | 51 | ...
--------------------------- ...
...

Step 2

45 | 126 | ...
--------------------------- ...
...

Additional recommendations

Use constexpr variables instead of preprocessor defines.
Add static_asserts, to make sure overflow doesn't happen. Also add static_asserts for stuff that could go wrong when modifying the code without much of a thought like choosing a WIDTH that isn't divisible by NX.
Use a benchmarking framework instead of manually doing the timing. Otherwise aggressive optimization could get rid of the binning call alltogether, the result is never read. Also doing the timing yourself is hard, since the compiler is allowed to reorder your logic as long as the observable outcome is the same.
Use the fast versions of integral types for intermediate results.

Here is a implementation using google benchmark.

#include <benchmark/benchmark.h>

#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <type_traits>
#include <memory>

namespace
{
struct Deallocator
{
    void operator()(void* mem) const
    {
        std::free(mem);
    }
};

constexpr unsigned WIDTH = 768;
constexpr unsigned HEIGHT = 768;

constexpr unsigned NX = 256;
constexpr unsigned NY = 256;

static_assert(WIDTH % NX == 0);
static_assert(HEIGHT % NY == 0);

constexpr auto BINNINGFACTORWIDTH = WIDTH / NX;
constexpr auto BINNINGFACTORHEIGHT = HEIGHT / NY;

constexpr auto DIVISIONFACTOR = BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT;

class ImageAllocationFixture : public benchmark::Fixture
{
protected:
    std::unique_ptr<uint16_t[], Deallocator> image;
    std::unique_ptr<uint16_t[], Deallocator> binnedImage;
public:
    void SetUp(const ::benchmark::State& state)
    {
        image.reset(static_cast<uint16_t*>(std::malloc(sizeof(uint16_t) * WIDTH * HEIGHT)));

        for (unsigned i = 0; i < WIDTH * HEIGHT; ++i) {
            image[i] = 4095;
        }

        binnedImage.reset(static_cast<uint16_t*>(calloc(sizeof(uint16_t), NX * NY)));
    }

    void TearDown(const ::benchmark::State& state)
    {
        image.reset();
        binnedImage.reset();
    }
};

}

namespace with_overflow
{

void binning(uint16_t* image, uint16_t* binnedImage) {
    int i, j, k, l;

    for (j = 0; j < NY; j++) {
        for (i = 0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k = i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l = j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += image[l * HEIGHT + k];
                }
            }
            binnedImage[j * NX + i] /= DIVISIONFACTOR;
        }
    }

}

}

namespace without_overflow
{

void binning(uint16_t* image, uint16_t* binnedImage) {
    int i, j, k, l;

    for (j = 0; j < NY; j++) {
        for (i = 0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k = i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l = j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * HEIGHT + k] / DIVISIONFACTOR);
                }
            }
        }
    }
}

}

namespace bin_separately
{


void binning(uint16_t* const image, uint16_t* const binnedImage)
{
    // just some stuff for static_assert
    using PixelValueType = std::remove_cvref_t<decltype(*image)>;

    constexpr PixelValueType AllOnes = ~static_cast<PixelValueType>(0);
    constexpr unsigned BitCount = 12;
    constexpr uint64_t PixelInValueMax = static_cast<PixelValueType>(~(AllOnes << BitCount)); // use 64 bit to prevent overflow issues
    constexpr uint64_t PixelTypeMax = (std::numeric_limits<PixelValueType>::max)();
    // end static_assert stuff


    {
        // compress horizontally
        static_assert(PixelInValueMax * BINNINGFACTORWIDTH <= PixelTypeMax,
            "cannot compress horizontally without risking overflow");

        auto out = image;
        for (auto inPos = image, end = image + WIDTH * HEIGHT; inPos != end;)
        {
            uint_fast16_t sum = 0;
            for (unsigned i = 0; i != BINNINGFACTORWIDTH; ++i)
            {
                sum += *(inPos++);
            }
            *(out++) = sum;
        }
    }

    {
        // compress vertically, divide and write to out

        //read pointers
        uint16_t* inPoss[BINNINGFACTORHEIGHT];
        for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
        {
            inPoss[i] = image + (NX * i);
        }

        for (auto out = binnedImage, end = binnedImage + NX * NY; out != end;) // for all output rows
        {
            for (auto const rowEnd = out + NX; out != rowEnd;)
            {
                uint_fast16_t sum = 0;

                static_assert(PixelInValueMax * BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT <= (std::numeric_limits<decltype(sum)>::max)(),
                    "type of sum needs replacement, since it cannot hold the result of adding up all source pixels for one target pixel");

                for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
                {
                    sum += *(inPoss[i]++);
                }
                *(out++) = sum / DIVISIONFACTOR;
            }

            // we advanced each pointer by one row -> advance by (BINNINGFACTORHEIGHT - 1) more
            for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
            {
                inPoss[i] += NX * (BINNINGFACTORHEIGHT - 1);
            }
        }
    }

}

}

BENCHMARK_F(ImageAllocationFixture, WithOverflow)(benchmark::State& st)
{
    for (auto _ : st)
    {
        with_overflow::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_F(ImageAllocationFixture, WithoutOverflow)(benchmark::State& st)
{
    for (auto _ : st)
    {
        without_overflow::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_F(ImageAllocationFixture, BinSeparately)(benchmark::State& st)
{
    for (auto _ : st)
    {
        bin_separately::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_MAIN();

Output for my compiler/machine (MSVC 19.34.31937 x86_64, /O2)

Run on (12 X 3593 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 512 KiB (x6)
  L3 Unified 16384 KiB (x2)
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
ImageAllocationFixture/WithOverflow        996287 ns       941685 ns          896
ImageAllocationFixture/WithoutOverflow    1171365 ns      1098633 ns          640
ImageAllocationFixture/BinSeparately       350364 ns       353021 ns         2036