Tags: c++, optimization, graphics, drm

Advice on improving a function's performance


For a project I'm working on, I need a function that copies the contents of one rectangular image into another via their pixel buffers. The function needs to handle clipping at the destination image's edges, since the two images will rarely be the same size.

I'm looking for tips on the fastest way to do this: the function I'm using copies a 720x480 image into a 1920x955 image in just under 1.5 ms. That's fine on its own, but hardly optimal.

#include <algorithm>
#include <cstdint>
#include <cstring>

// Pixel addresses in the destination and source buffers (4 bytes per pixel).
#define coord(x, y) ((void *) (dest + 4 * ((y) * width + (x))))
#define scoord(x, y) ((void *) (src + 4 * ((y) * src_width + (x))))

// Copies src into dest at (x, y), clipping against the destination bounds.
void copy_buffer(uint8_t* dest, int width, int height, uint8_t* src, int src_width, int src_height, int x, int y) {
    // Reject copies that are entirely off-screen or degenerate.
    if (x + src_width < 0 || x >= width || y + src_height < 0 || y >= height || src_width <= 0 || src_height <= 0)
        return;

    // Hoist the clipped horizontal bounds out of the loop.
    const int dest_x = std::max(0, x);
    const int src_x = -std::min(0, x);  // columns clipped off the left edge
    const size_t row_bytes = (std::min(x + src_width, width) - dest_x) * 4;

    // Note: the source row must advance with the destination row (line - y);
    // the original code reread the same source row on every iteration.
    for (int line = std::max(0, y); line < std::min(height, y + src_height); line++)
        memcpy(coord(dest_x, line), scoord(src_x, line - y), row_bytes);
}

Some things I've considered:

  1. Multithreading seems suboptimal for several reasons:

    • Race conditions from simultaneous access to the same memory region,
    • Overhead from spawning and managing separate threads
  2. Using my system's GPU

    • Effectively multithreading
    • Huge overhead for moving and managing data between GPU and CPU
    • Not portable to my target platform
  3. Algorithmic optimisations, such as calculating multi-image bounding boxes and adding loads more code so that only the regions of the image that will actually be visible are rendered

    • While I was planning on doing this anyway, I thought I'd mention it here to ask for further information on how to best achieve this
  4. Using a library/os function to do this for me

    • I'm new-ish to programming on the low level, and especially to performance-oriented programming, so there's always the chance I've missed something.
    • I'm presently not using a multimedia framework like SFML because I'm trying to keep the executable and codebase small, but if that's the best option, so be it.
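On point 3, the clipping logic can be factored into a small rectangle-intersection helper. Here is a minimal sketch; the `Rect` type and its field names are my own invention for illustration, not from the question:

```cpp
#include <algorithm>

// Hypothetical axis-aligned rectangle: top-left corner (x, y) plus size.
struct Rect { int x, y, w, h; };

// Returns the intersection of a and b; a zero width/height means no overlap.
Rect intersect(const Rect& a, const Rect& b) {
    int x0 = std::max(a.x, b.x);
    int y0 = std::max(a.y, b.y);
    int x1 = std::min(a.x + a.w, b.x + b.w);
    int y1 = std::min(a.y + a.h, b.y + b.h);
    return { x0, y0, std::max(0, x1 - x0), std::max(0, y1 - y0) };
}
```

Only the intersected region then needs to be copied, which is what the min/max expressions in `copy_buffer` compute inline; factoring it out makes a multi-image "dirty region" pass easier to add later.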

Whew, bit of a mouthful. I apologise, but I would seriously appreciate any pointers.

Extra notes: I'm writing for/on embedded Linux devices over the DRI/DRM interface.


Edit

As per @Jérôme Richard's comment, some information about my system

Development machine: Dell Inspiron 15 7570, 16 GB RAM, 8-core i7, Ubuntu 21.04

Target machine: Raspberry Pi 3B (1 GB RAM, Broadcom something-or-other), 4 cores @ 1.4 GHz, Ubuntu Server for Pi

Compiler: GCC/G++ 11.2.0


Solution

  • Your code is mostly bound by memory operations, and more specifically by the memcpy, since compilers (like GCC 11) already optimize it aggressively. memcpy is generally implemented very efficiently. That being said, it is sometimes sub-optimal, and I think that is the case here. To understand why, we need to delve into the low-level architecture of mainstream modern processors, and more specifically the cache hierarchy.

    There are two main write policies widely used in mainstream processors: the write-back policy (with write-allocate) and the write-through policy (without write-allocate). Intel processors and the BCM2837 (Cortex-A53) of the Raspberry Pi 3B use a write-back policy with write-allocate. Write-allocation causes the data at a missed write location to be loaded into the cache, followed by a write-hit operation. The consequence is that cache lines written through the last-level cache (LLC) first need to be read from RAM before being written back. This can waste up to half the memory bandwidth!

    To address this issue when big arrays (too big to fit in the LLC) are written to RAM, non-temporal instructions were designed. Such instructions write directly to RAM and bypass the cache hierarchy, so as not to cause cache pollution and to make better use of the memory bandwidth.

    memcpy is generally designed to use non-temporal instructions when a large buffer is copied. However, when many small buffers are copied, memcpy cannot know that the overall set of buffers is too big to fit in the LLC, or that the written data will not be reused soon. In fact, sometimes even developers cannot know, since the size of the computed buffers can depend on user input, and the size of the LLC depends on the user's target machine.

    The bad news is that such instructions can hardly be used from high-level code. ICC supports them through pragma directives, Clang has experimental support via builtins, and AFAIK GCC does not support them yet. OpenMP 5 provides a portable pragma directive for this, but it is not yet implemented. For more information, please read this post.

    On x86-64 processors, you can use the SSE and AVX intrinsics _mm_stream_si128 and _mm256_stream_si256. Note that the pointers need to be aligned in memory (to 16 and 32 bytes, respectively).

    On ARMv8 processors, there is an STNP instruction, but it is only a hint and is not really meant to bypass the cache. Moreover, the Cortex-A53 specification seems to only support non-temporal loads (not non-temporal stores). Put shortly, ARMv8 does not have such instructions.

    The good news is that the very recent ARMv9 instruction set adds new instructions for that. In fact it adds specific instructions for non-temporal copies. You can get the specification here.

    Practical memory benchmarks of the Raspberry Pi 3B indicate a read bandwidth of 2.7 GB/s and a write bandwidth of 2.4 GB/s. Copying a 720x480 3-channel image with write allocation would therefore take about 720 * 480 * 3 * (1/2.4e9 + 2/2.7e9) = 1.2 ms, which is close to your reported execution time. It is also important to mention that mixed read-write DRAM accesses tend to be a bit slower than plain reads or plain writes. If the STNP hint works well, it would result in 0.8 ms (optimal on this hardware). If you want a faster execution time, then you need to work on smaller images.
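    The estimate above can be sketched as a small helper (the 3-channel size and the 2.4/2.7 GB/s figures are the ones used above; the function name is mine):

```cpp
// Estimated time to copy `bytes` on a write-allocate cache: each byte costs
// one write, one source read, and one write-allocate read of the destination.
double estimated_copy_seconds(double bytes, double read_bw, double write_bw) {
    return bytes * (1.0 / write_bw + 2.0 / read_bw);
}
// estimated_copy_seconds(720 * 480 * 3, 2.7e9, 2.4e9) is 1.2e-3, i.e. 1.2 ms.
```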

    On your Dell machine, with an Intel Kaby Lake processor and likely 1~2 channels of 2400 MHz DDR4 memory, the practical throughput should be 12~35 GiB/s, resulting in 0.1~0.3 ms to copy the image with the basic memcpy and 0.1~0.2 ms with non-temporal instructions. Moreover, the LLC is much bigger, so the operation should be even faster with the basic memcpy as long as the image fits in the LLC (certainly less than 0.1 ms).