Tags: c, visual-studio, memcpy, cvi, memory-bandwidth

How to increase performance of memcpy


Summary:

memcpy seems unable to transfer more than 2 GB/sec on my system, in either a real or a test application. What can I do to get faster memory-to-memory copies?

Full details:

As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.

I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate: about 3 GB/sec. With the memcpy enabled, I am limited to about 550 MB/sec (using my current compiler).

In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data (I've posted the code below). I've run this both in the compiler/IDE that I'm using (National Instruments CVI) and in Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.

Visual C++ 2010: 1900 MB/sec

NI CVI 2009: 550 MB/sec

I am not surprised that CVI is significantly slower than Visual Studio, but I am surprised that the memcpy performance is this low. While I'm not sure whether the two are directly comparable, this is far below the EVEREST benchmark bandwidth. I don't need quite that level of performance, but a minimum of 3 GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!

What, if anything, can I do to make memcpy faster in this situation?


Hardware details: AMD Magny-Cours, 4× octo-core (32 cores total), 128 GB DDR3, Windows Server 2003 Enterprise x64

Test program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>  /* malloc, free, rand */
#include <string.h>  /* memcpy */

const size_t NUM_ELEMENTS = 2 * 1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    for(size_t ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = rand();
    }

    QueryPerformanceCounter(&start);

    for(size_t iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    double duration_d = (double)duration / (double) frequency.QuadPart;

    /* NUM_ELEMENTS/1024/1024 * sizeof(unsigned short) is the block size in MB,
       so this is megabytes per second */
    double mb_sec = (ITERATIONS * (NUM_ELEMENTS / 1024 / 1024) * sizeof(unsigned short)) / duration_d;

    printf("Duration: %.5lfs for %d iterations, %.3lfMB/sec\n", duration_d, (int)ITERATIONS, mb_sec);

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?


Solution

  • I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a fixed block size, using the same timing code as above. I had no idea that the performance, especially for this small block size, would scale to this many threads. I suspect this has something to do with the large number of memory controllers (16) on this machine.

    Performance (10000x 4MB block memcpy):
    
     1 thread :  1826 MB/sec
     2 threads:  3118 MB/sec
     3 threads:  4121 MB/sec
     4 threads: 10020 MB/sec
     5 threads: 12848 MB/sec
     6 threads: 14340 MB/sec
     8 threads: 17892 MB/sec
    10 threads: 21781 MB/sec
    12 threads: 25721 MB/sec
    14 threads: 25318 MB/sec
    16 threads: 19965 MB/sec
    24 threads: 13158 MB/sec
    32 threads: 12497 MB/sec
    

    I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

    I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

    #include <windows.h>  /* CreateThread, semaphores */
    #include <string.h>   /* memcpy */

    #define NUM_CPY_THREADS 4
    
    HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
    typedef struct
    {
        int ct;            /* this worker's index */
        const void * src;
        void * dest;
        size_t size;
    } mt_cpy_t;
    
    mt_cpy_t mtParameters[NUM_CPY_THREADS] = {0};
    
    DWORD WINAPI thread_copy_proc(LPVOID param)
    {
        mt_cpy_t * p = (mt_cpy_t *) param;
    
        while(1)
        {
            WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
            memcpy(p->dest, p->src, p->size);
            ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
        }
    
        return 0;
    }
    
    int startCopyThreads(void)
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            mtParameters[ctr].ct = ctr;
            hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParameters[ctr], 0, NULL);
        }
    
        return 0;
    }
    
    void * mt_memcpy(void * dest, const void * src, size_t bytes)
    {
        //set up each thread's slice of the copy
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            mtParameters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
            mtParameters[ctr].src = (const char *) src + ctr * bytes / NUM_CPY_THREADS;
            mtParameters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
        }
    
        //release semaphores to start the copies
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);
    
        //wait for all threads to finish
        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);
    
        return dest;
    }
    
    int stopCopyThreads()
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            //TerminateThread is abrupt; a shutdown flag would be cleaner,
            //but the workers hold no resources here
            TerminateThread(hCopyThreads[ctr], 0);
            CloseHandle(hCopyThreads[ctr]);
            CloseHandle(hCopyStartSemaphores[ctr]);
            CloseHandle(hCopyStopSemaphores[ctr]);
        }
        return 0;
    }
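
    For readers not on Windows, here is a sketch of the same split-copy technique using POSIX threads. It spawns threads per call rather than keeping persistent workers, so it pays thread-creation overhead on every copy and only wins for large blocks; `mt_memcpy_pthreads` and `copy_slice` are names I made up for this example:

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NUM_CPY_THREADS 4

    typedef struct {
        void *dest;
        const void *src;
        size_t size;
    } cpy_slice_t;

    static void *copy_slice(void *arg)
    {
        cpy_slice_t *s = arg;
        memcpy(s->dest, s->src, s->size);
        return NULL;
    }

    /* Spawn-per-call variant of the split copy: each thread handles a
     * contiguous slice, computed the same way as in the Windows version. */
    void *mt_memcpy_pthreads(void *dest, const void *src, size_t bytes)
    {
        pthread_t tid[NUM_CPY_THREADS];
        cpy_slice_t slice[NUM_CPY_THREADS];

        for (int i = 0; i < NUM_CPY_THREADS; i++) {
            size_t begin = (size_t)i * bytes / NUM_CPY_THREADS;
            size_t end = (size_t)(i + 1) * bytes / NUM_CPY_THREADS;
            slice[i].dest = (char *)dest + begin;
            slice[i].src = (const char *)src + begin;
            slice[i].size = end - begin;
            pthread_create(&tid[i], NULL, copy_slice, &slice[i]);
        }
        for (int i = 0; i < NUM_CPY_THREADS; i++)
            pthread_join(tid[i], NULL);
        return dest;
    }

    int main(void)
    {
        size_t n = 4 * 1024 * 1024;   /* 4 MB block, as in the benchmark above */
        char *src = malloc(n), *dst = malloc(n);
        for (size_t i = 0; i < n; i++)
            src[i] = (char)(i * 31);
        mt_memcpy_pthreads(dst, src, n);
        assert(memcmp(dst, src, n) == 0);
        free(src);
        free(dst);
        return 0;
    }
    ```

    The persistent-worker design above avoids the per-call creation cost, which matters when copying thousands of 2 MB buffers per second.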