openmp intel-mic

Memory transfer overhead to and from an Intel MIC


I'm observing strange behavior and would like to know whether it is specific to the Intel Xeon Phi.

I have a small example program: the textbook matrix multiplication (three nested for loops). I offload the computation to an Intel MIC with the OpenMP 4.0 target pragma and map the three matrices with map(to:A,B) map(tofrom:C).
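For concreteness, a kernel of this kind might look like the sketch below. The original code is not shown in the post, so the function name matmul_offload, the flat row-major storage, and the double element type are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical helper (not taken from the post): C = A * B for n x n
   row-major matrices, offloaded with the OpenMP 4.0 target construct.
   Without an offload-capable compiler the region simply runs on the host. */
void matmul_offload(const double *A, const double *B, double *C, int n)
{
    assert(n > 0);

    /* A and B are only read on the device; C is copied in and back out,
       mirroring map(to:A,B) map(tofrom:C) from the description above. */
    #pragma omp target map(to: A[0:n*n], B[0:n*n]) map(tofrom: C[0:n*n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

Dropping the target line gives the native variant used for comparison: same loops, same parallelisation, no mapping.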

Now, what I am observing is that for small matrices, e.g. 1024x1024, the memory transfer takes extremely long. Compared to the native version (same code, same parallelisation strategy, just no offloading), the offload version takes about 320 ms longer. I did a warm-up run of the code to remove initialization overhead.

On an Nvidia Tesla K20, copying the same amount of memory adds no noticeable delay, so these 320 ms look very bad by comparison.

Are there some environment settings that may improve the memory transfer speed?

An additional question: I enabled offload reporting via the OFFLOAD_REPORT environment variable. What is the difference between the two timing results shown in the report?

[Offload] [HOST]  [Tag 5] [CPU Time]        26.995279(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data]   3221225480 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time]        16.859548(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data]   1073741824 (bytes)

What accounts for the roughly 10 seconds by which MIC Time falls short of CPU Time? Is it the memory transfer?
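The arithmetic behind this question can be sketched as follows. It assumes that CPU Time is measured on the host around the whole offload and that MIC Time covers only execution on the coprocessor, which is how Intel's offload report documentation appears to describe the two fields. The helper names are hypothetical. With the numbers from the report, the gap is about 10.14 s for about 4 GiB moved in total, so if the entire gap were transfer time the effective rate would be roughly 0.39 GiB/s.

```c
#include <assert.h>

/* Hypothetical helper: host-side time that MIC Time does not account for. */
double offload_gap_seconds(double cpu_time_s, double mic_time_s)
{
    assert(cpu_time_s >= mic_time_s);
    return cpu_time_s - mic_time_s;
}

/* Hypothetical helper: effective rate in GiB/s under the assumption that
   the whole gap was spent moving bytes_in + bytes_out across the bus. */
double implied_rate_gib_per_s(double bytes_in, double bytes_out, double gap_s)
{
    assert(gap_s > 0.0);
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return (bytes_in + bytes_out) / gap_s / gib;
}
```

Calling offload_gap_seconds(26.995279, 16.859548) and passing the result with the two byte counts (3221225480 and 1073741824) to implied_rate_gib_per_s reproduces the figures above. The gap may also contain allocation and setup costs, so the rate is a lower bound on the actual transfer speed, not a measurement of it.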

A third question: is it possible to use pinned memory with Intel MICs? If yes, how?


Solution

  • It is possibly the memory allocation on the MIC that is taking the time. Try to separate the three sources of overhead (device initialization, memory allocation, and the data transfer itself) to better understand where the time goes:

    // Device initialization
    #pragma offload_transfer target(mic)
    ...
    // Memory allocation and first data transfer
    // This is expected to have overhead proportional to the amount of memory allocated
    // Doing at least one transfer will speed up subsequent transfers
    #pragma offload_transfer target(mic) in(p[0:SIZE] : alloc_if(1) free_if(0))
    
    ...
    // This transfer should be faster
    // For large sizes, approaching 6 GiB/s
    #pragma offload_transfer target(mic) in(p[0:SIZE] : alloc_if(0) free_if(0))
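To put numbers on the three stages above, each pragma can be wrapped in a wall-clock timer. The following is a minimal sketch, not a definitive benchmark: the function names and the times array are hypothetical, the first three pragmas are copied from the snippet above, and the final nocopy pragma that frees the device buffer is an addition. It needs the Intel compiler with offload support to be meaningful; other compilers ignore the pragmas and report near-zero times.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock seconds using C11 timespec_get. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical harness: times[0] = device initialization,
   times[1] = allocation plus first transfer,
   times[2] = repeat transfer into the already allocated buffer. */
void time_offload_stages(size_t size, double times[3])
{
    assert(size > 0);
    char *p = malloc(size);
    assert(p != NULL);
    memset(p, 0, size);  /* touch the pages so host page faults are not timed */

    double t0 = wall_seconds();
    #pragma offload_transfer target(mic)
    double t1 = wall_seconds();
    #pragma offload_transfer target(mic) in(p[0:size] : alloc_if(1) free_if(0))
    double t2 = wall_seconds();
    #pragma offload_transfer target(mic) in(p[0:size] : alloc_if(0) free_if(0))
    double t3 = wall_seconds();

    /* Release the device buffer without copying anything (not timed). */
    #pragma offload_transfer target(mic) nocopy(p : length(size) alloc_if(0) free_if(1))

    times[0] = t1 - t0;
    times[1] = t2 - t1;
    times[2] = t3 - t2;
    free(p);
}
```

Dividing size by times[2] gives the steady-state transfer rate, and the difference between times[1] and times[2] approximates the allocation cost for that buffer size.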