Something that's been confusing me for a while is the alignment requirement of CUDA memory allocations. I know that if the rows are properly aligned, accessing row elements will be much more efficient.
First a little background:
According to the CUDA C Programming Guide (section 5.3.2):
Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.
My understanding is that for a 2D interleaved array of type T (say, pixel values in R,G,B order), if numChannels * sizeof(T) is 4, 8, or 16, then the array has to be allocated using cudaMallocPitch if performance is a necessity. So far this has been working fine for me: I check numChannels * sizeof(T) before allocating a 2D array, and if it is 4, 8, or 16, I allocate it using cudaMallocPitch and everything works.
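To make the background concrete, here is a minimal sketch of such an allocation; the 512x512 four-channel float image is just a made-up example:
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    // Made-up example: 512x512 pixels, 4 float channels per pixel,
    // so numChannels * sizeof(float) = 16 bytes per pixel.
    const int width = 512;
    const int height = 512;
    const size_t rowBytes = width * 4 * sizeof(float);

    float *devPtr = NULL;
    size_t pitch = 0; // row stride in bytes, chosen by the runtime

    // cudaMallocPitch pads each row so that every row starts at a
    // suitably aligned address; pitch >= rowBytes.
    if (cudaMallocPitch((void**)&devPtr, &pitch, rowBytes, height) != cudaSuccess)
    {
        printf("allocation failed\n");
        return 1;
    }
    printf("requested %zu bytes per row, got pitch %zu\n", rowBytes, pitch);

    // Row y starts at (char*)devPtr + y * pitch, not at devPtr + y * rowBytes.
    cudaFree(devPtr);
    return 0;
}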
Now the question:
I've realized that when using NVIDIA's NPP library, there is a family of allocator functions (nppiMalloc..., like nppiMalloc_32f_C1 and so on). NVIDIA recommends using these functions for performance. My question is: how do these functions guarantee the alignment? More specifically, what kind of math do they use to come up with a suitable value for pitch?
For a single-channel 512x512 pixel image (with float pixel values in the range [0, 1]) I've used both cudaMallocPitch and nppiMalloc_32f_C1. cudaMallocPitch gave me a pitch value of 2048 while nppiMalloc_32f_C1 gave me 2560. Where is the latter number coming from, and how exactly is it computed?
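For reference, this is essentially how I compared the two (error checking omitted for brevity):
#include <stdio.h>
#include <cuda_runtime.h>
#include <npp.h>

int main()
{
    const int width = 512, height = 512;

    // Allocation via the CUDA runtime.
    float *cudaPtr = NULL;
    size_t cudaPitch = 0;
    cudaMallocPitch((void**)&cudaPtr, &cudaPitch, width * sizeof(float), height);

    // Allocation via NPP; the step (pitch) is reported in bytes.
    int nppStep = 0;
    Npp32f *nppPtr = nppiMalloc_32f_C1(width, height, &nppStep);

    // On my machine this prints 2048 for CUDA and 2560 for NPP.
    printf("cudaMallocPitch pitch:   %zu\n", cudaPitch);
    printf("nppiMalloc_32f_C1 pitch: %d\n", nppStep);

    cudaFree(cudaPtr);
    nppiFree(nppPtr);
    return 0;
}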
Why I care about this
I'm writing a synced-memory class template for synchronizing values between the GPU and the CPU. This class is supposed to take care of allocating pitched memory (if possible) under the hood. Since I want this class to be interoperable with NVIDIA's NPP, I'd like to handle all allocations in a way that provides good performance for CUDA kernels as well as NPP operations.
My impression was that nppiMalloc was calling cudaMallocPitch under the hood, but it seems that I'm wrong.
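Just to give an idea of the context, the wrapper I have in mind looks roughly like this; the class name and methods are placeholders, and error handling and copy control are omitted:
#include <cuda_runtime.h>

// Rough sketch: owns a host buffer and a pitched device buffer,
// and copies between them on demand.
template <typename T>
class SyncedMemory2D {
public:
    SyncedMemory2D(int width, int height)
        : width_(width), height_(height)
    {
        hostPtr_ = new T[static_cast<size_t>(width) * height];
        cudaMallocPitch((void**)&devPtr_, &pitch_, width * sizeof(T), height);
    }

    ~SyncedMemory2D()
    {
        delete[] hostPtr_;
        cudaFree(devPtr_);
    }

    // Copy host -> device, honoring the device pitch.
    void toDevice()
    {
        cudaMemcpy2D(devPtr_, pitch_, hostPtr_, width_ * sizeof(T),
                     width_ * sizeof(T), height_, cudaMemcpyHostToDevice);
    }

    // Copy device -> host.
    void toHost()
    {
        cudaMemcpy2D(hostPtr_, width_ * sizeof(T), devPtr_, pitch_,
                     width_ * sizeof(T), height_, cudaMemcpyDeviceToHost);
    }

    size_t pitch() const { return pitch_; }

private:
    int width_, height_;
    size_t pitch_ = 0;
    T *hostPtr_ = nullptr;
    T *devPtr_ = nullptr;
};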
An interesting question. However, there may be no definitive answer at all, for several reasons: The implementation of these methods is not publicly available, so one has to assume that NVIDIA uses some special tricks and tweaks internally. Moreover, the resulting pitch is not specified, so one has to assume that it might change between releases of CUDA/NPP. In particular, it's not unlikely that the actual pitch depends on the hardware version (the "Compute Capability") of the device that the method is executed on.
Nevertheless, I was curious about this and wrote the following test:
#include <stdio.h>
#include <npp.h>

// Allocates 1-pixel-high images of increasing width with the given NPP
// allocator, and prints the largest width (and line size in bytes) for
// which each step ("pitch") value was used.
template <typename T>
void testStepBytes(const char* name, int elementSize, int numComponents,
    T (*allocator)(int, int, int*))
{
    printf("%s\n", name);
    int dw = 1;
    int prevStepBytes = 0;
    for (int w = 1; w < 2050; w += dw)
    {
        int stepBytes;
        void *p = allocator(w, 1, &stepBytes);
        nppiFree(p);
        if (stepBytes != prevStepBytes)
        {
            printf("Stride %5d is used up to w=%5d (%6d bytes)\n",
                prevStepBytes, (w - dw), (w - dw) * elementSize * numComponents);
            prevStepBytes = stepBytes;
        }
    }
}

int main(int argc, char *argv[])
{
    testStepBytes("nppiMalloc_8u_C1",  1, 1, &nppiMalloc_8u_C1);
    testStepBytes("nppiMalloc_8u_C2",  1, 2, &nppiMalloc_8u_C2);
    testStepBytes("nppiMalloc_8u_C3",  1, 3, &nppiMalloc_8u_C3);
    testStepBytes("nppiMalloc_8u_C4",  1, 4, &nppiMalloc_8u_C4);
    testStepBytes("nppiMalloc_16u_C1", 2, 1, &nppiMalloc_16u_C1);
    testStepBytes("nppiMalloc_16u_C2", 2, 2, &nppiMalloc_16u_C2);
    testStepBytes("nppiMalloc_16u_C3", 2, 3, &nppiMalloc_16u_C3);
    testStepBytes("nppiMalloc_16u_C4", 2, 4, &nppiMalloc_16u_C4);
    testStepBytes("nppiMalloc_32f_C1", 4, 1, &nppiMalloc_32f_C1);
    testStepBytes("nppiMalloc_32f_C2", 4, 2, &nppiMalloc_32f_C2);
    testStepBytes("nppiMalloc_32f_C3", 4, 3, &nppiMalloc_32f_C3);
    testStepBytes("nppiMalloc_32f_C4", 4, 4, &nppiMalloc_32f_C4);
    return 0;
}
The pitch (stepBytes) seemed to depend solely on the width of the image. So this program allocates memory for images of different types, with increasing widths, and prints information about the maximum image sizes that result in a particular stride. The intention was to derive a pattern or a rule, namely the "kind of math" that you asked about.
The results are ... somewhat confusing. For example, for the nppiMalloc_32f_C1 call, on my machine (CUDA 6.5, GeForce GTX 560 Ti, Compute Capability 2.1), it prints:
nppiMalloc_32f_C1
Stride 0 is used up to w= 0 ( 0 bytes)
Stride 512 is used up to w= 120 ( 480 bytes)
Stride 1024 is used up to w= 248 ( 992 bytes)
Stride 1536 is used up to w= 384 ( 1536 bytes)
Stride 2048 is used up to w= 504 ( 2016 bytes)
Stride 2560 is used up to w= 640 ( 2560 bytes)
Stride 3072 is used up to w= 768 ( 3072 bytes)
Stride 3584 is used up to w= 896 ( 3584 bytes)
Stride 4096 is used up to w= 1016 ( 4064 bytes)
Stride 4608 is used up to w= 1152 ( 4608 bytes)
Stride 5120 is used up to w= 1280 ( 5120 bytes)
Stride 5632 is used up to w= 1408 ( 5632 bytes)
Stride 6144 is used up to w= 1536 ( 6144 bytes)
Stride 6656 is used up to w= 1664 ( 6656 bytes)
Stride 7168 is used up to w= 1792 ( 7168 bytes)
Stride 7680 is used up to w= 1920 ( 7680 bytes)
Stride 8192 is used up to w= 2040 ( 8160 bytes)
confirming that for an image with width=512, it will use a stride of 2560. The expected stride of 2048 would only be used for images up to width=504.
The numbers seemed a bit odd, so I ran another test for nppiMalloc_8u_C1 in order to cover all possible image line sizes (in bytes), with larger image sizes, and noticed a strange pattern: The first increase of the pitch size (from 512 to 1024) occurred when the image was larger than 480 bytes, and 480 = 512 - 32. The next step (from 1024 to 1536) occurred when the image was larger than 992 bytes, and 992 = 480 + 512. The next step (from 1536 to 2048) occurred when the image was larger than 1536 bytes, and 1536 = 992 + 512 + 32. From there, it seemed to mostly proceed in steps of 512, except for several sizes in between. The further steps are summarized here:
nppiMalloc_8u_C1
Stride 0 is used up to w= 0 ( 0 bytes, delta 0)
Stride 512 is used up to w= 480 ( 480 bytes, delta 480)
Stride 1024 is used up to w= 992 ( 992 bytes, delta 512)
Stride 1536 is used up to w= 1536 ( 1536 bytes, delta 544)
Stride 2048 is used up to w= 2016 ( 2016 bytes, delta 480) \
Stride 2560 is used up to w= 2560 ( 2560 bytes, delta 544) | 4
Stride 3072 is used up to w= 3072 ( 3072 bytes, delta 512) |
Stride 3584 is used up to w= 3584 ( 3584 bytes, delta 512) /
Stride 4096 is used up to w= 4064 ( 4064 bytes, delta 480) \
Stride 4608 is used up to w= 4608 ( 4608 bytes, delta 544) |
Stride 5120 is used up to w= 5120 ( 5120 bytes, delta 512) |
Stride 5632 is used up to w= 5632 ( 5632 bytes, delta 512) | 8
Stride 6144 is used up to w= 6144 ( 6144 bytes, delta 512) |
Stride 6656 is used up to w= 6656 ( 6656 bytes, delta 512) |
Stride 7168 is used up to w= 7168 ( 7168 bytes, delta 512) |
Stride 7680 is used up to w= 7680 ( 7680 bytes, delta 512) /
Stride 8192 is used up to w= 8160 ( 8160 bytes, delta 480) \
Stride 8704 is used up to w= 8704 ( 8704 bytes, delta 544) |
Stride 9216 is used up to w= 9216 ( 9216 bytes, delta 512) |
Stride 9728 is used up to w= 9728 ( 9728 bytes, delta 512) |
Stride 10240 is used up to w= 10240 ( 10240 bytes, delta 512) |
Stride 10752 is used up to w= 10752 ( 10752 bytes, delta 512) |
Stride 11264 is used up to w= 11264 ( 11264 bytes, delta 512) |
Stride 11776 is used up to w= 11776 ( 11776 bytes, delta 512) | 16
Stride 12288 is used up to w= 12288 ( 12288 bytes, delta 512) |
Stride 12800 is used up to w= 12800 ( 12800 bytes, delta 512) |
Stride 13312 is used up to w= 13312 ( 13312 bytes, delta 512) |
Stride 13824 is used up to w= 13824 ( 13824 bytes, delta 512) |
Stride 14336 is used up to w= 14336 ( 14336 bytes, delta 512) |
Stride 14848 is used up to w= 14848 ( 14848 bytes, delta 512) |
Stride 15360 is used up to w= 15360 ( 15360 bytes, delta 512) |
Stride 15872 is used up to w= 15872 ( 15872 bytes, delta 512) /
Stride 16384 is used up to w= 16352 ( 16352 bytes, delta 480) \
Stride 16896 is used up to w= 16896 ( 16896 bytes, delta 544) |
Stride 17408 is used up to w= 17408 ( 17408 bytes, delta 512) |
... ... 32
Stride 31232 is used up to w= 31232 ( 31232 bytes, delta 512) |
Stride 31744 is used up to w= 31744 ( 31744 bytes, delta 512) |
Stride 32256 is used up to w= 32256 ( 32256 bytes, delta 512) /
Stride 32768 is used up to w= 32736 ( 32736 bytes, delta 480) \
Stride 33280 is used up to w= 33280 ( 33280 bytes, delta 544) |
Stride 33792 is used up to w= 33792 ( 33792 bytes, delta 512) |
Stride 34304 is used up to w= 34304 ( 34304 bytes, delta 512) |
... ... 64
Stride 64512 is used up to w= 64512 ( 64512 bytes, delta 512) |
Stride 65024 is used up to w= 65024 ( 65024 bytes, delta 512) /
Stride 65536 is used up to w= 65504 ( 65504 bytes, delta 480) \
Stride 66048 is used up to w= 66048 ( 66048 bytes, delta 544) |
Stride 66560 is used up to w= 66560 ( 66560 bytes, delta 512) |
Stride 67072 is used up to w= 67072 ( 67072 bytes, delta 512) |
.... ... 128
Stride 130048 is used up to w=130048 (130048 bytes, delta 512) |
Stride 130560 is used up to w=130560 (130560 bytes, delta 512) /
Stride 131072 is used up to w=131040 (131040 bytes, delta 480) \
Stride 131584 is used up to w=131584 (131584 bytes, delta 544) |
Stride 132096 is used up to w=132096 (132096 bytes, delta 512) |
... | guess...
There obviously is a pattern. The pitches are related to multiples of 512. For sizes of 512*2^n, with n being a whole number, there are some odd -32 and +32 offsets in the size limits that cause a larger pitch to be used.
Maybe I'll have another look at this. I'm pretty sure that one could derive a formula covering this odd progression of the pitch. But again: this may depend on the underlying CUDA version, the NPP version, or even the Compute Capability of the card that is used.
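For what it's worth, here is a guessed formula that reproduces the step values listed above. To be clear: this is only reverse-engineered from the printed output, it is not necessarily what NPP actually computes, and it may well be wrong for other CUDA/NPP versions or devices:
#include <stdio.h>

// Returns nonzero when x is a power of two.
static int isPowerOfTwo(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

// Speculative reconstruction of the observed NPP pitch: round the line
// size up to a multiple of 512, but when that multiple is 512*2^n and
// the line size is within 32 bytes of it, skip to the next multiple.
static size_t guessNppPitch(size_t lineSizeBytes)
{
    size_t pitch = ((lineSizeBytes + 511) / 512) * 512;
    if (isPowerOfTwo(pitch / 512) && lineSizeBytes > pitch - 32)
        pitch += 512;
    return pitch;
}

int main()
{
    // Spot checks against the measured values: a 512-pixel float line
    // is 2048 bytes and indeed maps to a pitch of 2560, while 504
    // pixels (2016 bytes) map to 2048.
    printf("%zu\n", guessNppPitch(512 * 4)); // 2560
    printf("%zu\n", guessNppPitch(504 * 4)); // 2048
    printf("%zu\n", guessNppPitch(480));     // 512
    printf("%zu\n", guessNppPitch(481));     // 1024
    printf("%zu\n", guessNppPitch(1536));    // 1536
    return 0;
}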
And, just for completeness: It might also be the case that this strange pitch size simply is a bug in NPP. You never know.