I am a student currently working on a project that involves writing a certain program in CUDA. I believe the subject of the program is irrelevant to this question, but I should mention that my professor suggested I use unified memory after he saw my attempts to create a CUDA smart pointer class. The class is best described as a "unique pointer to an array" that follows the RAII idiom.
After checking the CUDA 6.0 release notes about Unified Memory (and the updates introduced in CUDA 8.0), I was left in doubt about whether I should switch to unified memory or not.
Q1: I know that CUDA unified memory maps GPU and CPU memory. But what kind of CPU memory are we talking about? Is it pinned memory, which allows faster data transfers? Or is it standard paged system memory?
Q2: I know that the updates introduced in CUDA 8.0 are mostly about the Pascal architecture. But can I expect any acceleration on the Maxwell architecture (compared to host pinned memory)?
Q3: Even though I am just a student, I can see that NVIDIA is putting a lot of work into developing unified memory. One might therefore think that using unified memory is the better idea in the long term. Am I right?
Q4: Is it true that each time I want to access a single element of an array on the host (while the data resides on the device), the whole array will be copied to the host?
(Part of) your original motivation was the possibility of using smart pointers for (global) GPU memory, and your professor suggested using unified memory to that end (although it's not exactly clear to me how that would help). Well, the thing is, you don't have to reinvent the wheel for that - you can already have unique_ptr's for (different kinds of) CUDA GPU memory, as part of the cuda-api-wrappers library.
These unique pointers are actually std::unique_ptr's, but with custom deleters (and you create them with appropriate factory methods). You can find a listing of the methods for creating them on this doxygen page (although the documentation is very partial at this point).
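To give you an idea of what that boils down to, here is a minimal sketch of the general technique - a std::unique_ptr with a custom deleter that calls cudaFree - rather than the library's actual classes or factory functions (device_deleter and make_device_unique below are hypothetical names, used only for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Hypothetical illustration only - NOT the cuda-api-wrappers API.
// A deleter which releases device memory when the unique_ptr is destroyed:
struct device_deleter {
    void operator()(void* p) const noexcept { cudaFree(p); }
};

// A factory which allocates n elements of T on the device and wraps the
// raw pointer in a std::unique_ptr, so the allocation is RAII-managed:
template <typename T>
std::unique_ptr<T[], device_deleter> make_device_unique(std::size_t n) {
    void* raw = nullptr;
    if (cudaMalloc(&raw, n * sizeof(T)) != cudaSuccess) {
        throw std::runtime_error("cudaMalloc failed");
    }
    return std::unique_ptr<T[], device_deleter>(static_cast<T*>(raw));
}
```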
For an example of use, consider the CUDA samples program vectorAdd, which performs an elementwise addition of two vectors to produce a third. Here is the same sample, using smart pointers for both the host and the device memory (and the API wrappers more generally).
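If you'd rather not pull in the library at all, here is a condensed sketch of what such a vectorAdd might look like with RAII pointers just for the device buffers (it assumes the hypothetical make_device_unique helper from the sketch above, not the wrappers' actual API):

```cpp
#include <vector>
// (assumes the device_deleter / make_device_unique sketch from above)

// Elementwise c[i] = a[i] + b[i]
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { c[i] = a[i] + b[i]; }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    // Device buffers are freed automatically when these go out of scope:
    auto d_a = make_device_unique<float>(n);
    auto d_b = make_device_unique<float>(n);
    auto d_c = make_device_unique<float>(n);

    cudaMemcpy(d_a.get(), h_a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b.get(), h_b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    vectorAdd<<<(n + 255) / 256, 256>>>(d_a.get(), d_b.get(), d_c.get(), n);

    cudaMemcpy(h_c.data(), d_c.get(), n * sizeof(float), cudaMemcpyDeviceToHost);
    // No explicit cudaFree calls needed here.
}
```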
Caveat: I'm the author of the API wrapper library, so I'm biased in favor of using it :-)
Q1: What kind of CPU memory are we talking about [for unified memory allocations]? Is it pinned memory... Or... standard paged system memory?
I don't know, but you can easily find out by writing a small program that:
... and profiling it to determine the PCIe bandwidth. With PCIe 3.0 and no intervening traffic, I usually get ~12 GB/sec from pinned memory and about half that from unpinned memory.
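For instance (just a rough sketch of such a test, with an arbitrary buffer size), you could populate a cudaMallocManaged buffer on the host, use it in a kernel, and compare the transfer throughput the profiler reports against plain cudaMemcpy's from pinned (cudaMallocHost) and pageable (malloc) buffers of the same size:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) { p[i] += 1.0f; }
}

int main() {
    const size_t n = 256 * 1024 * 1024 / sizeof(float);    // a 256 MiB buffer

    float* managed = nullptr;
    cudaMallocManaged((void**)&managed, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) { managed[i] = 1.0f; }  // populate on the host

    // Using the buffer in a kernel forces a host-to-device migration; run the
    // program under nvprof/nsys and compare the reported unified-memory
    // transfer throughput with that of cudaMemcpy's from pinned
    // (cudaMallocHost) and pageable (malloc) buffers of the same size.
    touch<<<(unsigned)((n + 255) / 256), 256>>>(managed, n);
    cudaDeviceSynchronize();

    printf("managed[0] = %f\n", managed[0]);
    cudaFree(managed);
}
```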
Q2: ... in CUDA 8.0 ... can I expect acceleration on the Maxwell architecture (with respect to host pinned memory)?
In my very limited experience, the performance of unified memory access on Maxwell cards does not improve in CUDA 8.0 relative to CUDA 6.0 (but there may be under-the-hood changes in prefetching logic or general code optimizations which do show improvements in some cases). Regardless of that, remember that CUDA 6.0 doesn't support sm_52 targets, so your question is a bit moot.
Q3: ... I can see that NVIDIA is putting a lot of work into developing unified memory. Therefore one might think that using unified memory is a better idea in long term perspective. Am I right?
I believe you're wrong. As the CUDA Programming Guide suggests, unified memory is a mechanism intended to simplify memory access and programming; it sacrifices some speed for more uniform, simpler code. While NVIDIA's efforts may reduce the overhead of using it somewhat, there's no mad optimization dash which would make that go away. On Kepler Teslas, using unified memory is typically up to 1.8x-2x slower on various benchmarks; and even though I don't have figures for Maxwell or Pascal, I doubt this will drop so much as to make you prefer using unified memory across the board.
Q4: Is it true that each time I want to access single element of an array on host (while data reside on device) the whole array will be copied to host?
No, managed memory is paged, so only the page containing the element you access will be copied across the PCIe bus. Of course, if the array is small enough to fit in a single page, that page is effectively the entire array.
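You can check this yourself with a small test (again, just a sketch): make the data resident on the device by using it in a kernel, then read one element back on the host and look at the profiler's unified-memory counters to see how many bytes actually moved.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void fill(int* a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) { a[i] = (int)i; }
}

int main() {
    const size_t n = 64 * 1024 * 1024;            // 256 MiB of ints
    int* a = nullptr;
    cudaMallocManaged((void**)&a, n * sizeof(int));

    fill<<<(unsigned)((n + 255) / 256), 256>>>(a, n);
    cudaDeviceSynchronize();                      // data is now resident on the device

    // Touching a single element on the host triggers a CPU page fault; the
    // driver migrates the page(s) containing a[12345], not all 256 MiB.
    // nvprof's unified-memory counters show how much was actually transferred.
    printf("a[12345] = %d\n", a[12345]);

    cudaFree(a);
}
```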