Using only the driver API, for example, I have a profiling result from a single process below (using cuCtxCreate); the cuCtxCreate overhead is nearly comparable to copying 300 MB of data to/from the GPU:
In the CUDA documentation here, it says (for cuDevicePrimaryCtxRetain): "Retains the primary context on the device, creating it **if necessary**." Is this the expected behavior for repeated invocations of the same process from the command line (such as running a process 1000 times to explicitly process 1000 different input images)? Does the device need CU_COMPUTEMODE_EXCLUSIVE_PROCESS for this to work as intended (re-using the same context across multiple invocations)?
For now, the profiling image above stays the same even if I run that process multiple times. Even without using the profiler, timings show around 1 second of completion time.
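Here is roughly what I mean by the per-process setup, rewritten to use the primary-context calls instead of cuCtxCreate (a minimal sketch, not my actual code; error handling is reduced to a macro):

```c
#include <cuda.h>
#include <stdio.h>

#define CHECK(call)                                              \
    do {                                                         \
        CUresult err = (call);                                   \
        if (err != CUDA_SUCCESS) {                               \
            const char *name = NULL;                             \
            cuGetErrorName(err, &name);                          \
            fprintf(stderr, "%s failed: %s\n", #call, name);     \
            return 1;                                            \
        }                                                        \
    } while (0)

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));

    /* Retains (and creates if necessary) the device's primary context,
       instead of cuCtxCreate() which always builds a brand-new context. */
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    /* ... cuModuleLoadData / cuMemAlloc / kernel launches go here ... */

    CHECK(cuCtxSetCurrent(NULL));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```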
Edit: According to the documentation, the primary context is one per device per process. Does this mean there won't be a problem when using a single multi-threaded application? What is the re-use time limit for a primary context? Is 1 second between processes okay, or does it have to be milliseconds to keep the primary context alive?
I'm already caching the PTX code in a file, so the only remaining overhead seems to be cuMemAlloc(), malloc() and cuMemHostRegister(); re-using the latest context from the previous invocation of the same process would therefore improve the timings nicely.
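For reference, the remaining per-run setup looks roughly like this (a sketch only; the cache file name and the setup() helper are placeholders, and a current context is assumed to already be set):

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Load the whole PTX file into a NUL-terminated buffer. */
static char *load_ptx(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = (char *)malloc((size_t)n + 1);
    fread(buf, 1, (size_t)n, f);
    buf[n] = '\0';
    fclose(f);
    return buf;
}

/* Assumes a current context is already set (see the earlier sketch). */
int setup(CUmodule *mod, CUdeviceptr *dbuf, void *hbuf, size_t bytes)
{
    char *ptx = load_ptx("kernel_cache.ptx");   /* hypothetical cache file */
    if (!ptx) return -1;

    if (cuModuleLoadData(mod, ptx) != CUDA_SUCCESS) return -1;  /* JIT from cached PTX */
    free(ptx);

    if (cuMemAlloc(dbuf, bytes) != CUDA_SUCCESS) return -1;     /* device allocation */

    /* Page-lock the existing host buffer so copies can use DMA. */
    if (cuMemHostRegister(hbuf, bytes, 0) != CUDA_SUCCESS) return -1;
    return 0;
}
```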
Edit-2: The documentation for cuDevicePrimaryCtxRetain says: "The caller must call cuDevicePrimaryCtxRelease() when done using the context." Is the caller here any process? Can I just call retain in the first launched process and release in the last launched process of a list of hundreds of sequentially launched processes? Does the system need a reset if the last process couldn't be launched and cuDevicePrimaryCtxRelease was never called?
Edit-3:
Is the primary context intended for this usage pattern?
process-1: retain (creates)
process-2: retain (re-uses)
...
process-99: retain (re-uses)
process-100: 1 x retain and 100 x release (to decrement the counter to zero and unload at the end)
Is cuDevicePrimaryCtxRetain() used to keep persistent CUDA context objects across multiple processes?
No. It is intended to allow the driver API to bind to a context that a library using the runtime API has already lazily created, nothing more than that. Once upon a time it was necessary to create contexts with the driver API and then have the runtime bind to them. Now, with these APIs, you don't have to do that. You can, for example, see how this is done in Tensorflow here.
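As a rough illustration of that pattern (the function names and device ordinal below are arbitrary, not TensorFlow's actual code): the runtime-API side lazily creates the primary context, and a driver-API routine then simply retains and uses the same one.

```c
#include <cuda.h>
#include <cuda_runtime.h>

/* Driver-API helper that piggybacks on the caller's primary context. */
void driver_api_work(int device_ordinal)
{
    CUdevice dev;
    CUcontext primary;

    cuInit(0);
    cuDeviceGet(&dev, device_ordinal);

    /* The runtime API already created the primary context on first use,
       so this only bumps its refcount and returns the existing handle. */
    cuDevicePrimaryCtxRetain(&primary, dev);
    cuCtxSetCurrent(primary);

    /* ... cuModuleLoadData, cuLaunchKernel, etc. on the shared context ... */

    cuDevicePrimaryCtxRelease(dev);
}

int main(void)
{
    /* Runtime-API code: lazily creates the primary context on device 0. */
    float *p = NULL;
    cudaMalloc((void **)&p, 1024);

    driver_api_work(0);   /* driver-API library reuses that same context */

    cudaFree(p);
    return 0;
}
```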
Does this mean there won't be a problem when using a single multi-threaded application?
The driver API has been fully thread safe since about CUDA 2.0.
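So the threads of one process can simply share a single retained primary context; each thread only has to make it current before issuing driver calls. A minimal sketch (error handling omitted):

```c
#include <cuda.h>
#include <pthread.h>

static CUcontext g_ctx;   /* one primary context, shared by all threads */

static void *worker(void *arg)
{
    (void)arg;
    cuCtxSetCurrent(g_ctx);      /* bind this thread to the shared context */

    CUdeviceptr d;
    cuMemAlloc(&d, 1 << 20);     /* per-thread work on the same context */
    cuMemFree(d);
    return NULL;
}

int main(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&g_ctx, dev);

    pthread_t t[4];
    for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);

    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```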
Is the caller here any process? Can I just call retain in the first launched process and release in the last launched process of a list of hundreds of sequentially launched processes?
No. Contexts are always unique to a given process. They can't be shared between processes in this way.
Is the primary context intended for this usage pattern?
process-1: retain (creates)
process-2: retain (re-uses)
...
process-99: retain (re-uses)
process-100: 1 x retain and 100 x release (to decrement the counter to zero and unload at the end)
No.