exceptioncudacuda-gdb

States of memory data after cuda exceptions


CUDA document is not clear on how memory data changes after CUDA applications throws an exception.

For example, a kernel launch(dynamic) encountered an exception (e.g. Warp Out-of-range Address), current kernel launch will be stopped. After this point, will data (e.g. __device__ variables) on device still kept or they are removed along with the exceptions?

A concrete example would be like this:

  1. CPU launches a kernel
  2. The kernel updates the value of __device__ variableA to be 5 and then crashes
  3. CPU memcpy the value of variableA from device to host, what is the value the CPU gets in this case, 5 or something else?

Can someone show the rationale behind this?


Solution

  • The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.

    This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.

    Non-sticky errors are cleared automatically after they are returned by a cuda API call (with the exception of cudaPeekAtLastError). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of the cudaMemcpy call to transfer variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable -- it is as if the cudaMemcpy operation also failed in some unspecified way.

    Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.

    An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) cuda API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the cuda context is exactly the same as if the invalid operation had never been requested.

    This distinction between sticky and non-sticky error is called out in many of the documented error code descriptions, for example:

    synchronous, non-sticky, non-cuda-context-corrupting:

    cudaErrorMemoryAllocation = 2 The API call failed because it was unable to allocate enough memory to perform the requested operation.

    asynchronous, sticky, cuda-context-corrupting:

    cudaErrorMisalignedAddress = 74 The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.

    Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.