gpu nvidia power-management kepler nvml

nvmlDeviceGetPowerManagementMode() always returning NVML_ERROR_INVALID_ARGUMENT?


I am writing code to periodically measure the power usage of an NVIDIA Tesla K20 GPU (Kepler architecture) using the NVML API.

Variables:

nvmlReturn_t result;
nvmlEnableState_t pmmode;
nvmlDevice_t nvmlDeviceID;
unsigned int powerInt;

Basic code:

result = nvmlDeviceGetPowerManagementMode(nvmlDeviceID, &pmmode);
if (pmmode == NVML_FEATURE_ENABLED) {
    result = nvmlDeviceGetPowerUsage(nvmlDeviceID, &powerInt);
}

My issue is that nvmlDeviceGetPowerManagementMode always returns NVML_ERROR_INVALID_ARGUMENT. I have verified this.

The NVML API Documentation says that NVML_ERROR_INVALID_ARGUMENT is returned when either nvmlDeviceID is invalid or pmmode is NULL.

nvmlDeviceID is definitely valid, because I am able to query its properties and they match my GPU. I also don't see why I should initialize pmmode to anything, because the documentation describes it as a "Reference in which to return the current power management mode". For the record, I tried assigning NVML_FEATURE_ENABLED to it beforehand, but the result was the same.

I am clearly doing something wrong because other users of the system have written their own libraries using this function, and they face no problem. I am unable to contact them. What should I fix to get this function to work correctly?


Solution

  • The problem was not in the API call itself but elsewhere in the code; the answer may still be useful to others. Before attempting this fix, confirm that power management mode is actually enabled (check with nvidia-smi -q -d POWER).

    When the call returns the invalid argument error, the problem most likely lies with nvmlDeviceID. I said I was able to query the device properties, and at that point the handle was valid, but watch for any later API call that overwrites nvmlDeviceID.

    In my case, the following call passed an invalid index in some_variable, so nvmlDeviceID became invalid:

    nvmlDeviceGetHandleByIndex(some_variable, &nvmlDeviceID);
    

    It had to be changed to:

    nvmlDeviceGetHandleByIndex(0, &nvmlDeviceID);
    

    So the fix is either to remove every API call that overwrites or invalidates nvmlDeviceID after it has been obtained, or at least to make sure that the calls that remain leave it holding a valid handle.
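
As a rough illustration of the advice above, here is a minimal sketch that validates the index and checks every return value before the handle is used. The helper names device_index_in_range and get_power_mw and the USE_NVML build flag are hypothetical and do not come from the original code; the NVML part is only compiled when USE_NVML is defined, on the assumption that nvml.h and libnvidia-ml are available on the build machine.

```c
#include <stdio.h>

#ifdef USE_NVML   /* hypothetical flag: define it where nvml.h is available */
#include <nvml.h>
#endif

/* Hypothetical helper: returns 1 if `index` names one of `count` devices,
 * 0 otherwise. */
static int device_index_in_range(unsigned int index, unsigned int count)
{
    return index < count;
}

#ifdef USE_NVML
/* Hypothetical helper: reads the power draw (in milliwatts) of the device
 * at `index`. Every NVML return code is checked, so a bad index is reported
 * at the call that caused it instead of surfacing later as an invalid
 * argument in an unrelated query. Assumes nvmlInit() has already succeeded. */
static nvmlReturn_t get_power_mw(unsigned int index, unsigned int *power_mw)
{
    unsigned int count = 0;
    nvmlDevice_t device;
    nvmlEnableState_t mode;
    nvmlReturn_t rc;

    rc = nvmlDeviceGetCount(&count);
    if (rc != NVML_SUCCESS)
        return rc;

    if (!device_index_in_range(index, count)) {
        fprintf(stderr, "index %u out of range (%u devices)\n", index, count);
        return NVML_ERROR_INVALID_ARGUMENT;
    }

    rc = nvmlDeviceGetHandleByIndex(index, &device);
    if (rc != NVML_SUCCESS)
        return rc;

    rc = nvmlDeviceGetPowerManagementMode(device, &mode);
    if (rc != NVML_SUCCESS)
        return rc;   /* do not read `mode` if the call failed */

    if (mode != NVML_FEATURE_ENABLED)
        return NVML_ERROR_NOT_SUPPORTED;

    return nvmlDeviceGetPowerUsage(device, power_mw);
}
#endif
```

With USE_NVML defined, get_power_mw(0, &powerInt) would be called between nvmlInit() and nvmlShutdown(), and the program linked against libnvidia-ml. Keeping the handle in a local variable inside the helper also means no other part of the program can overwrite it.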