cuda gpu nvidia nvml

Using nvidia-smi, what is the best strategy to capture power?


I am using a Tesla K20c and measuring power with nvidia-smi while my application runs. My problem is that power consumption does not reach a steady state but keeps rising. For example, if my application runs for 100 iterations, power reaches 106 W (in 4 seconds); for 1000 iterations, 117 W (in 41 seconds); for 10000 iterations, 122 W (in 415 seconds); and so on, increasing slightly every time. I am looking for a recommendation on which power value I should record. My experimental setup has over 400 experiments, and running each one for 10000 iterations is not feasible, at least for now. The application is a matrix multiplication that completes a single iteration in just a few milliseconds. Increasing the number of iterations adds nothing to the results, but it lengthens the run time enough to allow power monitoring.


Solution

  • The reason you are seeing power consumption increase over time is that the GPU is heating up under a sustained load. Electronic components draw more power at increased temperature mostly due to an increase in Ohmic resistance. In addition, the Tesla K20c is an actively cooled GPU: as the GPU heats up, the fan on the card spins faster and therefore requires more power.

    I have run experiments on a K20c that were very similar to yours, out to about 10 minutes. I found that the power draw plateaus after 5 to 6 minutes, and that there are only noise-level oscillations of +/-2 W after that. These may be due to hysteresis in the fan's temperature-controlled feedback loop, or due to short-term fluctuations from incomplete utilization of the GPU at the end of every kernel. The difference in power draw attributable to fan speed was about 5 W. The reason it takes fairly long for the GPU to reach steady state is the heat capacity of the entire assembly, which has quite a bit of mass, including a solid metal back plate.

    Your measurements seem to be directed at determining the relative power consumption when running with 400 different variants of the code. It does not seem critical that steady-state power consumption is achieved, just that the conditions under which each variant is tested are as equal as is practically achievable. Keep in mind that the GPU's power sensors are not designed to provide high-precision measurements, so for comparison purposes you would want to assume a noise level on the order of 5%. For an accurate comparison you may even want to average measurements from more than one GPU of the same type, as manufacturing tolerances could cause variations in power draw between multiple "identical" GPUs.

    I would therefore suggest the following protocol: Run each variant for 30 seconds, measuring power consumption close to the end of that interval. Then let the GPU idle for 30 seconds to let it cool down before running the next variant. This should give roughly equal starting conditions for each variant. You may need to lengthen the proposed idle time a bit if you find that the temperature stays elevated for a longer time. The temperature data reported by nvidia-smi can guide you here. With this process, 400 variants at about 60 seconds each take roughly 6.7 hours, so you should be able to complete the testing in an overnight run.
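The run-then-idle protocol described above could be scripted roughly as follows. This is only a minimal sketch under several assumptions: nvidia-smi is on the PATH, the GPU of interest is index 0, the card reports `power.draw` (an unsupported field comes back as a non-numeric string, which would make the parse fail), and each variant can be launched as a command that keeps the GPU busy for at least 30 seconds. The helper names (`parse_sample`, `read_gpu`, `tail_mean`, `measure_variant`, `cool_down`), the 0.5 s sampling period, and the choice to average the last 20% of samples are illustrative choices, not part of the original suggestion.

```python
import subprocess
import time
from statistics import mean

# Query power draw (W) and temperature (deg C) for GPU 0 as bare CSV numbers.
QUERY = ["nvidia-smi", "--query-gpu=power.draw,temperature.gpu",
         "--format=csv,noheader,nounits", "-i", "0"]


def parse_sample(line):
    """Parse one 'power, temperature' CSV line into (watts, deg_c)."""
    power, temp = (field.strip() for field in line.split(","))
    return float(power), float(temp)


def read_gpu():
    """Query nvidia-smi once (assumes it is installed and on PATH)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_sample(out.stdout.strip().splitlines()[0])


def tail_mean(samples, fraction=0.2):
    """Average the last `fraction` of the samples (at least one sample)."""
    if not samples:
        raise ValueError("no power samples were collected")
    n = max(1, int(len(samples) * fraction))
    return mean(samples[-n:])


def measure_variant(cmd, run_s=30.0, period_s=0.5, sample=read_gpu):
    """Run `cmd` for up to run_s seconds while sampling power.

    Returns the mean power over the tail of the interval, i.e. the
    measurement "close to the end" of the run.
    """
    proc = subprocess.Popen(cmd)
    watts = []
    deadline = time.monotonic() + run_s
    try:
        while time.monotonic() < deadline and proc.poll() is None:
            watts.append(sample()[0])
            time.sleep(period_s)
    finally:
        if proc.poll() is None:
            proc.terminate()  # stop the variant once the interval is over
        proc.wait()
    return tail_mean(watts)


def cool_down(idle_s=30.0, target_c=None, period_s=1.0, max_s=300.0,
              sample=read_gpu):
    """Idle for idle_s seconds; if target_c is given, keep waiting until the
    reported temperature is at or below it, giving up after max_s seconds."""
    start = time.monotonic()
    time.sleep(idle_s)
    while target_c is not None and sample()[1] > target_c:
        if time.monotonic() - start >= max_s:
            break
        time.sleep(period_s)


# Hypothetical usage; the variant commands are placeholders:
#   results = {}
#   for cmd in (["./matmul_variant_a"], ["./matmul_variant_b"]):
#       results[cmd[0]] = measure_variant(cmd)
#       cool_down(idle_s=30.0)
```

Passing a `target_c` taken from the idle temperature observed before the first run is one way to lengthen the idle period automatically when 30 seconds turns out not to be enough. Given the roughly 5% sensor noise mentioned above, differences between variants smaller than that are probably best treated as indistinguishable.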