python nvidia

How to get every second's GPU usage in Python


I have a model that runs on tensorflow-gpu, and my device is an NVIDIA GPU. I want to record the GPU usage every second so that I can measure the average and maximum usage. I can do this manually by opening two terminals, one to run the model and another to monitor it with nvidia-smi -l 1, but that is not a good approach. I also tried using a Thread to do it; here is the code.

import subprocess as sp
import os
from threading import Thread

class MyThread(Thread):
    def __init__(self, func, args):
        super(MyThread, self).__init__()
        self.func = func
        self.args = args

    def run(self):
        self.result = self.func(*self.args)

    def get_result(self):
        return self.result

def get_gpu_memory():
   output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
   ACCEPTABLE_AVAILABLE_MEMORY = 1024
   COMMAND = "nvidia-smi -l 1 --query-gpu=memory.used --format=csv"
   memory_use_info = output_to_list(sp.check_output(COMMAND.split()))[1:]
   memory_use_values = [int(x.split()[0]) for i, x in enumerate(memory_use_info)]
   return memory_use_values

def run():
   pass

t1 = MyThread(run, args=())
t2 = MyThread(get_gpu_memory, args=())

t1.start()
t2.start()
t1.join()
t2.join()
res1 = t2.get_result()

However, this does not return the usage for every second either. Is there a good solution?


Solution

  • In the command nvidia-smi -l 1 --query-gpu=memory.used --format=csv

    the -l stands for:

    -l, --loop= Probe until Ctrl+C at specified second interval.

    So the command:

    COMMAND = 'nvidia-smi -l 1 --query-gpu=memory.used --format=csv'
    sp.check_output(COMMAND.split())
    

    will never terminate and return.

    It works if you move the loop out of the command (nvidia-smi) and into Python.

    Here is the code:

    import subprocess as sp
    from threading import Timer
    
    def get_gpu_memory():
        # Split the output into lines, dropping the empty string after the final newline.
        output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        COMMAND = "nvidia-smi --query-gpu=memory.used --format=csv"
        try:
            # [1:] skips the CSV header line.
            memory_use_info = output_to_list(sp.check_output(COMMAND.split(),stderr=sp.STDOUT))[1:]
        except sp.CalledProcessError as e:
            raise RuntimeError("command '{}' returned with error (code {}): {}".format(e.cmd, e.returncode, e.output))
        # Each remaining line looks like "1234 MiB"; keep only the number.
        memory_use_values = [int(x.split()[0]) for x in memory_use_info]
        return memory_use_values
    
    
    def print_gpu_memory_every_5secs():
        """
            This function calls itself every 5 seconds and prints the GPU memory.
        """
        Timer(5.0, print_gpu_memory_every_5secs).start()
        print(get_gpu_memory())
    
    print_gpu_memory_every_5secs()
    
    """
    Do stuff.
    """
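
To get the average and maximum the question asks for, the samples need to be stored rather than printed. The following is a minimal sketch of one way to do that, not part of the answer above: a background thread polls once per second and keeps every reading until it is stopped. The names GpuSampler, read_gpu_memory_mib and summarize are hypothetical helpers introduced for illustration, and the read_fn parameter is an assumption added so that the reader function can be swapped (for example to query a different field, or to test without a GPU).

```python
import subprocess as sp
import threading


def read_gpu_memory_mib():
    """Return memory.used in MiB for each GPU, from a single nvidia-smi call."""
    command = ["nvidia-smi", "--query-gpu=memory.used",
               "--format=csv,noheader,nounits"]
    output = sp.check_output(command).decode("ascii")
    # With noheader,nounits each non-empty line is just a number, one per GPU.
    return [int(line) for line in output.splitlines() if line.strip()]


class GpuSampler(threading.Thread):
    """Hypothetical helper: calls read_fn every `interval` seconds until stopped."""

    def __init__(self, read_fn=read_gpu_memory_mib, interval=1.0):
        super().__init__(daemon=True)
        self.read_fn = read_fn
        self.interval = interval
        self.samples = []  # one list of per-GPU values per poll
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            self.samples.append(self.read_fn())
            # Event.wait acts as a sleep that ends early once stop() is called.
            self._stop_event.wait(self.interval)

    def stop(self):
        """Stop polling, wait for the thread to finish, and return the samples."""
        self._stop_event.set()
        self.join()
        return self.samples


def summarize(samples, gpu_index=0):
    """Return the average and maximum of the readings for one GPU."""
    values = [sample[gpu_index] for sample in samples]
    return {"avg": sum(values) / len(values), "max": max(values)}
```

Typical use under these assumptions: create GpuSampler(), call start() before the model runs, and pass the list returned by stop() to summarize() afterwards. Because the thread is a daemon, it does not keep the program alive if the main script exits without calling stop().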