I'm running a parallel process using numba and CUDA (on windows) that will take quite a while. It would be nice to have an updating progress bar printed in the console, so I can see how far through all the threads it is. Something like tqdm would be absolutely perfect, but for CUDA.
I've tried using tqdm and numba-progress, but neither seems to work with CUDA. I've also tried my own class-based solution, but alas, you cannot pass classes into a kernel function (I think). I found this thread, which describes the same problem I want to solve, but it has no replies. All the other posts I've found have not been about CUDA.
Here's some skeleton code of what I'd like to put a progress bar on:
from __future__ import print_function, absolute_import
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil

# This function is just an example of what i'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number):
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
        return index

    # Ideally increment a global counter of some kind here, or have a module that does it for me
    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id))

number_of_samples = 10000000

threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)

rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)

generate_samples[blocks, threads_per_block](rng_states, out, 5)

print('Average Sample:', out.mean())
Any help would be massively appreciated!
You may be able to use a numba cuda mapped_array to help with this task. Under the hood, this tells numba to create a pinned host allocation and map it into the device address space, which informs numba not to copy it to the device, even though a pinned array would otherwise appear to numba like a host array.
Coupled with that, we need to make sure that numba is not automatically copying the other arrays, because that implicit copying results in synchronization in the "automatic" case, which we don't want.
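To show just the mapped_array mechanism in isolation, here is a minimal sketch (the kernel bump and its launch configuration are invented for this illustration): the host allocates a pinned, device-mapped counter, every thread increments it atomically, and the host can read it without any explicit copy.

from numba import cuda
import numpy as np

@cuda.jit
def bump(counter):
    # every thread atomically increments the shared counter
    cuda.atomic.add(counter, 0, 1)

counter = cuda.mapped_array(1, dtype=np.int64)  # pinned allocation, mapped into the device address space
counter[0] = 0

bump[4, 256](counter)   # the host can read counter even while the kernel is in flight
cuda.synchronize()
print(counter[0])       # 1024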
I don't really know how to measure the progress of that algorithm. For example, the while loop in poisson_sample seems to iterate 4 times on the item whose thread_id is zero, but I doubt that is true across the whole out array. (I do have a better idea about how to monitor the progress of other algorithms.)
If we know, from the kernel's progress report, how close the algorithm is to completion, then we can simply monitor the value reported by the kernel. When it gets to 100% (or nearly), we stop monitoring and proceed with the rest of the work.
I'll arbitrarily decide for demonstration purposes that the progress of this algorithm is measured by the number of threads that have completed the work.
When we are unable to determine completion from the kernel's progress report (e.g. your case, for me anyway), an alternative is to keep monitoring and reporting progress until kernel completion is signalled by an event.
Anyhow, the following works for me on Linux, as a rough sketch. It demonstrates the use of events, although if you can judge completion from the progress value itself, the events are not really needed. Here is the version with events:
$ cat t1.py
from __future__ import print_function, absolute_import
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil

# This function is just an example of what i'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate, progress):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number, progress):
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
        cuda.atomic.add(progress, 0, 1)
        return index

    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id), progress)

number_of_samples = 10000000
progress = cuda.mapped_array(1, dtype=np.int64)
progress[0] = 0
last_pct = 0
my_e = cuda.event()

threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)
my_divisor = (threads_per_block * blocks) // 100

rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
out_d = cuda.device_array_like(out)

generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
my_e.record()
print(last_pct)
while not my_e.query():
    cur_pct = progress[0]/my_divisor
    if cur_pct > last_pct + 10:
        last_pct = cur_pct
        print(cur_pct)

out = out_d.copy_to_host()
print('Average Sample:', out.mean())
$ python3 t1.py
0
10.00129996100117
20.00291991240263
30.004539863804087
40.00519984400468
50.00713978580642
60.00811975640731
70.00941971740848
80.01039968800936
90.01105966820995
Average Sample: 5.000568
$
Here is a version without events:
$ cat t2.py
from __future__ import print_function, absolute_import
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil

# This function is just an example of what i'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate, progress):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number, progress):
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
        cuda.atomic.add(progress, 0, 1)
        return index

    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id), progress)

number_of_samples = 10000000
progress = cuda.mapped_array(1, dtype=np.int64)
progress[0] = 0
last_pct = 0

threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)
my_divisor = (threads_per_block * blocks) // 100

rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
out_d = cuda.device_array_like(out)

generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
print(last_pct)
while last_pct < 90:
    cur_pct = progress[0]/my_divisor
    if cur_pct > last_pct + 10:
        last_pct = cur_pct
        print(cur_pct)

out = out_d.copy_to_host()
print('Average Sample:', out.mean())
$ python3 t2.py
0
10.000019999400019
20.000039998800037
30.000059998200054
40.000079997600075
50.00009999700009
60.00011999640011
70.00013999580013
80.00015999520015
90.00017999460016
Average Sample: 5.000568
$
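Since the question asks for something tqdm-like, the monitoring loop in either version could also drive a tqdm bar instead of printing percentages. A rough sketch of that adaptation, assuming the same setup as t2.py above (kernel already launched with the mapped progress array, and one counter increment per thread):

import time
from tqdm import tqdm

total_threads = threads_per_block * blocks
with tqdm(total=total_threads) as pbar:
    done = 0
    while done < total_threads:
        new_done = int(progress[0])    # read the mapped counter while the kernel runs
        pbar.update(new_done - done)   # advance the bar by the delta since the last poll
        done = new_done
        time.sleep(0.05)               # don't spin the CPU at full speed
cuda.synchronize()                     # make sure the kernel is done before copying results back
out = out_d.copy_to_host()

This works because each thread bumps the counter exactly once, so the counter reaches total_threads exactly when the kernel has finished its work.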
I ran both t1.py and t2.py on Linux. The version without events may work better on Windows, or possibly the other way around (the event query may help push work submission along). If you are using a display GPU on Windows (i.e. a GPU not in TCC mode), then WDDM work batching/scheduling may present an issue. You could try both settings for Windows Hardware-Accelerated GPU Scheduling to see if one works better than the other.
Also, this kernel runs in less than a second on my GPU (the kernel duration is actually about 300 ms on my GTX 970), so this might not be a very interesting test case.
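For reference, that ~300 ms figure can be measured with CUDA timing events in numba; a rough sketch, reusing the variables from the listings above:

start = cuda.event(timing=True)
end = cuda.event(timing=True)

start.record()
generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
end.record()
end.synchronize()   # wait for the kernel and the end event to complete

print('kernel duration (ms):', cuda.event_elapsed_time(start, end))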