python, parallel-processing, gpu, pyopencl, particle-filter

Maximum number of parallel processes on a simple CPU/GPU


I am trying to run a particle filter with 3000 independent particles. More specifically, I would like to run 3000 (simple) computations in parallel so that the computation time remains short.

This task is designed for experimental applications on laboratory equipment, so it has to run on a local laptop. I cannot rely on a remote cluster of computers, and the computers that will be used are unlikely to have high-end Nvidia graphics cards. For instance, the computer I am currently working with has an Intel Core i7-8650U CPU and an Intel UHD Graphics 620 GPU.

Calling mp.cpu_count() from Python's multiprocessing library tells me that I have 8 processors, which is far too few for my problem (I need to run several thousand processes in parallel). I therefore looked at GPU-based solutions, and in particular at PyOpenCL. The Intel UHD Graphics 620 GPU is supposed to have only 24 processors; does that mean I can only use it to run 24 processes in parallel at the same time?
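
For reference, a minimal snippet (just an illustration, assuming the pyopencl package is installed; it is not part of my code) that prints what multiprocessing and PyOpenCL report about the hardware:

import multiprocessing as mp
import pyopencl as cl

print('Logical CPUs reported by multiprocessing:', mp.cpu_count())

# List every OpenCL platform/device and the number of compute units it exposes
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, '/', device.name,
              '- max compute units:', device.max_compute_units)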

More generally, is my goal (running 3000 processes in parallel on an ordinary laptop using Python) realistic, and if so, which software solution would you recommend?

EDIT

Here is my pseudo-code. At each time step i, I call the function posterior_update. This function calls the function approx_likelihood 3000 times, independently (once per particle), and approx_likelihood seems hard to vectorize. Ideally, I would like these 3000 calls to take place independently and in parallel.

import numpy as np
import scipy.stats
from collections import Counter
import random
import matplotlib.pyplot as plt
import os
import time

# User's inputs ##############################################################

# Numbers of particles
M_out           = 3000

# Defines a bunch of functions ###############################################

def approx_likelihood(i,j,theta_bar,N_range,q_range,sigma_range,e,xi,M_in):

    # Double sum over the inner-particle indices kk and nn
    N = int(N_range[theta_bar[j,0]] + 1)
    return sum(scipy.stats.norm.pdf(e[i], loc=q_range[theta_bar[j,2]]*kk,
                                    scale=sigma_range[theta_bar[j,3]]) * xi[nn,kk] / M_in
               for kk in range(N) for nn in range(N))
    
def posterior_update(i,T,e,M_out,M_in,theta,N_range,p_range,q_range,sigma_range,tau_range,X,delta_t,ML):
         
    theta_bar = np.zeros([M_out,5], dtype=int)
    x_bar = np.zeros([M_out,M_in,2], dtype=int)
    u = np.zeros(M_out)
    x_tilde = np.zeros([M_out,M_in,2], dtype=int)    
    w = np.zeros(M_out)
    
    # Loop over the outer particles 
    for j in range(M_out):
                    
        # Compute the approximate likelihood u for particle j
        # (theta_bar and xi are filled in by steps omitted from this pseudo-code)
        u[j] = approx_likelihood(i,j,theta_bar,N_range,q_range,sigma_range,e,xi,M_in)
    
    # Keep the maximum-likelihood parameter estimate
    ML[i,:] = theta_bar[np.argmax(u),:]
    # Compute the normalized weights w
    w = u/sum(u)
    # Resample (resample is defined elsewhere in the full code)
    X[i,:,:,:],theta[i,:,:] = resample(M_out,w,x_tilde,theta_bar)
       
    return X, theta, ML

# Loop over time #############################################################
    
for i in range(T):
    
    print('Progress {0}%'.format(round((i/T)*100,1)))
        
    X, theta, ML = posterior_update(i,T,e,M_out,M_in,theta,N_range,p_range,q_range,sigma_range,tau_range,X,delta_t,ML)

Solution

  • These are some ideas, not an answer to your question:

    import multiprocessing as mp

    def f(j):
        # Wrapper that fixes everything except the particle index j
        return approx_likelihood(i, j, theta_bar, N_range, q_range, sigma_range, e, xi, M_in)

    with mp.Pool() as pool:
        # Spread the M_out independent calls over all available cores;
        # chunksize batches the work to limit inter-process overhead
        u = pool.map(f, range(M_out), chunksize=50)
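
  • One caveat on the sketch above: pool.map can only use f if it is picklable, which in practice means f must be defined at module level, and with the default "spawn" start method on Windows and macOS the workers may not see the current values of i, theta_bar, etc. as globals the way a forked worker on Linux does. A variant that passes those arguments explicitly with functools.partial (still only a sketch; one_particle and worker are illustrative names, and it assumes approx_likelihood is defined at module level) could look like this:

    import multiprocessing as mp
    from functools import partial

    def one_particle(j, i, theta_bar, N_range, q_range, sigma_range, e, xi, M_in):
        # Module-level wrapper so the pool workers can pickle it
        return approx_likelihood(i, j, theta_bar, N_range, q_range, sigma_range, e, xi, M_in)

    if __name__ == '__main__':
        # Bind every argument except the particle index j
        worker = partial(one_particle, i=i, theta_bar=theta_bar, N_range=N_range,
                         q_range=q_range, sigma_range=sigma_range, e=e, xi=xi, M_in=M_in)
        with mp.Pool() as pool:
            u = pool.map(worker, range(M_out), chunksize=50)

  • On the "hardly vectorizable" point: if approx_likelihood really is exactly the double sum shown in your pseudo-code, the norm.pdf term depends only on kk, so the inner sum over nn can be collapsed and the whole thing computed with NumPy, no Python loop at all (which may help more than parallelizing the outer loop, and can be combined with it). A sketch under that assumption:

    import numpy as np
    import scipy.stats

    def approx_likelihood_vec(i, j, theta_bar, N_range, q_range, sigma_range, e, xi, M_in):
        N = int(N_range[theta_bar[j, 0]] + 1)
        kk = np.arange(N)
        # Vector of norm.pdf values, one entry per kk
        pdf = scipy.stats.norm.pdf(e[i],
                                   loc=q_range[theta_bar[j, 2]] * kk,
                                   scale=sigma_range[theta_bar[j, 3]])
        # sum_kk pdf[kk] * sum_nn xi[nn, kk], divided by M_in
        return pdf.dot(xi[:N, :N].sum(axis=0)) / M_in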