Is there any way that I can convert my cpu parallel code to the cuda?

I have written a code which works on CPU cores very well. But it's not fast enough for me. I want to run it on CUDA cores and I already tried to write a kernel for the montecarlo part of it and etc. But whatever I do I cannot run it properly.

So I have two questions :

Is it a good idea to run it on CUDA? Will there be any difference ?
Can you show me a method that I can change it to a CUDA version ?

import os
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import trange
from numba import jit
from numba import misc as msc
import random
import itertools
from joblib import Parallel, delayed  # Import joblib for parallel processing
from scipy.special import binom, comb
from math import factorial
import pickle

# # Ensure 'output' folder exists in the current working directory
# script_dir = os.path.dirname(os.path.realpath(__file__))
# output_dir = os.path.join(script_dir, 'output')
# if not os.path.exists(output_dir):
#     os.makedirs(output_dir)

# Running simulation
n = 2  # Number of spins in the many-body system
omegas = np.arange(0.1, 12, 0.2)
ents = []

# Parameters
dt = 0.002
kappa = 1.0
N = 2  # Number of trajectories
nt = 100000  # Number of timesteps which makes total time T = nt * dt = 500
J = n // 2

def precompute_binomial_coefficients(N):
    binomial_coeffs = np.zeros((N + 1, N + 1))
    for n in range(N + 1):
        for k in range(n + 1):
            binomial_coeffs[n, k] = comb(n, k)
    return binomial_coeffs
binomial_coeffs = precompute_binomial_coefficients(N)


# Incase it's needed
@jit(nopython=True)
def dicke_basis(n, m):
    """Generate Dicke basis states for a specified number of spins and value of m."""
    basis_states = []
    num_up_spins = int(m + (J))
    for positions in itertools.combinations(range(n), num_up_spins):
        state = np.zeros(n, dtype=np.int8)
        for pos in positions:
            state[pos] = 1
        basis_states.append(state)
    return basis_states

@jit(nopython=True)
def s_z_operator(n):
    return np.diag(np.arange((J), (-J) - 1, -1))

@jit(nopython=True)
def s_minus_operator(n):
    s_minus = np.zeros((int(n + 1), int(n + 1)), dtype=np.complex128)
    for m in range(1, int(n + 1)):
        S_z = (n / 2) - (m - 1)
        s_minus[m, m - 1] = np.sqrt((J) * ((J) + 1) - S_z * (S_z - 1))
    return s_minus

@jit(nopython=True)
def s_plus_operator(n):
    return s_minus_operator(n).T.conj()

@jit(nopython=True)
def s_x_operator(n):
    return 0.5 * (s_plus_operator(n) + s_minus_operator(n))

@jit(nopython=True)
def s_y_operator(n):
    return -1j * (s_plus_operator(n) - s_minus_operator(n)) * 0.5




@jit(nopython=True)
def normstate(state):
    return np.real(np.dot(state.conj().T, state))
s_z = s_z_operator(n)
s_m = s_minus_operator(n)
s_x = s_x_operator(n)
s_y = s_y_operator(n)
s_p = s_plus_operator(n)

for i in range(len(omegas)):

 

  #Full density matrix
  @jit(nopython=True)
  def full_density_matrix(P_ne):
      """Construct the full density matrix from probability amplitudes P_ne."""
      rho_N = np.outer(P_ne, P_ne.conj())
      return rho_N

  

  # Calculating reduced density matrix
  @jit(nopython=True)
  def reduced_density_matrix(rho_N, k, binomial_coeffs):
      """Calculate the reduced density matrix of k spins."""
      N = rho_N.shape[0] - 1
      L = N - k  # Number of spins in the remaining partition
      rho_k = np.zeros((k + 1, k + 1), dtype=np.complex128)

      for ne in range(N + 1):
          for ne_prime in range(N + 1):
              for le in range(min(ne, k) + 1):
                  coef = (
                      binomial_coeffs[L, le]
                      * np.sqrt(
                          binomial_coeffs[N - L, ne - le]
                          * binomial_coeffs[N - L, ne_prime - le]
                          / (binomial_coeffs[N, ne] * binomial_coeffs[N, ne_prime])
                      )
                  )
                  if 0 <= (ne - le) <= k and 0 <= (ne_prime - le) <= k:
                      rho_k[ne - le, ne_prime - le] += coef * rho_N[ne, ne_prime]

      return rho_k

  # Neumann entanglement entropy
  @jit(nopython=True)
  def entanglement_entropy(rho_k):
      """Calculate the entanglement entropy from the reduced density matrix rho_k."""
      eigenvalues = np.linalg.eigvalsh(rho_k)
      entropy = -np.sum(eigenvalues * np.log(eigenvalues + 1e-12))  # Avoid log(0)
      return entropy

  #Monte carlo process
  @jit(nopython=True)
  def montecarlo(state_init, nt, dt, binomial_coeffs):
      """Perform a Monte Carlo simulation for a single trajectory in Dicke basis."""
      state = state_init.copy()  # Use a copy to avoid modifying the original
      trajectory = np.zeros(nt + 1, dtype=np.float64)  # Preallocate trajectory array
      trajectory[0] = normstate(state)  # Store initial state norm
      num_jumps = 0
      entropies = np.zeros(nt + 1)
      eye = np.eye(n + 1, dtype=np.complex128)
      spm = np.dot(s_p, s_m)
      for step in range(nt):
          new_state = np.dot((eye - 1j * H_eff * dt), state)  # Evolve state
          norm_new_state = normstate(new_state)
          r = np.random.rand()
          p = dt * normstate(np.dot(op_coll, state))

          if r < p:
              # Calculate J_+ J_- expectation value for normalization
              norm_factor = np.dot(state.conj().T, np.dot(spm, state))

              # Apply the jump and normalize by sqrt(norm_factor)
              jump_state = np.dot(op_coll, state)
              state = jump_state / np.sqrt(norm_factor)
              num_jumps += 1
          else:
              state = new_state / np.sqrt(norm_new_state)
          trajectory[step + 1] = np.abs(state[0]) ** 2
          rho_N = full_density_matrix(state)
          k = J
          rho_k = reduced_density_matrix(rho_N, k, binomial_coeffs)  # Pass binomial_coeffs here
          entropies[step + 1] = entanglement_entropy(rho_k)
      return trajectory, num_jumps, entropies


  def run_trajectories(N, nt, dt, binomial_coeffs):
      """Run N trajectories in parallel and average the results."""
      # Use joblib to run Monte Carlo simulations in parallel
      results = Parallel(n_jobs=-1)(
          delayed(montecarlo)(state_init, nt, dt, binomial_coeffs) for _ in trange(N)
      )

      # Unzip the results
      trajectories, jumps, entropies = zip(*results)

      trajectories = np.array(trajectories)  # Convert list of trajectories to numpy array
      jumps = np.array(jumps)  # Convert list of jumps to numpy array
      entropies = np.array(entropies)  # Convert list of entropies to numpy array
      # Average the trajectories manually
      # averaged = np.zeros(nt + 1, dtype=np.float64)
      # for j in range(nt + 1):
      #     averaged[j] = np.sum(trajectories[:, j]) / N

      return entropies




  omega_0 = omegas[i]
  
  H = omega_0 * s_x
  H_eff = H - 1j * (kappa / (2 * (n / 2))) * np.dot(s_p, s_m)
  op_coll = (kappa / J) * s_m.astype(np.complex128)  # Ensure op_coll is complex128
  state_init = np.zeros(n + 1, dtype=np.complex128)
  state_init[0] = 1.0  # Initial state in Dicke basis
  entropies = run_trajectories(N, nt, dt, binomial_coeffs)

  avgent = np.mean(entropies, axis=0)
  ents.append(avgent)
  plt.plot(ents[i])
  plt.show()
  plt.cla()
  # Save omegas and entropies in the 'output' folder

#entropy_file = os.path.join(output_dir, f"ents-{n}.npy")
np.save(f"ents-{n}.npy", ents)  # Use np.save to store numpy arrays
print(f"Saved results for n={n} in the output folder.")

Solution

But it's not fast enough for me

Before trying to run this on a GPU, you should really profile the CPU version. It turns ou most of the time is spent compiling the code... Indeed, the JIT functions are in a loop so they are redefined and recompiled every-time. Compilation is an expensive process so you should really avoid that. The functions should be defined and compiled once for sake of performance (you can specialize the functions in some cases, but this makes no sens when the compilation time is bigger than the run time).

It turns out that some functions access global variables. Be aware that these variables are considered as compile-time constant in Numba (and they theoretically cannot be mutated). Numba can replace the value by constant in the code and even pre-compute values depending on them (which can sometimes be pretty slow and generate huge assembly codes). You should avoid using large global arrays and never write into them. In fact, it is generally better to pass such array to functions because global variables tends to be bad in term of software engineering, especially when they can be mutated (it is like a "spooky action at a distance" ;) ).

Using Joblib is not a good idea with Numba because it might create processes that needs to compile Numba functions for each processes (which is very inefficient). Numba supports multithreading thanks to parallel=True and prange-based loop which are generally more efficient than using Joblib so to call jitted Numba functions (and also more flexible when it comes to optimizations).

I strongly advise you to improve the CPU version first and check all cores are at least use at they full capacity before using GPUs.

Is it a good idea to run it on CUDA?

For a code to run faster on GPU, you need a lot of parallelism; significantly more than on CPU. The thing is GPUs are not just CPUs with more cores, but a very different kind of hardware with a lot of constraints. The code also needs to be SIMT-friendly. Basically, you need to split the work so several thousands of CUDA threads can operate simultaneously (even dozens of thousands of threads ideally). Moreover, the code should be written so CUDA threads access memory in a coalesced way for sake of performance. The code should also avoid warp divergence. There are other strong restrictions: a thread needs not to use too many registers (otherwise it will decrease the occupancy and so generally the performance too), GPU kernels can only allocate a pretty-limited amount of memory (e.g. 16 MiB, always preallocated, and possibly modified before all kernel execution). As a result, efficiently porting relatively-big codes (like this one) to GPUs is often pretty-difficult.

Here n and N are far too small; only nt is big enough to provide enough parallelism assuming the nt-based loop can be executed in parallel safely (i.e. no loop-carried dependencies). I wonder if state is a dependency... If the loop needs to be executed serially, then there is no chance for this to be executed efficiently on GPU unless I miss any other source of parallelism.

Even assuming there is enough parallelism, I think porting efficiently this code to GPU is a relatively huge work for a beginner!

Can you show me a method that I can change it to a CUDA version ?

The standard way in Numba is to write Numba CUDA kernels. This is explained in the documentation so please read it.

Alternatively, you can write the code so all operation are vectorized in Numpy and then easily move the work on GPU using CuPy. That being said, this is often not as efficient as writing GPU kernels (and people use GPUs in the first place so to make things faster). I am not sure this (i.e. using only Numpy vectorized function or trivial mapping functions) can be done easily here or even efficiently...