python numpy gpu numba pde

Solving 1D heat equation on GPU in Numba


I am new to GPU programming, and I am trying to write a kernel in Numba to numerically solve the 1D heat equation. I also wrote a NumPy version of the PDE solver, and it turned out that the GPU kernel doesn't produce the correct result. Below is a comparison of the state vectors computed by the two scripts:

[Image: plot comparing the final state vectors produced by the GPU kernel and the NumPy solver]

Moreover, the kernel generates slightly different results on each run. It's probably a problem related to thread management, even though I synchronize the threads at every iteration. Some help would be really appreciated.

from numba import cuda, void, float32
import numpy as np
import scipy.stats as stats
import time
import matplotlib.pyplot as plt

    
##################### Numba GPU Version


@cuda.jit(void(float32[::1], float32[::1]))
def solve_pde(u, parameters):

    # Space and time parameters
    dx = parameters[0]
    dt = parameters[1]
    t = parameters[2]
    t_end = parameters[3]
    u_size = u.size

    # Index of thread on GPU
    i = cuda.grid(1)

    # Condition to avoid threads accessing indices out of array
    if i < u_size:

        while t < t_end:
            
            if(i in [0, 1, u_size-2, u_size-1]):  # Natural boundary conditions
                
                u[i] = np.float32(0.)
            
            else:
                
                # Compute second order derivatives
                RHS = np.float32(0.005)*(u[i + 1] - 2*u[i] + u[i - 1])/(dx*dx)

                # Update state vector
                u[i] += RHS*dt

            # Update time
            t += dt
            
            # Wait until all threads finish computing
            cuda.syncthreads()


# Space and time parameters
dx = 0.01
dt = 0.01
t0 = 0
t_end = 200
parameters = np.array([dx, dt, t0, t_end], dtype="float32")

# Initial state vector
x = np.linspace(0, 6, int(6/dx), dtype="float32")
u = np.array(stats.norm.pdf(x, 3, 0.3), dtype="float32")

# Manage the number of threads
threads_per_block = 32
blocks_per_grid = (u.size + (threads_per_block - 1)) \
        // threads_per_block

# Send the state vector and the parameters to the device
d_u = cuda.to_device(u)
d_parameters = cuda.to_device(parameters)

# Start timer
start = time.perf_counter()

# Start parallel simulations
solve_pde[blocks_per_grid, threads_per_block](d_u, d_parameters)

# Move the final state vector to the host
u_end = d_u.copy_to_host()

# Measure the time elapsed and print the result
end = time.perf_counter()
print(end - start)

# Plot the final state vector
plt.figure(figsize=(14, 10))
plt.plot(x, u_end, 'b-')



##################### Numpy Version



u = np.array(stats.norm.pdf(x, 3, 0.3), dtype="float32")
u_size = u.size

t = t0


while t < t_end:
    
    for i in range(u_size):
        
        if(i in [0, 1, u_size-2, u_size-1]):
            
            u[i] = 0
        
        else:
               
            RHS = 0.005*(u[i + 1] - 2*u[i] + u[i - 1])/(dx*dx)
            u[i] += RHS*dt
    
    t += dt


plt.figure(figsize=(14, 10))
plt.plot(x, u, 'r-')  

Solution

  • The issue certainly comes from u being both read and written by GPU threads at the same time, causing a race condition. You need to work on two different buffers to prevent this problem. Note that you can swap the buffers at the end of each computation step (see the sketch below).

    Moreover, note that cuda.syncthreads does not "wait until all threads finish computing". It is a block-level synchronization barrier. AFAIK, if you want to wait for all threads to finish a given time step, the only way is to run the CUDA kernel once again (one launch per time step).

    Note that launching a kernel is quite expensive, so running such a computation on a GPU is only worthwhile compared to a CPU if the array to be computed is huge (e.g. certainly at least 100_000 cells in your case). Besides this, note that 1.0/(dx*dx) can be precomputed to avoid a slow division.
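
    Here is a minimal sketch of what a double-buffered version could look like, reusing u, dx, dt, t0, t_end, threads_per_block and blocks_per_grid from the question's script. The kernel name heat_step and the helper names (inv_dx2, dt32, d_u_in, d_u_out) are made up for the example; it is an illustration of the idea, not a drop-in replacement. Each time step reads from one device buffer and writes into the other, the kernel is launched once per step so each launch acts as a global barrier, and 1/(dx*dx) is precomputed on the host.

from numba import cuda, void, float32
import numpy as np

# Double-buffered kernel: reads u_in, writes u_out, so no thread ever
# reads a value that another thread is updating during the same step.
@cuda.jit(void(float32[::1], float32[::1], float32, float32))
def heat_step(u_in, u_out, inv_dx2, dt):

    i = cuda.grid(1)
    u_size = u_in.size

    if i < u_size:
        if i < 2 or i >= u_size - 2:  # Natural boundary conditions
            u_out[i] = np.float32(0.)
        else:
            RHS = np.float32(0.005)*(u_in[i + 1] - 2*u_in[i] + u_in[i - 1])*inv_dx2
            u_out[i] = u_in[i] + RHS*dt


# Precompute 1/(dx*dx) once on the host to avoid a division in the kernel
inv_dx2 = np.float32(1.0/(dx*dx))
dt32 = np.float32(dt)

# Two device buffers: one is read, the other written, then they are swapped
d_u_in = cuda.to_device(u)
d_u_out = cuda.device_array_like(d_u_in)

# One kernel launch per time step; the launch itself is the global barrier
t = t0
while t < t_end:
    heat_step[blocks_per_grid, threads_per_block](d_u_in, d_u_out, inv_dx2, dt32)
    d_u_in, d_u_out = d_u_out, d_u_in  # swap buffers for the next step
    t += dt

u_end = d_u_in.copy_to_host()

    Because the buffers are swapped after every launch, the most recent state always ends up in d_u_in, which is why that buffer is the one copied back to the host.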