pythonnumpypde

Why is reshape/ravel in NumPy faster (or about as fast) than explicit flat indexing for 2D PDE solvers?


I'm solving a 2D PDE using solve_ivp from SciPy and was trying to optimize my function by avoiding calls to .reshape() and .ravel() on every timestep. I rewrote my dudt function to operate purely in 1D using flat indexing and manually computed neighbor indices to apply the Laplacian stencil.

This leads to three questions:

  1. Is there a silly mistake I didn't catch and is my re-indexed version slow because of it.

  2. Is reshape() (and ravel()) really that fast in NumPy? Does it make a copy or just return a view?

  3. Why isn't the "fully vectorized flat indexing" version faster, even though it avoids reshaping altogether?

Any insights into how NumPy handles memory views, indexing, and performance in this context would be appreciated.

PDE Solution Output with Heat Source and Heat Sink

Here's my full code:

import os

import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt

# +++++++++++++++++++++++++++++

t_start     =    0
t_end       =    1
N_timesteps =  100
my_t        = np.linspace(t_start, t_end, N_timesteps)

x_start    =   0
y_start    =   0
x_end      =   2
y_end      =   2
Nx_spatial =  80
Ny_spatial = 100
Ni         = Nx_spatial
Nj         = Ny_spatial
my_x       = np.linspace(x_start, x_end, Nx_spatial)
my_y       = np.linspace(x_start, x_end, Ny_spatial)
my_dx      = my_x[1] - my_x[0]
my_dy      = my_y[1] - my_y[0]
X, Y       = np.meshgrid(my_x, my_y)

my_u0      = np.zeros_like(X)

def gaussian_2d(x, y, a, mx, my, sx, sy): return a * np.exp(-((x - mx)**2 / (2 * sx**2) + (y - my)**2 / (2 * sy**2)))
my_b       = gaussian_2d(X, Y, a=100, mx=0.5, my=0.5, sx=0.1, sy=0.1) + chm.gaussian_2d(X, Y, a=-100, mx=1.5, my=1.0, sx=0.1, sy=0.1)

# my re-indexization matrices

K_center = np.arange(Ni*Nj).reshape(Nj, Ni)
K_right  = np.roll(K_center,  1, axis = 1)
K_left   = np.roll(K_center, -1, axis = 1)
K_up     = np.roll(K_center,  1, axis = 0)
K_down   = np.roll(K_center, -1, axis = 0)

center   = K_center.ravel()
right    = K_right.ravel()
left     = K_left.ravel()
up       = K_up.ravel()
down     = K_down.ravel()

def dudt(t, u_flat, dx, dy, b):
    u       = u_flat.reshape(Ny_spatial, Nx_spatial)
    
    # Neumann
    u[ 0, :] = u[ 1, :]
    u[-1, :] = u[-2, :]
    u[:,  0] = u[:,  1]
    u[:, -1] = u[:, -2]
    
    d2u_dx2 = (1/dx**2) * (np.roll(u, 1, axis=0) - 2*u + np.roll(u, -1, axis=0))
    d2u_dy2 = (1/dy**2) * (np.roll(u, 1, axis=1) - 2*u + np.roll(u, -1, axis=1))
    
    # Hm... this re-indexed variant ain't actually faster. :(
    # d2u_dy2 = (1/dy**2) * (u[up] - 2*u[center] + u[down])
    # d2u_dx2 = (1/dx**2) * (u[right] - 2*u[center] + u[left])
    #
    du_dt   = (d2u_dx2 + d2u_dy2) + b 
    
    # Dirichlet
    du_dt[:,  0] = 0
    du_dt[:, -1] = 0
    du_dt[ 0, :] = 0
    du_dt[-1, :] = 0
    #

    return  du_dt.ravel()
    # return du_dt

sol = sp.integrate.solve_ivp(
    fun    = lambda t, u: dudt(t, u, dx=my_dx, dy=my_dy, b=my_b),
    t_span = (t_start, t_end),
    y0     = my_u0.ravel(),
    method = "RK45",
    t_eval = my_t,
    )

    
print("ok I'm plotting now.")
from matplotlib.animation import FuncAnimation
    
if True:
    fig, ax = plt.subplots()
    im = ax.imshow(sol.y[:, 0].reshape(Ny_spatial, Nx_spatial),
                   extent=[x_start, x_end, y_start, y_end],
                   origin='lower',
                   cmap='viridis',
                   vmin=np.min(sol.y), vmax=np.max(sol.y)+0.1)
    cbar = plt.colorbar(im, ax=ax)

    # Create meshgrid for quiver (skip every 5 points)
    x = np.linspace(x_start, x_end, Nx_spatial)
    y = np.linspace(y_start, y_end, Ny_spatial)
    X, Y = np.meshgrid(x, y)

    skip = (slice(None, None, 5), slice(None, None, 5))  # every 5th point
    quiv = ax.quiver(X[skip], Y[skip], np.zeros_like(X[skip]), np.zeros_like(Y[skip]), color='white', scale=100)

    def update(frame):
        u = sol.y[:, frame].reshape(Ny_spatial, Nx_spatial)
        im.set_array(u)

        # Compute gradient (dy, dx order!)
        dy, dx = np.gradient(u, y, x)
        quiv.set_UVC(-dx[skip], -dy[skip])

        ax.set_title(f"t = {my_t[frame]:.2f}")
        return [im, quiv]

    ani = FuncAnimation(fig, update, frames=N_timesteps, interval=50, blit=False)
    plt.show()


Solution

  • Is reshape() (and ravel()) really that fast in NumPy? Does it make a copy or just return a view?

    reshape try not to do any copy if possible but there are cases where this is required. For example, if you do arr.T.reshape(-1), the .T does a lazy transposition and then .reshape(-1) performs a copy because the 1D array cannot be represented as a view (with a single stride value). Note reshape has a copy parameter you can set to False to be sure that there are no copy done, or to enforce a copy if you wish.

    ravel returns a contiguous flatten array so if the input array is not a contiguous ND array, it will perform a copy. reshape tends to perform slightly less copy in practice than ravel because reshape might return a view which is not contiguous.

    When a view is returned without doing any copy, the function is very fast (typically no more than few µs on a mainstream PC). When a copy is performed, the operation is significantly slower. Indeed, there is not only a copy to perform but also often the expensive overhead of page faults to pay (i.e. writing in newly created array located in a new memory area for the first time).

    Why isn't the "fully vectorized flat indexing" version faster, even though it avoids reshaping altogether?

    The problem with indexing is it is not SIMD friendly. Moreover, operations like u[center] create a new expensive temporary array, not a cheap view on u. THis specific one is fundamentally inefficient and necessary: u[center] should AFAIK be simply u. Additionally, the indexing array needs to be read (which is far from being cheap since the operation is likely be memory bound).

    Meanwhile, while np.roll also create a new temporary array, it is SIMD-friendly and read less data from memory. Moreover, you use u directly (instead of u[center]) so this part of the expression does not produce an expensive temporary array.


    Faster implementation

    It seems you want to do a stencil but there is no need for np.roll nor indexing. You can just use views and don't care about the borders since they are overwritten just after the computation. Here is an (untested) example

    tmp_2u = 2 * u[1:-1,1:-1]                              # Compute the center only once
    d2u_dx2 = (1/dx**2) * (u[2:,:] - tmp_2u + u[:-2,:])    # Use cheap views instead of indexing
    d2u_dy2 = (1/dy**2) * (u[:,2:] - tmp_2u + u[:,:-2])
    
    du_dt = np.zeros_like(u)                               # For borders
    du_dt[1:-1,1:-1] = (d2u_dx2 + d2u_dy2) + b[1:-1,1:-1]
    

    This should be significantly faster but still far from being optimal. Indeed, this still creates many temporary arrays. For example, 2 * u[1:-1,1:-1] creates a new array, same for u[2:,:] - tmp_2u and then for the final array d2u_dx2 or also for d2u_dx2 + d2u_dy2.

    You can reduce the number of new temporary arrays with in-place Numpy operations. Here is an example:

    tmp_2u = 2 * u[1:-1,1:-1]
    
    tmp = u[2:,:] - tmp_2u
    np.add(tmp, u[:-2,:], out=tmp)
    d2u_dx2 = np.multiply(1/dx**2, tmp, out=tmp)
    
    tmp = u[:,2:] - tmp_2u
    np.add(tmp, u[:,:-2], out=tmp)
    d2u_dy2 = np.multiply(1/dy**2, tmp, out=tmp)
    
    du_dt = np.zeros_like(u)
    tmp = du_dt[1:-1,1:-1]
    np.add(d2u_dx2, d2u_dy2, out=tmp)      # Writes directly in du_dt[1:-1,1:-1]
    np.add(tmp, b[1:-1,1:-1], out=tmp)
    

    This only creates 3 temporary arrays instead of 9 of the previous code. It is also still SIMD-friendly.

    For better performance, please consider using modules specialised for stencil computations, or to use Numba/Cython. They avoid creating temporary arrays and better use the memory so to produce a faster code also taking less memory.