Tags: python, numpy, performance, scipy

SciPy BPoly.from_derivatives compared with NumPy


I implemented the comparison below between NumPy and SciPy for the same function interpolation. The results show how NumPy crushes SciPy.

Python version: 3.11.7
NumPy version: 2.1.3
SciPy version: 1.15.2
Custom NumPy interpolation matches SciPy BPoly for 10 000 points.
SciPy coeff time: 0.218046 s
SciPy eval time : 0.000725 s
Custom coeff time: 0.061066 s
Custom eval time : 0.000550 s

Edit: the 4x to 10x range I give below is likely too pessimistic; after the latest streamlining of the code, the speedup seems very significant on average.

Depending on the system, I get somewhere between a 4x and 10x speedup with NumPy. This is not the first time I have encountered this. So, in general, I wonder:

Why this huge difference in performance? Would we do well to view SciPy as just a reference implementation, and go down other pathways for performance (NumPy, Numba)?

# BPoly.from_derivatives example with 10 000 sinusoidal points and timing comparison
"""
Creates a sinusoidal dataset of 10 000 points over [0, 2π].
Interpolates using SciPy's BPoly.from_derivatives and a custom pure NumPy quintic Hermite.
Verifies the results agree (to within a tight tolerance), compares timing for coefficient computation and evaluation, and visualizes both.
"""
import sys
import numpy as np
import scipy
import time
from scipy.interpolate import BPoly

# Environment versions
print(f"Python version: {sys.version.split()[0]}")
print(f"NumPy version: {np.__version__}")
print(f"SciPy version: {scipy.__version__}")

# Generate 10 000 sample points over one period
n = 10_000
x = np.linspace(0.0, 2*np.pi, n)
# Analytical sinusoidal values and derivatives
y = np.sin(x)          # y(x)
v = np.cos(x)          # y'(x)
a = -np.sin(x)         # y''(x)

# === SciPy implementation with timing ===
y_and_derivatives = np.column_stack((y, v, a))

t0 = time.perf_counter()
bp = BPoly.from_derivatives(x, y_and_derivatives)
t1 = time.perf_counter()
scipy_coeff_time = t1 - t0

# Evaluation timing
t0 = time.perf_counter()
y_scipy = bp(x)
t1 = time.perf_counter()
scipy_eval_time = t1 - t0

# === Pure NumPy implementation ===
def compute_quintic_coeffs(x, y, v, a):
    """
    Compute quintic Hermite coefficients on each interval [x[i], x[i+1]].
    Returns coeffs of shape (6, n-1).
    """
    m = len(x) - 1
    coeffs = np.zeros((6, m))
    for i in range(m):
        h = x[i+1] - x[i]
        A = np.array([
            [1, 0,    0,      0,       0,        0],
            [0, 1,    0,      0,       0,        0],
            [0, 0,    2,      0,       0,        0],
            [1, h,  h**2,   h**3,    h**4,     h**5],
            [0, 1,  2*h,   3*h**2,  4*h**3,   5*h**4],
            [0, 0,    2,    6*h,   12*h**2,  20*h**3],
        ])
        b = np.array([y[i], v[i], a[i], y[i+1], v[i+1], a[i+1]])
        coeffs[:, i] = np.linalg.solve(A, b)
    return coeffs


def interp_quintic(x, coeffs, xx):
    """
    Evaluate quintic Hermite using precomputed coeffs.
    x: breakpoints (n,), coeffs: (6, n-1), xx: query points.
    """
    idx = np.searchsorted(x, xx) - 1
    idx = np.clip(idx, 0, len(x) - 2)
    dx = xx - x[idx]
    yy = np.zeros_like(xx)
    for j in range(6):
        yy += coeffs[j, idx] * dx**j
    return yy

# Coefficient estimation timing for custom
t0 = time.perf_counter()
custom_coeffs = compute_quintic_coeffs(x, y, v, a)
t1 = time.perf_counter()
cust_coeff_time = t1 - t0

# Evaluation timing for custom
t0 = time.perf_counter()
y_custom = interp_quintic(x, custom_coeffs, x)
t1 = time.perf_counter()
cust_eval_time = t1 - t0

# Verify the results agree within tolerance
assert np.allclose(y_scipy, y_custom, atol=1e-12), "Custom interp deviates"
print("Custom NumPy interpolation matches SciPy BPoly for 10 000 points.")

# Print timing results
print(f"SciPy coeff time: {scipy_coeff_time:.6f} s")
print(f"SciPy eval time : {scipy_eval_time:.6f} s")
print(f"Custom coeff time: {cust_coeff_time:.6f} s")
print(f"Custom eval time : {cust_eval_time:.6f} s")

# Visualization

import matplotlib.pyplot as plt
plt.plot(x, y_scipy, label='SciPy BPoly')
plt.plot(x, y_custom, '--', label='Custom NumPy')
plt.legend()
plt.title('Sinusoidal interpolation: SciPy vs NumPy Quintic')
plt.xlabel('x')
plt.ylabel('y(x) = sin(x)')
plt.show()

Solution

  • Analysis of the coefficient computations

    Regarding BPoly.from_derivatives, SciPy uses generic code with some (slow) pure-Python parts inside and several NumPy calls, each one carrying a small overhead. The NumPy functions used are also generic, so the code is sub-optimal. For example, a low-level profiler on my Linux machine shows that the slowest function is ufunc_generic_fastcall (10-12% of the time). The name is quite explicit: a generic ufunc, so clearly not the most efficient solution. More than 20% of the time is spent in (inefficient) pure-Python code. I think the internal implementation is bound by SciPy overheads.

    Meanwhile, compute_quintic_coeffs is bound by NumPy overheads. Indeed, m is roughly 10 000 in this case, so ~10_000 iterations, there are at least a dozen NumPy calls to perform in each iteration, and the code takes about 110 ms (at least on my machine with an i5-9600KF CPU). That means a bit more than 100_000 NumPy calls in 110 ms, so about 1 µs per call, which is typically the overhead of a NumPy function call on my machine. A low-level profile shows that most of the time seems to be overhead, confirming the code is inefficient. Using Numba here would help a lot to make the code faster.
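
    To give an idea of the order of magnitude, here is a minimal micro-benchmark sketch (mine, not from the original code) that times a single 6x6 np.linalg.solve in a plain Python loop; on a typical machine the per-call time is dominated by call overhead rather than the actual solve, and numbers will vary by machine:

    import time
    import numpy as np

    A = np.eye(6)
    b = np.ones(6)
    reps = 10_000

    t0 = time.perf_counter()
    for _ in range(reps):
        np.linalg.solve(A, b)   # one tiny solve per iteration, like the original loop
    t1 = time.perf_counter()
    print(f"{(t1 - t0) / reps * 1e6:.2f} µs per np.linalg.solve call on a 6x6 system")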

    So, put shortly, both implementations are very inefficient.


    Faster coefficient computation

    You can significantly improve the performance with Numba:

    import numba as nb
    import numpy as np

    @nb.njit('(float64[::1], float64[::1], float64[::1], float64[::1])')
    def compute_quintic_coeffs_nb(x, y, v, a):
        m = len(x) - 1
        coeffs = np.zeros((6, m))
    
        # Create `A` once for better performance
        A = np.zeros((6, 6))
        A[0, 0] = 1
        A[1, 1] = 1
        A[2, 2] = 2
        A[3, 0] = 1
        A[4, 1] = 1
        A[5, 2] = 2
    
        for i in range(m):
            h = x[i+1] - x[i]
    
            # Very fast setup of the `A` matrix
            h2 = h * h
            h3 = h2 * h
            h4 = h2 * h2
            h5 = h3 * h2
            A[3, 1] = h
            A[3, 2] = h2
            A[3, 3] = h3
            A[3, 4] = h4
            A[3, 5] = h5
            A[4, 2] = 2 * h
            A[4, 3] = 3 * h2
            A[4, 4] = 4 * h3
            A[4, 5] = 5 * h4
            A[5, 3] = 6 * h
            A[5, 4] = 12 * h2
            A[5, 5] = 20 * h3
    
            b = np.array([y[i], v[i], a[i], y[i+1], v[i+1], a[i+1]])
            coeffs[:, i] = np.linalg.solve(A, b)
    
        return coeffs
    

    This code takes only 7 ms, so it is about 16 times faster! Most of the time is spent in np.linalg.solve, and more specifically in native optimized BLAS functions, which is very good. You could certainly write a specialized implementation of np.linalg.solve, since most of the matrix values are either zeros or constants, not to mention the matrix is small, so the overhead of the BLAS functions might be significant. Each call to np.linalg.solve now takes about 700 ns, which is very small.
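
    To illustrate that last suggestion, here is a sketch (mine, not from the original post) of what such a specialization could look like: the 6x6 Hermite system can be solved analytically, which also makes the whole coefficient computation vectorizable in pure NumPy without any per-interval Python loop. The name compute_quintic_coeffs_closed_form is hypothetical:

    def compute_quintic_coeffs_closed_form(x, y, v, a):
        # Solve the 6x6 Hermite system analytically on every interval at once.
        # c0..c2 follow directly from the left-endpoint conditions; c3..c5 come
        # from solving the remaining 3x3 system by hand.
        h  = np.diff(x)
        r0 = y[1:] - (y[:-1] + v[:-1]*h + 0.5*a[:-1]*h**2)
        r1 = v[1:] - (v[:-1] + a[:-1]*h)
        r2 = a[1:] - a[:-1]
        coeffs = np.empty((6, len(x) - 1))
        coeffs[0] = y[:-1]
        coeffs[1] = v[:-1]
        coeffs[2] = 0.5 * a[:-1]
        coeffs[3] = (10*r0 - 4*r1*h + 0.5*r2*h**2) / h**3
        coeffs[4] = (-15*r0 + 7*r1*h - r2*h**2) / h**4
        coeffs[5] = (6*r0 - 3*r1*h + 0.5*r2*h**2) / h**5
        return coeffs

    This should produce the same coefficients as compute_quintic_coeffs up to rounding, and it removes the ~10_000 small solver calls entirely, though I have not benchmarked it against the Numba version.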

    The Numba code is fast because it is specialized for one specific case. There is no generic code except that of np.linalg.solve, which actually calls aggressively-optimized BLAS functions (focused only on performance).

    If this is not enough, you can even use multiple threads to do the computation in parallel. This is not trivial here since each thread needs to operate on its own A (threads must not write to the same A matrix), as sketched below. Still, I expect this to be about 5~6 times faster than the sequential Numba implementation on my 6-core CPU. The resulting code would be more than 80 times faster than the initial one (and take only a few milliseconds)!
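
    Here is a minimal sketch of that idea (mine, not benchmarked), assuming Numba's parallel=True and nb.prange. It simply allocates A inside each iteration so threads never share writes, trading a small allocation cost for thread safety; the name compute_quintic_coeffs_par is hypothetical:

    import numba as nb
    import numpy as np

    @nb.njit('(float64[::1], float64[::1], float64[::1], float64[::1])', parallel=True)
    def compute_quintic_coeffs_par(x, y, v, a):
        m = len(x) - 1
        coeffs = np.zeros((6, m))
        for i in nb.prange(m):
            h = x[i+1] - x[i]
            h2 = h * h
            h3 = h2 * h
            h4 = h2 * h2
            h5 = h3 * h2

            # Per-iteration matrix: no thread ever writes to another thread's A
            A = np.zeros((6, 6))
            A[0, 0] = 1.0
            A[1, 1] = 1.0
            A[2, 2] = 2.0
            A[3, 0] = 1.0
            A[3, 1] = h
            A[3, 2] = h2
            A[3, 3] = h3
            A[3, 4] = h4
            A[3, 5] = h5
            A[4, 1] = 1.0
            A[4, 2] = 2 * h
            A[4, 3] = 3 * h2
            A[4, 4] = 4 * h3
            A[4, 5] = 5 * h4
            A[5, 2] = 2.0
            A[5, 3] = 6 * h
            A[5, 4] = 12 * h2
            A[5, 5] = 20 * h3

            b = np.array([y[i], v[i], a[i], y[i+1], v[i+1], a[i+1]])
            coeffs[:, i] = np.linalg.solve(A, b)
        return coeffs

    Allocating A per iteration is the simplest way to keep threads independent; reusing one matrix per thread would avoid the allocations but requires per-thread storage.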


    Analysis of the evaluation function

    Regarding the SciPy code, most of the time is spent in the computation of pow (60%), more specifically in __ieee754_pow_fma and pow@GLIBC_2.2.5. The rest is the actual interpolation. Low-level profiling tends to indicate that the implementation is rather good. It takes 0.96 ms/call on my machine.

    Regarding the NumPy code interp_quintic, most of the time is spent in the computation of pow (~45%), which is a pity because it is not actually needed here. The binsearch takes about 15~20% of the time and the rest is due to the other computations of the code as well as NumPy internal overheads. Besides the unneeded pow, the implementation is also rather good. It takes 0.70 ms/call on my machine.

    So far, my guess is that the SciPy code is more expensive mainly because it computes terms using pow(item, 0) and pow(item, 1) without any special handling of the exponents 0 and 1. In this case, this means the exponentiation code should be about 33% slower (and more expensive). This tends to match the low-level profiling information. If SciPy optimized this case, it would only take about 80% of the current timings overall (based on profiling results). Consequently, the SciPy implementation would only be about 10% slower than NumPy, which is not so significant. If you really care about such a small performance improvement, then you should write your own native specialized implementation (possibly with Cython or Numba). That being said, it is certainly not so easy to beat NumPy in this specific case.

    The operation can be optimized further.


    Faster evaluation function implementation

    First of all, dx**0 is computed, which is a pity because it is just an array of ones. Moreover, the exponentiation can be replaced by an iterative multiplication here. Here is the resulting code:

    def interp_quintic_opt(x, coeffs, xx):
        idx = np.searchsorted(x, xx) - 1
        idx = np.clip(idx, 0, len(x) - 2)
        dx = xx - x[idx]
        dxj = dx.copy()          # dxj holds dx**j, updated by multiplication
        yy = coeffs[0, idx]      # j = 0 term (fancy indexing returns a new array)
        for j in range(1, 6):
            yy += coeffs[j, idx] * dxj
            dxj *= dx
        return yy
    

    This code takes 0.30 ms/call instead of 0.70 ms/call, so it is 2.3 times faster on my machine. Now, binsearch and mapiter_get are the bottleneck of the function (the latter is, AFAIK, the indirect indexing).
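
    If even that is not enough, a specialized native version can remove both the pow calls and the indirect indexing. Here is a sketch (mine, not from the original post, and untimed) of a Numba variant using Horner's scheme; the name interp_quintic_nb is hypothetical:

    import numba as nb
    import numpy as np

    @nb.njit
    def interp_quintic_nb(x, coeffs, xx):
        n = xx.shape[0]
        last = x.shape[0] - 2
        idx = np.searchsorted(x, xx) - 1   # interval index, as in the NumPy version
        yy = np.empty(n)
        for k in range(n):
            # Clamp the interval index to the valid range, like np.clip above
            i = idx[k]
            if i < 0:
                i = 0
            elif i > last:
                i = last
            d = xx[k] - x[i]
            # Horner evaluation of the quintic: no pow, no temporary arrays
            r = coeffs[5, i]
            for j in range(4, -1, -1):
                r = r * d + coeffs[j, i]
            yy[k] = r
        return yy

    Whether this actually beats the optimized NumPy version above will depend on the machine and the array sizes.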


    Discussion about the performance of Numpy vs Scipy

    This section is more general and based on my experience so people might disagree (comments are welcome).

    Generally, generic code tends to be far less efficient than specialized code. The latter is significantly easier to optimize. For example, if you need to support many different data types, providing an efficient specialized code path for each one is time consuming. SciPy developers tend to use Cython to speed up some parts of the code, but not everything is cythonized yet, and when it is done, it is often not optimal. The thing is, SciPy often provides many more features than NumPy, and the higher the number of features, the harder the code is to optimize (simply because of the increasing number of concerns in the same code). Meanwhile, NumPy developers focus primarily on performance, even if it means having missing features.

    On top of that, function call overheads tend to increase with the number of supported features. NumPy functions are far from cheap when computing basic things on small arrays. This is because of a complex internal dispatch and generic iterators, which need to support features like wrap-around, broadcasting, bounds checking and generic data-type support. NumPy developers specialize the code for many cases in order to make operations fast on large arrays, at the expense of more expensive function calls on small arrays. SciPy, AFAIK, often does not specialize the code because such things make the code more complex, and so harder to maintain.
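
    A quick illustration of the small-array overhead (a sketch of mine; exact numbers obviously vary by machine):

    import time
    import numpy as np

    def bench(arr, reps):
        t0 = time.perf_counter()
        for _ in range(reps):
            np.add(arr, arr)
        return (time.perf_counter() - t0) / reps

    small = np.ones(8)
    big = np.ones(1_000_000)
    # Per-call time on the tiny array is mostly dispatch overhead,
    # while the big array is dominated by the actual memory-bound work.
    print(f"np.add on 8 items:         {bench(small, 100_000) * 1e6:.2f} µs/call")
    print(f"np.add on 1_000_000 items: {bench(big, 100) * 1e6:.2f} µs/call")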

    If you care about performance, you should specialize your code for your specific use-case. You should vectorise the code so as to really avoid SciPy/NumPy function call overheads. One is certainly slower than the other in this case, but both will be pretty inefficient anyway, so you should not care much about which one is slower here. Using Numba or Cython might give you a speedup, but this is not guaranteed, because NumPy code is often so optimized that a naive native specialized implementation can be slower than a slightly more generic but aggressively optimized one. This clearly depends on the use-case though, since many parameters need to be considered. For example: can the compiler optimize the computation for the target use-case (e.g. constant propagation, inlining, common sub-expression elimination, allocation optimizations)? Is the NumPy/SciPy operation vectorised using SIMD instructions, and more generally is the computation SIMD-friendly? Are they using parallel code (only BLAS operations can run in parallel in NumPy so far)?