Related to this question:
I would like to determine optimal smoothing parameters for any kind of time series, with mixed units, varying lengths, and varying frame rates — such as pixel or meter trajectories, radian or degree angles, velocities, etc. I tried make_smoothing_spline from scipy.interpolate, but it does not work well.
I've got 2 normalized time series (mean 1e-16≈0, std=1). Both curves are almost identical, but smoothing them with make_smoothing_spline gives very different results when lam is automatically determined with the GCV criterion. One is correctly smoothed (green), the other is completely flattened (red).

I picked out the code of make_smoothing_spline and saw that there is a call to scipy.interpolate._bsplines._compute_optimal_gcv_parameter(X, wE, y, w). Calling it on the first curve gives lam = 4.9e-06, while the other gives 154.0. Note that I had similar outcomes with much more dynamic and noisy time series.
Should I simply not use normalized data? But then I have the opposite problem, with some data not being filtered at all.
And most importantly, why does it behave this way? Is it some sort of numerical precision issue? Any idea how to handle it?
Below are some data and code to reproduce the issue.
import pickle
import matplotlib.pyplot as plt
from scipy.interpolate import make_smoothing_spline
from scipy.interpolate._bsplines import _compute_optimal_gcv_parameter
# Retrieve the pickled example data: abscissae x, design matrix X,
# penalty matrix wE, ordinates y, and weights w for both curves.
with open('good_lam.pkl', 'rb') as f_good:
    x_good, X_good, wE_good, y_good, w_good = pickle.load(f_good)
with open('bad_lam.pkl', 'rb') as f_bad:
    x_bad, X_bad, wE_bad, y_bad, w_bad = pickle.load(f_bad)

# Compute make_smoothing_spline with lam chosen automatically via GCV.
spline_good = make_smoothing_spline(x_good, y_good, w=None, lam=None)
y_good_smooth = spline_good(x_good)
spline_bad = make_smoothing_spline(x_bad, y_bad, w=None, lam=None)
y_bad_smooth = spline_bad(x_bad)

# Compute lam directly with scipy's private GCV routine to inspect the values.
print('Good lam: ', _compute_optimal_gcv_parameter(X_good, wE_good, y_good, w_good))
print('Bad lam: ', _compute_optimal_gcv_parameter(X_bad, wE_bad, y_bad, w_bad))

# Plot input and smoothed data.
plt.plot(y_good, label='good', color='green')
plt.plot(y_good_smooth, label='good_smoothed', color='palegreen')
plt.plot(y_bad, label='bad', color='red')
plt.plot(y_bad_smooth, label='bad_smoothed', color='lightpink')
plt.legend()
plt.show()
One of the things that strikes me as unusual about what _compute_optimal_gcv_parameter is doing is the norm of the matrices involved. In the example you're using, for the bad dataset, the Frobenius norm of XtWX is ~7.5, but the norm of XtE is ~1.9e+08.
Then, this algorithm is computing the value
lhs = XtWX + lam * XtE
For reasonable values of lam, this will result in XtWX being essentially ignored.
For this reason, one thing that I found that helped quite a bit was to use unit-spaced x. In other words, use values of x that are exactly 1 apart. When I tried this, the norms of XtWX and XtE were 16 and 40, which I would bet is more numerically stable.
With that in mind, I would suggest fitting GCV on a re-scaled version of your problem.
def _compute_optimal_gcv_parameter_numstable(x, y):
    """Compute the GCV-optimal ``lam`` for ``make_smoothing_spline`` robustly.

    Rescales ``x`` so its mean spacing is 1 (where the design and penalty
    matrices have comparable norms), runs the GCV search there, and maps
    the optimum back to the original x scale.

    Raises
    ------
    ValueError
        If ``x`` is not strictly ascending.
    """
    x_spacing = np.diff(x)
    # Raise (not assert) so validation survives python -O; strict inequality
    # because the spline setup rejects duplicate x values anyway.
    if not (x_spacing > 0).all():
        raise ValueError("x must be strictly ascending")
    x_spacing_avg = x_spacing.mean()  # > 0 by the check above, no div-by-zero
    new_x = x / x_spacing_avg
    X, wE, y, w = get_smoothing_spline_intermediate_arrays(new_x, y)
    # Rescale the value of lambda we found back to the original problem:
    # lam scales with the cube of the x scale.
    return _compute_optimal_gcv_parameter(X, wE, y, w) * x_spacing_avg ** 3
def get_smoothing_spline_intermediate_arrays(x, y, w=None):
    # Stub: the body is the first half of make_smoothing_spline_with_bias from
    # https://stackoverflow.com/a/79736713/530160 — see the full listing below.
    # ... snip ...
    return X, wE, y, w
Note that in order to be equivalent to the X scale from your original problem, the provided value of lambda must be scaled by x_spacing_avg ** 3. I validated this approach by comparing it to a version of _compute_optimal_gcv_parameter that uses an extra loop to search small lambda values more thoroughly. Both approaches find a lambda value of approx 2.63e-08 for the bad dataset.
Full code for this example:
from scipy.interpolate import make_smoothing_spline
from scipy.interpolate._bsplines import _compute_optimal_gcv_parameter
import matplotlib.pyplot as plt
import numpy as np
from scipy._lib._array_api import array_namespace
from scipy.interpolate._bsplines import _coeff_of_divided_diff
from scipy.interpolate import BSpline
# https://gist.github.com/nickodell/c42188e2b3acc307fe04d617efc00299
data = np.load("test1106.npz")
# Unpack the good/bad example arrays from the archive by key.
(x_good, X_good, wE_good, y_good, w_good,
 x_bad, X_bad, wE_bad, y_bad, w_bad) = (
    data[key]
    for key in ('x_good', 'X_good', 'wE_good', 'y_good', 'w_good',
                'x_bad', 'X_bad', 'wE_bad', 'y_bad', 'w_bad')
)
def _compute_optimal_gcv_parameter_numstable(x, y):
    """Compute the GCV-optimal ``lam`` for ``make_smoothing_spline`` robustly.

    Rescales ``x`` so its mean spacing is 1 (where the design and penalty
    matrices have comparable norms), runs the GCV search there, and maps
    the optimum back to the original x scale.

    Raises
    ------
    ValueError
        If ``x`` is not strictly ascending.
    """
    x_spacing = np.diff(x)
    # Raise (not assert) so validation survives python -O; strict inequality
    # because get_smoothing_spline_intermediate_arrays rejects duplicates too.
    if not (x_spacing > 0).all():
        raise ValueError("x must be strictly ascending")
    x_spacing_avg = x_spacing.mean()  # > 0 by the check above, no div-by-zero
    new_x = x / x_spacing_avg
    X, wE, y, w = get_smoothing_spline_intermediate_arrays(new_x, y)
    # Rescale the value of lambda we found back to the original problem:
    # lam scales with the cube of the x scale.
    return _compute_optimal_gcv_parameter(X, wE, y, w) * x_spacing_avg ** 3
def get_smoothing_spline_intermediate_arrays(x, y, w=None):
    """Build the banded arrays consumed by scipy's GCV parameter search.

    This reproduces the preprocessing half of ``make_smoothing_spline``
    (per https://stackoverflow.com/a/79736713/530160): it returns the
    banded (5, n) B-spline design matrix ``X``, the banded (5, n) weighted
    penalty matrix ``wE``, the axis-normalised data ``y``, and the weight
    vector ``w`` — the exact inputs of ``_compute_optimal_gcv_parameter``.

    Parameters
    ----------
    x : array_like, shape (n,)
        Strictly ascending abscissae, n >= 5.
    y : array_like
        Data whose leading axis matches ``x``.
    w : array_like, optional
        Positive weights; defaults to all ones.

    Raises
    ------
    ValueError
        If ``x`` is not strictly ascending, shapes disagree, weights are
        non-positive, or fewer than 5 points are given.
    """
    axis = 0
    x = np.ascontiguousarray(x, dtype=float)
    y = np.ascontiguousarray(y, dtype=float)

    if any(x[1:] - x[:-1] <= 0):
        raise ValueError('``x`` should be an ascending array')
    if x.ndim != 1 or x.shape[0] != y.shape[axis]:
        raise ValueError(f'``x`` should be 1D and {x.shape = } == {y.shape = }')

    if w is None:
        w = np.ones(len(x))
    else:
        w = np.ascontiguousarray(w)
        if any(w <= 0):
            raise ValueError('Invalid vector of weights')

    # Cubic spline knot vector with triple-repeated boundary knots.
    t = np.r_[[x[0]] * 3, x, [x[-1]] * 3]
    n = x.shape[0]
    if n <= 4:
        raise ValueError('``x`` and ``y`` length must be at least 5')

    # Move the data axis to the front and flatten trailing axes, if any.
    y = np.moveaxis(y, axis, 0)
    y_shape1 = y.shape[1:]
    if y_shape1 != ():
        y = y.reshape((n, -1))

    # Create the design matrix in the B-spline basis, stored banded (5, n).
    X_bspl = BSpline.design_matrix(x, t, 3)
    X = np.zeros((5, n))
    for i in range(1, 4):
        X[i, 2: -2] = X_bspl[i: i - 4, 3: -3][np.diag_indices(n - 4)]
    # The two columns at each boundary are filled in explicitly.
    X[1, 1] = X_bspl[0, 0]
    X[2, :2] = ((x[2] + x[1] - 2 * x[0]) * X_bspl[0, 0],
                X_bspl[1, 1] + X_bspl[1, 2])
    X[3, :2] = ((x[2] - x[0]) * X_bspl[1, 1], X_bspl[2, 2])
    X[1, -2:] = (X_bspl[-3, -3], (x[-1] - x[-3]) * X_bspl[-2, -2])
    X[2, -2:] = (X_bspl[-2, -3] + X_bspl[-2, -2],
                 (2 * x[-1] - x[-2] - x[-3]) * X_bspl[-1, -1])
    X[3, -2] = X_bspl[-1, -1]

    # Weighted penalty matrix built from divided-difference coefficients,
    # also stored banded (5, n); boundary columns again handled separately.
    wE = np.zeros((5, n))
    wE[2:, 0] = _coeff_of_divided_diff(x[:3]) / w[:3]
    wE[1:, 1] = _coeff_of_divided_diff(x[:4]) / w[:4]
    for j in range(2, n - 2):
        wE[:, j] = (x[j+2] - x[j-2]) * _coeff_of_divided_diff(x[j-2:j+3]) \
            / w[j-2: j+3]
    wE[:-1, -2] = -_coeff_of_divided_diff(x[-4:]) / w[-4:]
    wE[:-2, -1] = _coeff_of_divided_diff(x[-3:]) / w[-3:]
    wE *= 6

    return X, wE, y, w
# Fit both curves with make_smoothing_spline's built-in GCV choice of lam.
spline_good = make_smoothing_spline(x_good, y_good, w=None, lam=None)
y_good_smooth = spline_good(x_good)
spline_bad = make_smoothing_spline(x_bad, y_bad, w=None, lam=None)
y_bad_smooth = spline_bad(x_bad)
# Compare scipy's lam values against the rescaled (numerically stable) one.
print('Good lam: ', _compute_optimal_gcv_parameter(X_good, wE_good, y_good, w_good))
print('Bad lam: ', _compute_optimal_gcv_parameter(X_bad, wE_bad, y_bad, w_bad))
newbad_lam = _compute_optimal_gcv_parameter_numstable(x_bad, y_bad)
print('New bad lam: ', newbad_lam)
# Refit the bad curve with the rescaled lam.
spline_newbad = make_smoothing_spline(x_bad, y_bad, w=None, lam=newbad_lam)
y_newbad_smooth = spline_newbad(x_bad)
# Plot input and smoothed data: good fit, original bad fit, corrected bad fit.
plt.plot(x_good, y_good, label='good', color='green')
plt.plot(x_good, y_good_smooth, label='good_smoothed', color='palegreen')
plt.legend()
plt.show()
plt.plot(x_bad, y_bad, label='bad', color='red')
plt.plot(x_bad, y_bad_smooth, label='bad_smoothed', color='lightpink')
plt.legend()
plt.show()
plt.plot(x_bad, y_bad, label='bad', color='red')
plt.plot(x_bad, y_newbad_smooth, label='bad_new_x_smoothed', color='orange')
plt.legend()
plt.show()