Fast calculation of Nth generalized Fibonacci number of order K?

How can I calculate Nth term in Fibonacci sequence of order K efficiently?

For example, Tribonacci is Fibonacci order 3, Tetranacci is Fibonacci order 4, Pentanacci is Fibonacci order 5, and Hexanacci is Fibonacci order 6 et cetera.

I define these series as follows, for order K, A₀ = 1, A_i = 2^i-1 (i ∈ [1, k]), A_k+1 = 2^k - 1, A_i+1 = 2A_i - A_i-k-1 (i >= k + 1).

For example, the sequences Fibonacci to Nonanacci are:

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040]
[1, 1, 2, 4, 7, 13, 24, 44, 81, 149, 274, 504, 927, 1705, 3136, 5768, 10609, 19513, 35890, 66012, 121415, 223317, 410744, 755476, 1389537, 2555757, 4700770, 8646064, 15902591, 29249425]
[1, 1, 2, 4, 8, 15, 29, 56, 108, 208, 401, 773, 1490, 2872, 5536, 10671, 20569, 39648, 76424, 147312, 283953, 547337, 1055026, 2033628, 3919944, 7555935, 14564533, 28074040, 54114452, 104308960]
[1, 1, 2, 4, 8, 16, 31, 61, 120, 236, 464, 912, 1793, 3525, 6930, 13624, 26784, 52656, 103519, 203513, 400096, 786568, 1546352, 3040048, 5976577, 11749641, 23099186, 45411804, 89277256, 175514464]
[1, 1, 2, 4, 8, 16, 32, 63, 125, 248, 492, 976, 1936, 3840, 7617, 15109, 29970, 59448, 117920, 233904, 463968, 920319, 1825529, 3621088, 7182728, 14247536, 28261168, 56058368, 111196417, 220567305]
[1, 1, 2, 4, 8, 16, 32, 64, 127, 253, 504, 1004, 2000, 3984, 7936, 15808, 31489, 62725, 124946, 248888, 495776, 987568, 1967200, 3918592, 7805695, 15548665, 30972384, 61695880, 122895984, 244804400]
[1, 1, 2, 4, 8, 16, 32, 64, 128, 255, 509, 1016, 2028, 4048, 8080, 16128, 32192, 64256, 128257, 256005, 510994, 1019960, 2035872, 4063664, 8111200, 16190208, 32316160, 64504063, 128752121, 256993248]
[1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 511, 1021, 2040, 4076, 8144, 16272, 32512, 64960, 129792, 259328, 518145, 1035269, 2068498, 4132920, 8257696, 16499120, 32965728, 65866496, 131603200, 262947072]

Now, I am well aware of fast algorithms to calculate Nth Fibonacci number of order 2:

def fibonacci_fast(n: int) -> int:
    a, b = 0, 1
    bit = 1 << (n.bit_length() - 1) if n else 0

    while bit:
        a2 = a * a
        a, b = 2 * a * b + a2, b * b + a2
        if n & bit:
            a, b = a + b, a

        bit >>= 1

    return a

def matrix_mult_quad(
    a: int, b: int, c: int, d: int, e: int, f: int, g: int, h: int
) -> tuple[int, int, int, int]:
    return (
        a * e + b * g,
        a * f + b * h,
        c * e + d * g,
        c * f + d * h,
    )


def fibonacci_binet(n: int) -> int:
    a, b = 1, 1
    bit = 1 << (n.bit_length() - 2) if n else 0
    while bit:
        a, b = (a * a + 5 * b * b) >> 1, a * b
        if n & bit:
            a, b = (a + 5 * b) >> 1, (a + b) >> 1

        bit >>= 1

    return b


def fibonacci_matrix(n: int) -> int:
    if not n:
        return 0

    a, b, c, d = 1, 0, 0, 1
    e, f, g, h = 1, 1, 1, 0
    n -= 1
    while n:
        if n & 1:
            a, b, c, d = matrix_mult_quad(a, b, c, d, e, f, g, h)

        e, f, g, h = matrix_mult_quad(e, f, g, h, e, f, g, h)
        n >>= 1

    return a

In [591]: %timeit fibonacci_matrix(16384)
751 μs ± 4.74 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [592]: %timeit fibonacci_binet(16384)
132 μs ± 305 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [593]: %timeit fibonacci_fast(16384)
114 μs ± 966 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But these of course only deal with Fibonacci-2 sequence, they can't be used to calculate Nth term in higher order Fibonacci sequences.

In particular, only the cubic equation for Tribonacci: x³ - x² - x - 1 = 0 and the quartic equation for Tetranacci: x⁴ - x³ - x² - x - 1 = 0 have solutions that can be found algebraically, the quintic equation x⁵ - x⁴ - x³ - x² - x - 1 = 0 has solutions that can't be found with sympy, so the fast doubling method only works with Fibonacci, Tribonacci and Tetranacci.

But I know of two ways to compute higher orders of Fibonacci sequences, the first can be used to efficiently generate all first N Fibonacci numbers of order K and has time complexity of O(n), the second is matrix exponentiation by squaring and has time complexity of O(log₂n) * O(k³).

First, we only need two numbers to get the next Fibonacci number of order K, the equation is given above, we only need one left shift and one subtraction for each term.

Matrix = list[int]

def onacci_fast(n: int, order: int) -> Matrix:
    if n <= order + 1:
        return [1] + [1 << i for i in range(n - 1)]

    last = n - 1
    result = [1] + [1 << i for i in range(order + 1)] + [0] * (last - order - 1)
    result[start := order + 1] -= 1
    for a, b in zip(range(start, last), range(1, last - order)):
        result[a + 1] = (result[a] << 1) - result[b]

    return result

ONACCI_MATRICES = {}
IDENTITIES = {}


def onacci_matrix(n: int) -> Matrix:
    if matrix := ONACCI_MATRICES.get(n):
        return matrix

    mat = [1] * n + [0] * (n * (n - 1))
    for i in range(1, n):
        mat[i * n + i - 1] = 1

    ONACCI_MATRICES[n] = mat
    return mat


def onacci_pow(n: int, k: int) -> np.ndarray:
    base = np.zeros((k, k), dtype=np.uint64)
    base[0] = 1
    for i in range(1, k):
        base[i, i - 1] = 1

    prod = np.zeros((k, k), dtype=np.uint64)
    for i in range(k):
        prod[i, i] = 1

    return [(prod := prod @ base) for _ in range(n)]


def identity_matrix(n: int) -> Matrix:
    if matrix := IDENTITIES.get(n):
        return matrix

    result = [0] * n**2
    for i in range(n):
        result[i * n + i] = 1

    IDENTITIES[n] = result
    return result


def mat_mult(mat_1: Matrix, mat_2: Matrix, side: int) -> Matrix:
    # sourcery skip: use-itertools-product
    result = [0] * (square := side**2)
    for y in range(0, square, side):
        for x in range(side):
            e = mat_1[y + x]
            for z in range(side):
                result[y + z] += mat_2[x * side + z] * e

    return result


def mat_pow(matrix: Matrix, power: int, n: int) -> Matrix:
    result = identity_matrix(n)
    while power:
        if power & 1:
            result = mat_mult(result, matrix, n)

        matrix = mat_mult(matrix, matrix, n)
        power >>= 1

    return result


def onacci_nth(n: int, k: int) -> int:
    return mat_pow(onacci_matrix(k), n, k)[0]

In [621]: %timeit onacci_nth(16384, 2)
822 μs ± 5.88 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [622]: %timeit onacci_fast(16384, 2)
13.8 ms ± 92.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [623]: %timeit onacci_fast(16384, 3)
16 ms ± 63.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [624]: %timeit onacci_fast(16384, 4)
17 ms ± 66.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [625]: %timeit onacci_nth(16384, 3)
4.02 ms ± 32 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [626]: %timeit onacci_nth(16384, 4)
10.9 ms ± 71 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [627]: %timeit onacci_nth(16384, 5)
22.5 ms ± 632 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [628]: %timeit onacci_nth(16384, 6)
39.4 ms ± 314 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [629]: %timeit onacci_fast(16384, 6)
17.5 ms ± 27.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [630]: %timeit onacci_fast(16384, 7)
17.6 ms ± 115 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [631]: %timeit onacci_nth(16384, 7)
62.7 ms ± 347 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [632]: %timeit onacci_nth(32768, 16)
2.29 s ± 5.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [633]: %timeit onacci_fast(32768, 16)
56.2 ms ± 271 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In the easiest case, onacci_nth(16384, 2) is significantly slower than fibonacci_matrix despite using exactly the same method, because of added overhead of using lists. And although the while loop has log₂n iterations, the matrices are of size k² and for each cell k multiplications and k - 1 additions have to be performed, for a total of k³ multiplications and k³ - k² additions each iteration, this cost grows very quickly, so the matrix exponentiation method is outperformed by the linear iterative method very quickly, because although the iterative method has more iterations, each iteration is far cheaper.

The matrix exponentiation uses all of the k*k matrix but we need fewer numbers. The following are the state transitions for Hexanacci:

[array([[1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0]], dtype=uint64),
 array([[2, 2, 2, 2, 2, 1],
        [1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0]], dtype=uint64),
 array([[4, 4, 4, 4, 3, 2],
        [2, 2, 2, 2, 2, 1],
        [1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0]], dtype=uint64),
 array([[8, 8, 8, 7, 6, 4],
        [4, 4, 4, 4, 3, 2],
        [2, 2, 2, 2, 2, 1],
        [1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0]], dtype=uint64),
 array([[16, 16, 15, 14, 12,  8],
        [ 8,  8,  8,  7,  6,  4],
        [ 4,  4,  4,  4,  3,  2],
        [ 2,  2,  2,  2,  2,  1],
        [ 1,  1,  1,  1,  1,  1],
        [ 1,  0,  0,  0,  0,  0]], dtype=uint64),
 array([[32, 31, 30, 28, 24, 16],
        [16, 16, 15, 14, 12,  8],
        [ 8,  8,  8,  7,  6,  4],
        [ 4,  4,  4,  4,  3,  2],
        [ 2,  2,  2,  2,  2,  1],
        [ 1,  1,  1,  1,  1,  1]], dtype=uint64),
 array([[63, 62, 60, 56, 48, 32],
        [32, 31, 30, 28, 24, 16],
        [16, 16, 15, 14, 12,  8],
        [ 8,  8,  8,  7,  6,  4],
        [ 4,  4,  4,  4,  3,  2],
        [ 2,  2,  2,  2,  2,  1]], dtype=uint64),
 array([[125, 123, 119, 111,  95,  63],
        [ 63,  62,  60,  56,  48,  32],
        [ 32,  31,  30,  28,  24,  16],
        [ 16,  16,  15,  14,  12,   8],
        [  8,   8,   8,   7,   6,   4],
        [  4,   4,   4,   4,   3,   2]], dtype=uint64),
 array([[248, 244, 236, 220, 188, 125],
        [125, 123, 119, 111,  95,  63],
        [ 63,  62,  60,  56,  48,  32],
        [ 32,  31,  30,  28,  24,  16],
        [ 16,  16,  15,  14,  12,   8],
        [  8,   8,   8,   7,   6,   4]], dtype=uint64),
 array([[492, 484, 468, 436, 373, 248],
        [248, 244, 236, 220, 188, 125],
        [125, 123, 119, 111,  95,  63],
        [ 63,  62,  60,  56,  48,  32],
        [ 32,  31,  30,  28,  24,  16],
        [ 16,  16,  15,  14,  12,   8]], dtype=uint64)]

Now, for each state, we can shift the rows down by 1 to get the lower k - 1 rows of the next state, and the top row can be obtained by adding the first number to the second number and put in first position, adding the first number to the third number and put to second position, adding first to fourth put to third... put the first number to last position.

Or in Python:

def next_onacci(arr: np.ndarray) -> np.ndarray:
    a = arr[0, 0]
    return np.concatenate([[[a + b for b in arr[0, 1:]] + [a]], arr[:-1]])

So we only need k numbers to get the next state. But we can do better, the Nth term is found by raising the matrix to the Nth and accessing mat[0, 0]. We can use mat[0, 0] + mat[0, 1] to get the next mat[0, 0], or we can use 2 * mat[0, 0] - mat[-1, -1].

But I have only found ways to calculate the numbers using these relationships linearly, I can't use them to do exponentiation by squaring.

Is there a faster way to compute Nth term of higher order Fibonacci sequence?

Of course onacci_pow overflows extremely quickly, so it is absolutely useless to calculate the terms for large N. And so I don't use it to calculate terms for large N. I have implemented my own matrix multiplication and exponentiation to calculate with infinite precision.

onacci_pow is used to verify the correctness of my matrix multiplication for small N. onacci_pow is correct as long as it doesn't overflow, and I know exactly on which power it overflows.

And for the values of N and K, assume K is between 25 and 100, and N is between 16384 and 1048576.

Solution

Yes, there is a faster way to compute this.

Matrix Diagonalisation

The matrix exponentiation can be optimised if the matrix is diagonalisable.

It turns out it is diagonalisable. Indeed, we know that the n-order Fibonacci matrix M is a square matrix and all its rows and columns are linearly independent (by design). This means the rank of the matrix is n (i.e. full-rank matrix). Here is a (more complex) alternative explanation: the Spectral theorem states that a normal matrix can be diagonalised by an orthonormal basis of eigenvectors. Since we can prove that M @ M.T is equal to M.T @ M so M is a normal matrix and so it can be diagonalised.

This means M = Q @ L @ inv(Q) where Q contains the eigenvectors (one per column) and L is a diagonal matrix containing the eigenvalues (one on each item of the diagonal) (see singular value decomposition, a.k.a. SVD, and eigen-decomposition of a matrix for more information). We know that all eigenvalues are non-zeros ones since the matrix is full-rank.

We can compute that in Numpy with L, Q = np.linalg.eig(M). Note that the eigen values and eigenvector can contain complex numbers. We can check that everything so far works as expected with:

D = np.diag(L)
P = np.linalg.inv(Q)
T = Q @ D @ P
np.allclose(M, T)

We can now compute Mⁿ more efficiently than with square exponentiation. Indeed:

Mⁿ = (Q D P)ⁿ
   = (Q D P) (Q D P) (Q D P) ... (Q D P)
   = Q D (P Q) D (P Q) D (P  ...  Q) D P
   = Q D I D I D (I  ...  I) D P                        since P @ Q = I
   = Q D D D ... D P
   = Q Dⁿ P

Q @ Dⁿ @ P can be computed in O(k log n + k³). Indeed, Dⁿ is actually equal to D**n (item-wise exponentiation) in Numpy because D is a diagonal matrix. In fact, we can compute Mⁿ = Q @ Dⁿ @ P with:

Dn = np.diag(L**n)
P = np.linalg.inv(Q)
Mn = Q @ Dn @ P

This method is asymptotically about O(k²) times faster than the O(k³ log n). Actually, for fixed-size floating-point (FP) numbers, it is done in O(k³) time!

Unfortunately, fixed-sized FP numbers are certainly not so accurate for big numbers especially for high-order Fibonacci computation (i.e. performing a SVD computation of large matrices). This method can be seen as a very good approximation for the parameter you use for k < 100. That being said, the computation of L**n fails for values n > 16384 (actually even for much smaller values of n). This is especially true when L contains complex numbers. You can write your own eigen-decomposition of a matrix with arbitrary-large FP numbers (e.g. using the decimal package) though this is tedious to write and the code performance will certainly be far from being optimal (despite the algorithm is efficient). Sympy and mpmath might help for that.

Note on arbitrary-large numbers

Considering arbitrary-large FP/integer numbers, I expect the computation of the n-order Fibonacci term to be more expensive than expected first. Indeed, AFAIK F(n) is a number having O(n) bits! This means the above method is actually O(n) more expensive with such large numbers assuming basic operations are done in O(n) time. This is the case for the addition but not the multiplication. The native CPython multiplication of large numbers uses the Karatsuba algorithm running in about O(n**1.6) time, while decimal.Decimal uses the Number Theoretic Transform algorithm apparently running in O(n log n) time (thanks to @KellyBundy for pointing that out). The optimal complexity for a multiplication of n-bit numbers is O(n log n). As of this sentence, I will assume a O(n log n) algorithm is used for multiplying arbitrary-large numbers (so a code using decimal.Decimal for example). In fact, this also applies to the other algorithm provided in the question: you missed the fact that numbers can be huge!

Considering that, the above algorithm should run in O(k n log² n + n k³) with arbitrary-large FP numbers and considering a O(n log n) multiplication time (for the n-bit numbers).

Improving the diagonalisation-based algorithm

Fortunately, we do not need to compute the whole matrix Mn = Q @ Dn @ P but just a single item. We can replace the last matrix multiplication with a simple dot product. Thus, we only need a column of P and a single line of Q @ Dn. The later can be computed more efficiently since Dn is a diagonal matrix:

Mn[i, j] = np.dot(Q[i] * L**n, P[:,j])
         = np.sum(Q[i] * L**n * P[:,j])
         = np.sum(Q[i] * P[:,j] * L**n)

While we can compute Q[i] * P[:,j] efficiently independently of n, we certainly need a very-good precision for this computation since L**n is huge for relatively-big value of n.

This version should still run in O(k log n + k³) for fixed-size numbers (L**n is the expensive part) and it should run in O(k n log² n) for arbitrary-large FP numbers.

I think the complexity of the best algorithm (using arbitrary-large integers) is Ω(k n) since the k-order Fibonacci needs has k terms with a precision of n-bits. This assume the terms are not trivial though. Assuming that, this means this algorithm is actually pretty good already.

Possible further optimisations

One way to improve the diagonalisation-based algorithm further may be to algebraically compute the k-order Fibonacci matrix terms (typically using sympy assuming it can be powerful enough). You might see some terms cancelling each other or some pattern enabling further optimisations. You can find an example here for the classical Fibonacci (with 2x2 matrices). I am not very optimistic since the 2-order algebraic solution found contains 2 irreducible terms, so I expect at least k terms with a k-order Fibonacci. Besides, I think the algorithmic solution found is less numerically stable than the above approach due to a catastrophic cancellation (i.e. ϕⁿ−ψⁿ).