python, numpy, pagerank

How to vectorize advanced indexing with list of lists in NumPy?


The following code runs in 45s when using pure Python.

# push each node's rank to its neighbours, scaled by beta / out-degree
for iteration in range(maxiter):
    for node in range(n):
        for dest in adjacency_list[node]:
            rs[iteration + 1][dest] += beta * rs[iteration][node] / len(adjacency_list[node])

But by simply initializing rs as a NumPy ndarray instead of a Python list of lists, the code runs in 145s. I do not really know why NumPy takes about three times as long with this kind of array indexing.

My idea was to vectorize as much as possible, but I have only managed to vectorize the multiplication by beta / len(adjacency_list[node]). This code runs in 77s.

beta_over_out_degree = np.array([beta / len(al) for al in adjacency_list])
for iteration in range(1, maxiter + 1):
    r_next = np.full(shape=n, fill_value=(1 - beta) / n)
    f = beta_over_out_degree * r
    # scatter each node's contribution to all of its destinations
    for i in range(n):
        r_next[adjacency_list[i]] += f[i]

    r = np.copy(r_next)
    rs[iteration] = np.copy(r)

The problem is that adjacency_list is a list of lists with rows of differing length: 100 000 rows with 1-15 entries each. A more standard approach with an adjacency matrix, at least as a dense ndarray, is not an option, since for n = 100 000 its (n, n) shape is far too large to allocate in memory (100 000² float64 values come to roughly 80 GB).

Is there any way to vectorize this with NumPy advanced indexing using those indices (maybe by turning adjacency_list into a NumPy ndarray)?

I would also greatly appreciate any other speed tips. Thanks in advance!

EDIT: Thanks to @stevemo, I managed to build adjacency_matrix with scipy.sparse.csr_matrix and used it for iterative multiplication. The program now runs in only 2s!

for iteration in range(1, 101):
    rs[iteration] += rs[iteration - 1] * adjacency_matrix
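
For completeness, here is one way the sparse matrix might be built from the ragged adjacency_list. This is a sketch of an assumed construction, not necessarily the exact one used, and it assumes that every row of rs except rs[0] is pre-filled with the teleport term (1 - beta) / n before the loop above.

import numpy as np
from scipy.sparse import csr_matrix

# adjacency_list[i] holds the destination nodes of node i
out_degree = np.array([len(dests) for dests in adjacency_list])

# one entry per edge: row = source node, column = destination,
# value = beta / out-degree of the source
rows = np.repeat(np.arange(n), out_degree)
cols = np.concatenate(adjacency_list)
vals = np.repeat(beta / out_degree, out_degree)

adjacency_matrix = csr_matrix((vals, (rows, cols)), shape=(n, n))

With this layout, rs[iteration - 1] * adjacency_matrix is a row-vector-times-matrix product, which reproduces the scatter of the original triple loop.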

Solution

  • If I understand you correctly, this can be done with a one-liner formula using matrix powers of the adjacency matrix.

    Based on your original code snippet, it seems you have a network of n nodes, with the adjacency information stored as a list of lists in adjacency_list, and a value r associated with each node such that its value at iteration k+1 is beta times the sum of the r values of its neighbors at iteration k. (Your loop builds this in the opposite direction, scattering outgoing contributions instead of gathering incoming ones, but it amounts to the same thing.)

    If you don't mind reforming your adjacency list-of-lists into a more standard adjacency matrix A, such that A_ij = 1 if i and j are neighbors and 0 otherwise, then you could accomplish the inner two loops with a simple matrix product, r[k+1] = beta * (A @ r[k]).

    And following this logic, r[k+2] = beta * (A @ (beta * (A @ r[k]))) = (beta * A)**2 @ r[k], or in general,

    r[k] = (beta * A)**k @ r[0]
    

    Let's try this on a small network:

    import numpy as np

    # adjacency matrix
    A = np.array([
        [0, 1, 1, 0, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 1, 0, 1],
        [0, 0, 0, 1, 0]
    ])
    
    # initial values
    n = 5
    beta = 0.5
    r0 = np.ones(n)
    maxiter = 10
    
    # after one iteration
    print(beta * (A @ r0))
    # [1.  1.  1.5 1.  0.5]
    
    # after 10 iterations
    print(np.linalg.matrix_power((beta * A), maxiter) @ r0)
    # [2.88574219 2.88574219 3.4921875  1.99414062 0.89257812]
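
    For a network as large as the one in the question (n = 100 000), A cannot be stored densely, but the same recurrence works with a sparse matrix and repeated matrix-vector products instead of a dense matrix power. A minimal sketch, reusing the small A from above (for the real problem the matrix would be built directly in sparse form, e.g. with csr_matrix):

    from scipy.sparse import csr_matrix

    A_sparse = csr_matrix(A)
    r = r0.copy()
    for _ in range(maxiter):
        # sparse matrix-vector product, cost proportional to the number of edges
        r = beta * (A_sparse @ r)
    print(r)
    # should match the matrix-power result above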