pythonscipysparse-matrixminimum-spanning-tree

Inconsistent results when using Scipy Minimum Spanning tree with sparse and dense inputs


I have the adjacency matrix of a graph stored as sparse scipy scr matrix. When I call the scipy.sparse.csgraph.minimum_spanning_tree function, my generated sparse array have too few non zero values ( less than V-1). Doing some tests with smaller datasets (just for test), I noticed that when I convert my sparse input matrix to a dense one, the function calculate a tree.

Here is my code to reproduce the behavior and my input sparse matrix :


import scipy as sp
from scipy.sparse.csgraph import  minimum_spanning_tree
from scipy.sparse import csr_matrix
sparse_matrix = sp.sparse.load_npz('sparse_matrix.npz')

MST = minimum_spanning_tree(sparse_matrix)
print(f"weight of MST: {MST.data.sum()}")

print(f'non-zero edges {MST.size}')

dense_matrix = sparse_matrix.todense()

MST2 = minimum_spanning_tree(dense_matrix)

print(f"weight of MST2: {MST2.data.sum()}")

print(f'non-zero edges {MST2.size}')

sparse_from_dense = csr_matrix(dense_matrix)
MST3 = minimum_spanning_tree(sparse_from_dense)

print(f"weight of MST3: {MST3.data.sum()}")

print(f'non-zero edges {MST3.size}')


The outputs are:

weight of MST: 1399.3271604776382 
non-zero edges 459 
weight of MST2: 1653.5667477846146 
non-zero edges 698 
weight of MST3: 1653.5667477846146 
non-zero edges 698

Does anyone know what is happening and how to get around it? Since the expected number of vertices is indeed 698, I believe the dense outputs are correct.

I tried to look for other ways to represent a matrix sparsely and calculate the MST from a sparse input but I didn't find any other alternatives.


Solution

  • Let's start by looking at the documentation for this function, and try to identify any ways in which it treats dense matrices differently from sparse matrices.

    In the Notes section, it says this:

    This routine uses undirected graphs as input and output. That is, if graph[i, j] and graph[j, i] are both zero, then nodes i and j do not have an edge connecting them. If either is nonzero, then the two are connected by the minimum nonzero value of the two.

    This routine loses precision when users input a dense matrix. Small elements < 1E-8 of the dense matrix are rounded to zero. All users should input sparse matrices if possible to avoid it.

    One possibility is that this sparse matrix contains data values smaller than 1E-8. Let's check this:

    >>> sparse_matrix = sp.sparse.load_npz('sparse_matrix.npz')
    >>> print(sparse_matrix.data[sparse_matrix.data < 1e-8])
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    

    It does - in fact it contains a bunch of zeros. In sparse matrices, there is a difference between implicit and explicit zeros. An implicit zero is any matrix element that does not appear. An explicit zero is a matrix element which is present, and set to zero.

    For most purposes in SciPy sparse matrices, these two things are equivalent. For minimum_spanning_tree(), there is a difference between explicit and implicit zeros.

    This gives us a way to solve the problem: change the explicit zeros into implicit zeros using eliminate_zeros().

    import scipy as sp
    from scipy.sparse.csgraph import  minimum_spanning_tree
    sparse_matrix = sp.sparse.load_npz('sparse_matrix.npz')
    
    sparse_matrix.eliminate_zeros()
    
    MST = minimum_spanning_tree(sparse_matrix)
    print(f"weight of MST: {MST.data.sum()}")
    print(f'non-zero edges {MST.size}')
    

    This gives your expected answer, without converting to dense first.