optimizationjuliasparse-matrixsparse-array

Optimising Sparse Array Math


I have a sparse array: term_doc

The operator I would like to perform can be described as returning the row normalized and column normalized versions this matrix.

The Naive nonsparse version, I wrote is:

function doUnsparseWay()
    gc() #Force Garbage collect before I start (and periodically during). This uses alot of memory
    term_doc

    N = term_doc./sum(term_doc,1)
    println("N done")  

    gc()
    P = term_doc./sum(term_doc,2)    
    println("P done")
    gc()

    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0

    N,P,term_doc
end

Running this:

> @time N,P,term_doc= doUnsparseWay()
outputs:
N done
P done
elapsed time: 30.97332475 seconds (14466 MB allocated, 5.15% gc time in 13 pauses with 3 full sweep)

It is fairly simple. It chews memory, and will crash if the garbage collection does not occur at the right times (Thus I call it manually). But it is fairly fast


I wanted to get it to work on the sparse matrix. So as not to chew my memory out, and because logically it is a faster operation -- less cells need operating on.

I followed suggestions from this post and from the performance page of the docs.

function doSparseWay()
    term_doc::SparseMatrixCSC{Float64,Int64}

    N= spzeros(size(term_doc)...)
    N::SparseMatrixCSC{Float64,Int64}

    for (doc,total_terms::Float64) in enumerate(sum(term_doc,1))
        if total_terms == 0
            continue
        end
        @fastmath @inbounds N[:,doc] = term_doc[:,doc]./total_terms
    end
    println("N done")  

    P = spzeros(size(term_doc)...)'
    P::SparseMatrixCSC{Float64,Int64}

    gfs = sum(term_doc,2)[:]
    gfs::Array{Float64,1} 


    nterms = size(term_doc,1)
    nterms::Int64 
    term_doc = term_doc'

    @inbounds @simd for term in 1:nterms
        @fastmath @inbounds P[:,term] = term_doc[:,term]/gfs[term]
    end
    println("P done")
    P=P'


    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0

    N,P,term_doc
end

It never completes. It gets up to outputting "N Done", but never outputs "P Done". I have left it running for several hours.


Solution

  • First, you're making term_doc a global variable, which is a big problem for performance. Pass it as an argument, doSparseWay(term_doc::SparseMatrixCSC). (The type annotation at the beginning of your function does not do anything useful.)

    You want to use an approach similar to the answer by walnuss:

    function doSparseWay(term_doc::SparseMatrixCSC)
        I, J, V = findnz(term_doc)
        normI = sum(term_doc, 1)
        normJ = sum(term_doc, 2)
        NV = similar(V)
        PV = similar(V)
        for idx = 1:length(V)
            NV[idx] = V[idx]/normI[J[idx]]
            PV[idx] = V[idx]/normJ[I[idx]]
        end
        m, n = size(term_doc)
        sparse(I, J, NV, m, n), sparse(I, J, PV, m, n), term_doc
    end
    

    This is a general pattern: when you want to optimize something for sparse matrices, extract the I, J, V and perform all your computations on V.