r multidimensional-array julia distance-matrix

Creating a pairwise distance matrix in Julia from triplets (obj, obj, distance/simularity)

I have a matrix that is in the form of "obj obj distance" a toy matrix would look like this:

r = [[1,1,1,2,2,2,3,3,3]  [1,2,3,1,2,3,1,2,3]  [0,.5,.75,.75,0,.25,.5,.25,0]]

I currently have a solution along the lines of:

using NamedArrays

function DistanceMatrixFromTriColumn(r::AbstractArray)
               Names = unique(vcat(r[:,1],r[:,2]))
               Dist = NamedArray(zeros(length(Names),length(Names)),(Names,Names))
               for row in Names
                       for col in Names
                               Dist[row,col] = r[(r[:,1] .== row) .& (r[:,2] .== col),:3][1]
                       end
               end
               return Dist
       end

DistanceMatrixFromTriColumn(r)

which is really just a nested for loop that goes through the first column, and then goes through the second column. It outputs something like this:

3×3 Named Matrix{Float64}
A ╲ B │  1.0   2.0   3.0
──────┼─────────────────
1.0   │  0.0   0.5  0.75
2.0   │ 0.75   0.0  0.25

But I think there is a faster way, there should be a way to use Indexes to do something similar, no? I think I remember seeing someone do a solution like that in R once-upon-a-time.

Solution

For the case you have posted where the objects are reliably numbered as such, you can make this quite a lot faster with something along the lines of

function distancematrixfromtricolumn(r::AbstractArray)
    names = unique(view(r,:,1))
    dist = NamedArray(zeros(length(names),length(names)),(names,names))
    for i = 1:size(r,1)
        row = Int(r[i,1])
        col = Int(r[i,2])
        dist[row,col] = r[i,3]
    end
    return dist
end

You may also notice a few stylistic changes: I have switched a few names to lowercase to follow the Julia convention that functions and variables are lowercase whereas Modules and Types are CamelCase.

Comparing the two versions:

julia> distancematrixfromtricolumn(r)
3×3 Named Matrix{Float64}
A ╲ B │  1.0   2.0   3.0
──────┼─────────────────
1.0   │  0.0   0.5  0.75
2.0   │ 0.75   0.0  0.25
3.0   │  0.5  0.25   0.0

julia> DistanceMatrixFromTriColumn(r) == distancematrixfromtricolumn(r)
true

julia> using BenchmarkTools

julia> @benchmark DistanceMatrixFromTriColumn($r)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.897 μs … 535.305 μs  ┊ GC (min … max): 0.00% … 98.80%
 Time  (median):     4.863 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.828 μs ±  16.267 μs  ┊ GC (mean ± σ):  9.92% ±  3.54%

    ▂▆▇█▇▇▅▃▁
  ▃▅█████████▇▇▇▆▅▅▆▅▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  3.9 μs          Histogram: frequency by time        10.6 μs <

 Memory estimate: 7.39 KiB, allocs estimate: 73.

julia> @benchmark distancematrixfromtricolumn($r)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.352 μs … 528.474 μs  ┊ GC (min … max): 0.00% … 99.32%
 Time  (median):     1.543 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.821 μs ±   8.510 μs  ┊ GC (mean ± σ):  8.04% ±  1.72%

   ▇█▄▁▁▁
  ███████▆▆▆▅▄▃▃▃▃▂▂▂▂▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁ ▂
  1.35 μs         Histogram: frequency by time        3.33 μs <

 Memory estimate: 2.08 KiB, allocs estimate: 25.

While there are in principle a few other things one could consider to make the for loop even faster, at this point much of our execution time and all of our allocations are coming from the call to unique and the construction of the NamedArray.