dataframe julia dot-product

Increasing the performance of a row-wise dot product calculation in a Julia DataFrame


I have four different ways of calculating a row-wise dot product in a Julia DataFrame.

using Random, DataFrames, BenchmarkTools, LinearAlgebra

df = DataFrame(rand(Float64, (6,6)), :auto)

a(x) = dot.(eachrow((x[:, Cols(Between(:x1, :x2))])), eachrow((x[:, Cols(Between(:x3, :x4))])))
b(x) = diag(Matrix(x[:, Cols(Between(:x1, :x2))]) * Matrix(x[:, Cols(Between(:x3, :x4))])')
c(x) = select(x, Between(:x1, :x4) => ByRow((x1, x2, x3, x4) -> dot([x1,x2], [x3, x4])))
d(x) = transform(x, Between(:x1, :x4) => ByRow((x1, x2, x3, x4) -> dot([x1,x2], [x3, x4])))

@btime a(df);
@btime b(df);
@btime c(df);
@btime d(df);

I much prefer the last two (c(), d()) as I personally find them easier to read. Unfortunately, they are much slower (the real data frame I am working with is much larger than the example provided). I would like to know if there is a trick I am missing in the last two implementations.


Solution

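  • When I run your code on a larger data frame, the ranking changes, because subsetting the data frame carries some overhead. It can be sped up further by indexing the data frame so that it returns column vectors (as in h and j below). For the eachrow-based variants (f, g), I get the fastest runs when using AxisArrays; overall, the column-wise j is the fastest in the timings below.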

    using Random, DataFrames, BenchmarkTools, LinearAlgebra, TypedTables, NamedArrays, AxisArrays
    
    a(x) = dot.(eachrow((x[:, Cols(Between(:x1, :x2))])), eachrow((x[:, Cols(Between(:x3, :x4))])))
    b(x) = diag(Matrix(x[:, Cols(Between(:x1, :x2))]) * Matrix(x[:, Cols(Between(:x3, :x4))])')
    c(x) = select(x, Between(:x1, :x4) => ByRow((x1, x2, x3, x4) -> dot([x1,x2], [x3, x4])))
    d(x) = transform(x, Between(:x1, :x4) => ByRow((x1, x2, x3, x4) -> dot([x1,x2], [x3, x4])))
    e(x) = map(row -> dot([row.x1, row.x2], [row.x3, row.x4]), Table(x))  # @user9712582
    f(x) = dot.(eachrow(x[:,1:2]), eachrow(x[:,3:4]))
    g(x) = dot.(eachrow(x[:,[:x1, :x2]]), eachrow(x[:,[:x3, :x4]]))
    h(x) = dot.(eachrow([x[:,:x1] x[:,:x2]]), eachrow([x[:,:x3] x[:,:x4]]))
    i(x) = sum(x[:, 1:2] .* x[:, 3:4], dims=2)
    j(x) = x[:, 1] .* x[:, 3] .+ x[:, 2] .* x[:, 4]
    
    n = 100000  # Number of Rows
    y = rand(Float64, (n,6))
    df = DataFrame(y, :auto)
    na = NamedArray(y, (1:n, propertynames(df)))
    aa = AxisArray(y, 1:n, propertynames(df))
    
    @btime a($df);  # 76.350 ms (2298034 allocations: 44.99 MiB)
    @btime b($df);  # OutOfMemoryError(): b builds an n×n matrix before taking diag
    @btime c($df);  # 6.536 ms (200271 allocations: 16.03 MiB)
    @btime d($df);  # 7.055 ms (200300 allocations: 20.61 MiB)
    @btime e($df);  # 6.503 ms (200018 allocations: 16.02 MiB)
    @btime f($df);  # 75.219 ms (2298009 allocations: 44.99 MiB)
    @btime f($na);  # 1.754 s (6299125 allocations: 343.37 MiB)
    @btime f($aa);  # 1.520 ms (8 allocations: 3.81 MiB)
    @btime f($y);   # 2.133 ms (6 allocations: 3.81 MiB)
    @btime g($df);  # 75.158 ms (2298013 allocations: 44.99 MiB)
    @btime g($na);  # 1.746 s (6299121 allocations: 343.37 MiB)
    @btime g($aa);  # 1.540 ms (22 allocations: 3.82 MiB)
    @btime h($df);  # 2.578 ms (17 allocations: 6.87 MiB)
    @btime h($na);  # 1.751 s (6299278 allocations: 364.03 MiB)
    @btime h($aa);  # 2.600 ms (14 allocations: 6.87 MiB)
    @btime i($y);   # 1.417 ms (12 allocations: 5.34 MiB)
    @btime j($df);  # 492.200 μs (13 allocations: 3.82 MiB)
    @btime j($y);   # 637.600 μs (10 allocations: 3.81 MiB)
    @btime j($aa);  # 632.800 μs (10 allocations: 3.81 MiB)
    
    n = 6
    y = rand(Float64, (n,6))
    df = DataFrame(y, :auto)
    na = NamedArray(y, (1:n, propertynames(df)))
    aa = AxisArray(y, 1:n, propertynames(df))
    
    @btime a($df);  # 7.967 μs (155 allocations: 6.41 KiB)
    @btime b($df);  # 5.400 μs (63 allocations: 5.11 KiB)
    @btime c($df);  # 99.000 μs (280 allocations: 13.92 KiB)
    @btime d($df);  # 102.500 μs (298 allocations: 15.00 KiB)
    @btime e($df);  # 3.913 μs (29 allocations: 1.88 KiB)
    @btime f($df);  # 7.050 μs (130 allocations: 5.06 KiB)
    @btime f($na);  # 104.900 μs (448 allocations: 25.45 KiB)
    @btime f($aa);  # 373.913 ns (5 allocations: 560 bytes)
    @btime f($y);   # 255.890 ns (3 allocations: 432 bytes)
    @btime g($df);  # 7.250 μs (134 allocations: 5.34 KiB)
    @btime g($na);  # 104.900 μs (444 allocations: 25.39 KiB)
    @btime g($aa);  # 980.000 ns (19 allocations: 1.70 KiB)
    @btime h($df);  # 703.448 ns (10 allocations: 1008 bytes)
    @btime h($na);  # 180.800 μs (531 allocations: 30.09 KiB)
    @btime h($aa);  # 524.084 ns (7 allocations: 880 bytes)
    @btime i($y);   # 348.131 ns (8 allocations: 672 bytes)
    @btime j($df);  # 361.244 ns (8 allocations: 672 bytes)
    @btime j($y);   # 159.030 ns (5 allocations: 560 bytes)
    @btime j($aa);  # 263.174 ns (5 allocations: 560 bytes)
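
If the select/transform style from the question is preferred for readability, the column-wise idea behind j can probably be carried over to it. The following is a minimal sketch, not a benchmarked result: the function names k and l and the output column name :dot are hypothetical and only continue the lettering above. In the timings for n = 100000, c and d show roughly two allocations per row, which appears to come from building [x1,x2] and [x3,x4] for every row; both variants below avoid constructing those small vectors.

```julia
using DataFrames, LinearAlgebra

# Small example frame with columns x1..x6, as in the question.
df = DataFrame(rand(Float64, (6, 6)), :auto)

# Column-wise (hypothetical name k): without ByRow, the function receives
# the four columns as whole vectors and broadcasts over them once.
k(x) = select(x, Between(:x1, :x4) =>
    ((x1, x2, x3, x4) -> x1 .* x3 .+ x2 .* x4) => :dot)

# Row-wise (hypothetical name l): still uses ByRow, but works on scalars
# instead of building two 2-element vectors per row.
l(x) = select(x, Between(:x1, :x4) =>
    ByRow((x1, x2, x3, x4) -> x1 * x3 + x2 * x4) => :dot)

# Reference result computed row by row on plain matrices.
ref = dot.(eachrow(Matrix(df[:, 1:2])), eachrow(Matrix(df[:, 3:4])))

# Both variants should agree with the reference.
k(df).dot ≈ ref && l(df).dot ≈ ref
```

Replacing select with transform keeps the original columns and appends :dot, as in d. Timings should be checked with @btime on the real data, since they depend on the number of rows and columns involved.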