performance, julia, half-precision-float

Why is operating on Float64 faster than Float16?


I wonder why operating on Float64 values is faster than operating on Float16:

julia> rnd64 = rand(Float64, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.800 μs … 662.140 μs  ┊ GC (min … max):  0.00% … 99.37%
 Time  (median):     2.180 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.457 μs ±  13.176 μs  ┊ GC (mean ± σ):  12.34% ±  3.89%

  ▁██▄▂▂▆▆▄▂▁ ▂▆▄▁                                     ▂▂▂▁   ▂
  ████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
  1.8 μs       Histogram: log(frequency) by time      10.6 μs <

 Memory estimate: 8.02 KiB, allocs estimate: 5.

julia> @benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.117 μs … 587.133 μs  ┊ GC (min … max): 0.00% … 98.61%
 Time  (median):     5.383 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.716 μs ±   9.987 μs  ┊ GC (mean ± σ):  3.01% ±  1.71%

    ▃▅█▇▅▄▄▆▇▅▄▁             ▁                                ▂
  ▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
  5.12 μs      Histogram: log(frequency) by time      7.48 μs <

 Memory estimate: 2.14 KiB, allocs estimate: 5.

You might ask why I expect the opposite: because Float16 values have less floating-point precision:

julia> rnd16[1]
Float16(0.627)

julia> rnd64[1]
0.4375452455597999

Shouldn't calculations with less precision be faster? If not, why would anyone use Float16? They could just as well use Float128!


Solution

  • As you can see below, the effect you are expecting is indeed present for Float32:

    julia> rnd64 = rand(Float64, 1000);
    
    julia> rnd32 = rand(Float32, 1000);
    
    julia> rnd16 = rand(Float16, 1000);
    
    julia> @btime $rnd64.^2;
      616.495 ns (1 allocation: 7.94 KiB)
    
    julia> @btime $rnd32.^2;
      330.769 ns (1 allocation: 4.06 KiB)  # faster!!
    
    julia> @btime $rnd16.^2;
      2.067 μs (1 allocation: 2.06 KiB)  # slower!!
    

    Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
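
    You can see this in the generated code. Here is a minimal sketch (the helper names square16 and square64 are my own, and the exact output depends on your Julia version and CPU):

    julia> using InteractiveUtils  # provides @code_llvm; already loaded in the REPL
    
    julia> square16(x::Float16) = x * x;
    
    julia> square64(x::Float64) = x * x;
    
    julia> @code_llvm square16(Float16(1.5))  # without native Float16 arithmetic, look for the
                                              # widen-to-Float32 / narrow-back around the multiply
    
    julia> @code_llvm square64(1.5)           # a single double-precision multiply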

    Note also that you should use variable interpolation ($) when micro-benchmarking: without it, the expression reads rnd32 as a non-constant global, so the compiler cannot assume its type and every evaluation pays extra dispatch and allocation overhead. The difference is significant here, not least in terms of allocations:

    julia> @btime $rnd32.^2;
      336.187 ns (1 allocation: 4.06 KiB)
    
    julia> @btime rnd32.^2;
      930.000 ns (5 allocations: 4.14 KiB)
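
    If you prefer not to interpolate, two equivalent ways to avoid the non-constant-global penalty are a const global or the setup keyword (a sketch with names of my own choosing; exact timings will vary on your machine):

    julia> const crnd32 = rand(Float32, 1000);   # a const global has a known type
    
    julia> @btime crnd32.^2;
    
    julia> @btime v.^2 setup=(v=rand(Float32, 1000));  # or hand the data to the benchmark via setup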