r statistics microbenchmark

What is the meaning of the 'cld' column in 'microbenchmark'?


I always thought that the cld column in the output of microbenchmark was a statistical ranking of the speed. However, that does not seem to be the case:

> microbenchmark(
+   intmap = fintmap(), # slower
+   List   = flist(),
+   times = 5
+ )
Unit: microseconds
   expr     min      lq      mean  median       uq      max neval cld
 intmap 793.984 910.539 1145.8608 911.840 1290.529 1822.412     5  a 
   List   1.092   1.318  201.3712   1.639    3.660  999.147     5   b

So what is it? The doc only says it is a statistical ranking, but of what?

Or maybe it is a multiple comparison test of the speeds, and the unequal standard deviations cause this kind of issue? There's clearly an outlier in the second benchmark.


Edit

It seems that my question was not clear. I know the meaning of the letters a and b; this is the classical way to report a Tukey test. But the results are not coherent here: intmap is slower, yet it is ranked first.


Solution

  • the cld is a Compact Letter Display brought over from the package multcomp.

    From that package: "Equal letters indicate no significant differences."

    What I can't currently determine is whether it's meant to be a ranking or just a classification, i.e. is a meant to be generally faster than b, or just different?

    The relevant code in microbenchmark's summary method (summary.microbenchmark) is:

          ops <- options(warn=-1)
          mdl <- lm(time ~ expr, object)
          comp <- multcomp::glht(mdl, multcomp::mcp(expr = "Tukey"))
          res$cld <- multcomp::cld(comp)$mcletters$monospacedLetters
    

    So from that, it appears to fit a linear model with lm() to the raw times (not the means etc.), then set up a multiple-comparisons object with glht() for all-pairs (Tukey) comparisons, then reduce that to a compact letter display using cld().
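
    The same three steps can be tried outside microbenchmark. This is only a minimal sketch on made-up timings: the data frame `timings`, its group names and the simulated values are assumptions for illustration, and it needs the multcomp package installed.

    ```r
    library(multcomp)

    set.seed(42)
    # Hypothetical raw timings: two slow expressions and one fast one.
    timings <- data.frame(
      expr = factor(rep(c("slow_a", "slow_b", "fast"), each = 50),
                    levels = c("slow_a", "slow_b", "fast")),
      time = c(rnorm(50, 1000, 50), rnorm(50, 1000, 50), rnorm(50, 10, 5))
    )

    # Linear model on the raw times, one coefficient per expression.
    mdl <- lm(time ~ expr, data = timings)
    # All-pairs (Tukey) comparisons between the expressions.
    comp <- glht(mdl, linfct = mcp(expr = "Tukey"))
    # Compact letter display: expressions sharing a letter are not
    # significantly different from each other.
    cld_letters <- cld(comp)$mcletters$Letters
    print(cld_letters)
    ```

    Since lm() with a factor predictor compares group means, a single extreme timing (like the 999.147 maximum in the question's List row) pulls the compared value far away from the median.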

    EDIT: Testing ranking:

    a <- rnorm(1000)
    a
    
    microbenchmark(
      alpha = mean(a),
      beta = a/length(a) |> sum(),
      gamma = sum(a) / length(a),
      times = 10000,
      unit = "nanoseconds"
    )
    
    Unit: nanoseconds
      expr  min   lq    mean median   uq      max neval cld
     alpha 4700 5500 6325.56   5700 6800    37700 10000  a 
      beta 1700 2700 5307.55   2900 3300 12419800 10000  a 
     gamma  900 1100 1240.32   1100 1300    24000 10000   b
    
    microbenchmark(
      gamma = sum(a) / length(a),
      alpha = mean(a),
      beta = a/length(a) |> sum(),
      times = 10000,
      unit = "nanoseconds"
    )
    
    Unit: nanoseconds
      expr  min   lq    mean median   uq      max neval cld
     gamma  900 1100 1214.29   1100 1200    23700 10000  a 
     alpha 4900 5500 6039.82   5700 6200    71900 10000   b
      beta 1700 2500 5459.20   3000 3200 12272900 10000   b
    
    

    This appears to demonstrate that, as suspected, the rows of the table are listed in the order the expressions were passed to microbenchmark(), and the cld letters are assigned sequentially in that order, NOT by overall speed ranking.
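
    If an actual speed ranking is wanted, one option is to sort the summary table by a timing column rather than reading it off cld. A small sketch, assuming the microbenchmark package is installed; the `slow` and `fast` expressions are placeholders chosen only so that the expected order is obvious:

    ```r
    library(microbenchmark)

    # Placeholder expressions: one that sleeps, one that is trivially fast.
    mb <- microbenchmark(
      slow = Sys.sleep(0.01),
      fast = 1 + 1,
      times = 5
    )

    # summary() returns a data frame with one row per expression.
    s <- summary(mb)
    # Rank the expressions by median time, fastest first.
    ranked <- s[order(s$median), c("expr", "median")]
    print(ranked)
    ```

    Sorting by median rather than mean keeps a single outlier from reshuffling the ranking.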

    EDIT 2: Playing with ordering:

    d <- microbenchmark(
      alpha = mean(a),
      beta = a/length(a) |> sum(),
      gamma = sum(a + a - a) / length(a),
      times = 10000,
      unit = "nanoseconds"
    )
    
    print(d, order = "cld")
    
    Unit: nanoseconds
      expr  min   lq    mean median   uq     max neval cld
      beta 1700 1900 2386.04   2000 2300   53400 10000   b
     alpha 5000 5500 6219.35   5700 6400   72700 10000  a 
     gamma 1900 2200 4378.53   2400 2600 8532200 10000  ab
    
    

    It looks to me like it sorts the cld strings alphabetically, as though each letter were its own column: it sorts by the a column first (blanks at the top), then by the b column (likewise), and so on.
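
    That behaviour matches what a plain byte-wise string sort does with the padded letters. A sketch under the assumption that the cld column is ordered as ordinary strings (I have not confirmed this in the print method); the `cld` vector below just copies the three strings from the output above:

    ```r
    # Padded cld strings as they appear in the printed table.
    cld <- c(alpha = " a ", beta = "  b", gamma = " ab")

    # method = "radix" sorts character vectors in the C locale, where a
    # space sorts before any letter, so blanks in a position come first.
    sorted <- sort(cld, method = "radix")
    print(names(sorted))
    ```

    With these inputs the names come out as beta, alpha, gamma, the same row order as in the printed table.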