Why does parLapplyLB not actually balance load?

I'm testing out the parLapplyLB() function to understand what it does to balance a load. But I'm not seeing any balancing happening. For example,

cl <- parallel::makeCluster(2)

system.time(
  parallel::parLapplyLB(cl, 1:4, function(y) {
    if (y == 1) {
      Sys.sleep(3)
    } else {
      Sys.sleep(0.5)
    }}))
##   user  system elapsed 
##  0.004   0.009   3.511 

parallel::stopCluster(cl)

If it was truly balancing the load, the first job (job 1) that sleeps for 3 seconds would be on the first node and the other three jobs (jobs 2:4) would sleep for a total of 1.5 seconds on the other node. In total, the system time should only be 3 seconds.

Instead, I think that jobs 1 and 2 are given to node 1 and jobs 3 and 4 are given to node 2. This results in the total time being 3 + 0.5 = 3.5 seconds. If we run the same code above with parLapply() instead of parLapplyLB(), we get the same system time of about 3.5 seconds.

What am I not understanding or doing wrong?

Solution

NOTE: Since R-3.5.0, the behavior/bug noted by the OP and explained below has been fixed. As noted in R's NEWS file at the time:

* parLapplyLB and parSapplyLB have been fixed to do load balancing
  (dynamic scheduling).  This also means that results of
  computations depending on random number generators will now
  really be non-reproducible, as documented.

ORIGINAL ANSWER (Now only relevant for R versions < 3.5.0 )

For a task like yours (and, for that matter, for any task for which I've ever needed parallel) parLapplyLB isn't really the right tool for the job. To see why not, have a look at the way that it's implemented:

parLapplyLB
# function (cl = NULL, X, fun, ...) 
# {
#     cl <- defaultCluster(cl)
#     do.call(c, clusterApplyLB(cl, x = splitList(X, length(cl)), 
#         fun = lapply, fun, ...), quote = TRUE)
# }
# <bytecode: 0x000000000f20a7e8>
# <environment: namespace:parallel>

## Have a look at what `splitList()` does:
parallel:::splitList(1:4, 2)
# [[1]]
# [1] 1 2
# 
# [[2]]
# [1] 3 4

The problem is that it first splits its list of jobs up into equal-sized sublists that it then distributes among the nodes, each of which runs lapply() on its given sublist. So here, your first node runs jobs on the first and second inputs, while the second node runs jobs using the third and fourth inputs.

Instead, use the more versatile clusterApplyLB(), which works just as you'd hope:

system.time(
  parallel::clusterApplyLB(cl, 1:4, function(y) {
    if (y == 1) {
      Sys.sleep(3)
    } else {
      Sys.sleep(0.5)
    }}))
# user  system elapsed 
# 0.00    0.00    3.09