Tags: r, dplyr, tidyverse, multidplyr

Parallelizing with multidplyr does not produce the correct output


I tried to parallelize ape::dist.topo(), a function that computes distances between unrooted trees.

Normally the function works like this (reprex: 4 random trees with 5 leaves each):

library(tidyverse)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(ape)
set.seed(3)

trees <- 
  map(rep(5, 4), rtree) %>% 
  do.call(c.phylo, .) %>% # Turn the list of phylo objects into a multiPhylo object
  unroot.multiPhylo()

dist.topo(trees)
#      tree1 tree2 tree3
# tree2     4            
# tree3     4     2      
# tree4     4     4     2

I created a function that computes the distances pairwise in a data.frame (so that the rows can be split across clusters):

dist.topo2 <- function(multiphylo){
  expand.grid(multiphylo, multiphylo) %>% 
    as.tibble() %>% 
    mutate(dist = map2(Var1, Var2, dist.topo)) %>% 
    pull(dist) %>% 
    matrix(., nrow = sqrt(length(.))) %>% 
    as.dist()
}

dist.topo2(trees)
#   1 2 3
# 2 4    
# 3 4 2  
# 4 4 4 2

As expected, the result is the same (apart from the names).

Then I added the multidplyr::partition() and multidplyr::collect() functions in my pipeline:

dist.topo3 <- function(multiphylo){
  expand.grid(multiphylo, multiphylo) %>% 
    as.tibble() %>% 
    partition() %>%
    mutate(dist = purrr::map2(Var1, Var2, ape::dist.topo)) %>% 
    collect() %>%
    pull(dist) %>% 
    matrix(., nrow = sqrt(length(.))) %>% 
    as.dist()
}

dist.topo3(trees)
#   1 2 3
# 2 4    
# 3 0 4  
# 4 2 4 4
# Warning messages:
# 1: In bind_rows_(x, .id) :
#   Vectorizing 'multiPhylo' elements may not preserve their attributes
# 2: In bind_rows_(x, .id) :
#   Vectorizing 'multiPhylo' elements may not preserve their attributes
# 3: In bind_rows_(x, .id) :
#   Vectorizing 'multiPhylo' elements may not preserve their attributes
# 4: In bind_rows_(x, .id) :
#   Vectorizing 'multiPhylo' elements may not preserve their attributes
# 5: In bind_rows_(x, .id) :
#   Vectorizing 'multiPhylo' elements may not preserve their attributes
# 6: In bind_rows_(x, .id) :
#   Vectorizing 'multiPhylo' elements may not preserve their attributes

As you can see, the distances are different even though the operations didn't change.

How can I fix this? Maybe it's not possible (see here).

Thanks

Note: I'm aware that this solution may not be optimal (especially because it computes each distance twice), but that's not the point.


Solution

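  • The issue is that partition() and collect() do not preserve row order: partition() distributes the rows of the data.frame across shards, and collect() binds the shards back together in no guaranteed order. Because matrix() fills its cells by position, the distances end up in the wrong cells. If you add the row number as a column and arrange by it after collecting, it fixes the issue: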

    dist.topo3 <- function(multiphylo){
      expand.grid(multiphylo, multiphylo) %>% 
        as.tibble() %>% 
        mutate(rn = row_number()) %>%
        partition(rn) %>%
        mutate(dist = purrr::map2(Var1, Var2, ape::dist.topo)) %>% 
        collect() %>%
        arrange(rn) %>%
        pull(dist) %>% 
        matrix(., nrow = sqrt(length(.))) %>% 
        as.dist()
    }
    dist.topo3(trees)
    #   1 2 3
    # 2 4    
    # 3 4 2  
    # 4 4 4 2
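
To see why the row key matters, here is a minimal base-R sketch that does not use multidplyr at all. It assumes a placeholder distance (abs(Var1 - Var2) instead of dist.topo) and a hypothetical two-shard split that comes back in a different order; to_dist, shard_id and collected are names made up for the illustration.

```r
# Toy stand-in for the pairwise table: 4 items, 16 ordered pairs.
n <- 4
grid <- expand.grid(Var1 = seq_len(n), Var2 = seq_len(n))
grid$dist <- abs(grid$Var1 - grid$Var2)  # placeholder distance, NOT dist.topo
grid$rn <- seq_len(nrow(grid))           # row key, as in the answer above

# Same reshaping step as in dist.topo3: fill a square matrix by position.
to_dist <- function(d) as.dist(matrix(d, nrow = sqrt(length(d))))

expected <- to_dist(grid$dist)

# Hypothetical shard assignment: alternate rows between two shards, then bind
# shard 1 followed by shard 2, which changes the row order.
shard_id <- rep(c(2, 1), length.out = nrow(grid))
collected <- do.call(rbind, split(grid, shard_id))

# Without reordering, values land in the wrong matrix cells.
scrambled <- to_dist(collected$dist)
# Sorting by the row key restores the original layout.
restored <- to_dist(collected$dist[order(collected$rn)])

scrambled_matches <- identical(as.vector(scrambled), as.vector(expected))
restored_matches  <- identical(as.vector(restored), as.vector(expected))
```

Here scrambled_matches is FALSE and restored_matches is TRUE: the computed values are identical in both cases, only their positions differ, which is the same effect as in the dist.topo3() output from the question.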