I tried to parallelize ape::dist.topo(), a function to compute distances between unrooted trees.
Normally the function works like this (reprex: 4 random trees with 5 leaves each):
library(tidyverse)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(ape)
set.seed(3)
trees <-
  map(rep(5, 4), rtree) %>%
  do.call(c.phylo, .) %>% # Turn the list of phylo objects into a multiPhylo object
  unroot.multiPhylo()
dist.topo(trees)
#       tree1 tree2 tree3
# tree2     4
# tree3     4     2
# tree4     4     4     2
I created a function that computes the distances pair by pair in a data.frame (so that the rows can be split across clusters):
dist.topo2 <- function(multiphylo){
  expand.grid(multiphylo, multiphylo) %>%
    as.tibble() %>%
    mutate(dist = map2(Var1, Var2, dist.topo)) %>%
    pull(dist) %>%
    matrix(., nrow = sqrt(length(.))) %>%
    as.dist()
}
dist.topo2(trees)
#   1 2 3
# 2 4
# 3 4 2
# 4 4 4 2
As expected, the result is the same (apart from the names).
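The reshaping step only works because the rows stay in the order expand.grid() produced them. Here is a minimal sketch of that dependency, using a toy numeric vector and an absolute-difference "distance" in place of the trees and dist.topo (both are illustrative assumptions, not part of the original reprex):

```r
# Toy stand-in for the trees: three numbers, with |a - b| as the "distance".
x <- c(1, 4, 9)

# expand.grid() varies its first argument fastest, so the rows come out as
# (1,1), (4,1), (9,1), (1,4), (4,4), ...
grid <- expand.grid(Var1 = x, Var2 = x)
d <- abs(grid$Var1 - grid$Var2)

# matrix() fills column by column, so each run of length(x) rows becomes one column.
m <- matrix(d, nrow = sqrt(length(d)))

pairwise <- as.dist(m)   # lower triangle of the square matrix
reference <- dist(x)     # the same distances computed directly

all.equal(as.vector(pairwise), as.vector(reference))
```

Any step that changes the row order between expand.grid() and matrix() would put distances in the wrong cells without raising an error.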
Then I added the multidplyr::partition() and multidplyr::collect() functions to my pipeline:
dist.topo3 <- function(multiphylo){
  expand.grid(multiphylo, multiphylo) %>%
    as.tibble() %>%
    partition() %>%
    mutate(dist = purrr::map2(Var1, Var2, ape::dist.topo)) %>%
    collect() %>%
    pull(dist) %>%
    matrix(., nrow = sqrt(length(.))) %>%
    as.dist()
}
dist.topo3(trees)
#   1 2 3
# 2 4
# 3 0 4
# 4 2 4 4
# Warning messages:
# 1: In bind_rows_(x, .id) :
# Vectorizing 'multiPhylo' elements may not preserve their attributes
# 2: In bind_rows_(x, .id) :
# Vectorizing 'multiPhylo' elements may not preserve their attributes
# 3: In bind_rows_(x, .id) :
# Vectorizing 'multiPhylo' elements may not preserve their attributes
# 4: In bind_rows_(x, .id) :
# Vectorizing 'multiPhylo' elements may not preserve their attributes
# 5: In bind_rows_(x, .id) :
# Vectorizing 'multiPhylo' elements may not preserve their attributes
# 6: In bind_rows_(x, .id) :
# Vectorizing 'multiPhylo' elements may not preserve their attributes
As you can see, the distances are different even though the operations didn't change.
How can I fix that? Maybe it's not possible (see here).
Thanks
Note: I'm aware that this solution may not be optimal (especially because it computes each distance twice), but that's not the point.
The issue is that partition() shards the data.frame across the workers in no particular order, and collect() puts the shards back together without restoring the original row order. If you add the row number as a column and arrange by it after collecting, the issue goes away:
dist.topo3 <- function(multiphylo){
  expand.grid(multiphylo, multiphylo) %>%
    as.tibble() %>%
    mutate(rn = row_number()) %>%
    partition(rn) %>%
    mutate(dist = purrr::map2(Var1, Var2, ape::dist.topo)) %>%
    collect() %>%
    arrange(rn) %>%
    pull(dist) %>%
    matrix(., nrow = sqrt(length(.))) %>%
    as.dist()
}
dist.topo3(trees)
#   1 2 3
# 2 4
# 3 4 2
# 4 4 4 2
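To illustrate why the row index matters, here is a small base-R sketch that mimics the shard-then-collect round trip with split() and rbind(). This is only a stand-in under assumed round-robin sharding, not a description of how multidplyr assigns rows internally; the data frame df and the two-shard layout are hypothetical.

```r
# Hypothetical data: nine values that should end up in a 3 x 3 matrix.
df <- data.frame(rn = 1:9, value = 1:9)

# Stand-in for partition(): deal the rows out to two "workers" in turn.
shards <- split(df, rep(1:2, length.out = nrow(df)))

# Stand-in for collect(): bind the shards back together, one after the other.
collected <- do.call(rbind, shards)
collected$rn
# 1 3 5 7 9 2 4 6 8

# Without reordering, the values land in the wrong matrix cells.
scrambled <- matrix(collected$value, nrow = 3)

# Sorting by the saved row number restores the original layout.
restored_df <- collected[order(collected$rn), ]
restored <- matrix(restored_df$value, nrow = 3)

identical(restored, matrix(df$value, nrow = 3))
# TRUE
```

The same idea carries over to the pipeline above: rn records the position each row had before sharding, and arrange(rn) puts the rows back before matrix() consumes them.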