Tags: r, dplyr, purrr, furrr

Is it possible to speed this function up using dplyr and furrr?


I have a data frame where rows correspond to documents and columns capture individual words in those documents.

library(tidyverse)
library(furrr)
#> Loading required package: future

doc_by_word_df <- structure(list(
 doc_id = c("doc1.txt", "doc2.txt", "doc3.txt"),
 kwpe_1 = c("apple", "fish", "apple"),
 kwpe_2 = c("bananna", "grain", "insects"),
 kwpe_3 = c("carrot", "insects", "grain")),
 class = c("tbl_df", "tbl", "data.frame"),
 row.names = c(NA,-3L))

doc_by_word_df
#> # A tibble: 3 × 4
#>   doc_id   kwpe_1 kwpe_2  kwpe_3 
#>   <chr>    <chr>  <chr>   <chr>  
#> 1 doc1.txt apple  bananna carrot 
#> 2 doc2.txt fish   grain   insects
#> 3 doc3.txt apple  insects grain

For every possible pair of words, I would like to identify all the documents that contain both words of the pair.

To do that I have created a vector of all the words in the data set and extracted all the unique word-pair combinations.

all_words <- c("apple", "fish", "apple", "bananna", "grain", "insects", "carrot", "insects", "grain")

unique_keyword_pair <- combn(unique(all_words), 2)

unique_keyword_pair
#>      [,1]    [,2]      [,3]    [,4]      [,5]     [,6]      [,7]    [,8]     
#> [1,] "apple" "apple"   "apple" "apple"   "apple"  "fish"    "fish"  "fish"   
#> [2,] "fish"  "bananna" "grain" "insects" "carrot" "bananna" "grain" "insects"
#>      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]    [,15]    
#> [1,] "fish"   "bananna" "bananna" "bananna" "grain"   "grain"  "insects"
#> [2,] "carrot" "grain"   "insects" "carrot"  "insects" "carrot" "carrot"

I've written a function that takes a unique word pair, filters for all the documents containing both words of that pair, and counts them, and I've mapped that function over the full set of word pairs.

It works the way I'd like it to, but it takes a very long time to run. I have used the furrr package to try to speed it up, but I'm still left with a very long run time. Originally I did this with a for loop; I was under the impression that using a map function would shorten things, but I don't think it made much difference.

I do not know enough about this sort of thing to work out how to reduce the run time. I suspect it has to do with the massive number of word-pair combinations being pushed through filter(), but beyond that I'm unsure.
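
To give a sense of the scale involved: the number of pairs produced by combn() grows quadratically with the number of unique words, so the filter is re-run an enormous number of times. For example (2,000 here is just an assumed vocabulary size, not my real one):

choose(2000, 2)
#> [1] 1999000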

Any suggestions would be appreciated.

docs_word_pairs <- function(x) {
  # Keep the documents where one word column matches the first word of pair x
  # and another matches the second word, then count how many documents remain
  doc_by_word_df %>% 
    filter(if_any(-doc_id, ~ . %in% unique_keyword_pair[1, x]) & 
             if_any(-doc_id, ~ . %in% unique_keyword_pair[2, x])) %>% 
    mutate(keyword_pair = paste(unique_keyword_pair[1, x],
                                unique_keyword_pair[2, x],
                                sep = "-"),
           keyword_1 = unique_keyword_pair[1, x],
           keyword_2 = unique_keyword_pair[2, x]) %>% 
    relocate(keyword_pair:keyword_2, .before = doc_id) %>% 
    group_by(keyword_pair) %>%
    summarize(n = n())
}
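# (Illustration: a single call handles one column of unique_keyword_pair; e.g.
# docs_word_pairs(2) keeps only doc1.txt, the one document containing both
# "apple" and "bananna", and summarises it as keyword_pair "apple-bananna", n = 1.)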
num_unique_keyword_pair <- ncol(unique_keyword_pair)  # one pair per column

seq_num_unique_keyword_pair <- seq_len(num_unique_keyword_pair)

future::plan(multisession)

seq_num_unique_keyword_pair %>% 
 future_map_dfr(docs_word_pairs)
#> # A tibble: 8 × 2
#>   keyword_pair       n
#>   <chr>          <int>
#> 1 apple-bananna      1 # one document contains this key word pair
#> 2 apple-grain        1
#> 3 apple-insects      1
#> 4 apple-carrot       1
#> 5 fish-grain         1
#> 6 fish-insects       1
#> 7 bananna-carrot     1
#> 8 grain-insects      2 # two documents contain this key word pair

Created on 2022-04-18 by the reprex package (v2.0.1)


Solution

  • This can be done quickly as shown below:

    as.dist(crossprod(table(cbind(doc_by_word_df[,1],unlist(doc_by_word_df[-1])))))
    
            apple bananna carrot fish grain
    bananna     1                          
    carrot      1       1                  
    fish        0       0      0           
    grain       1       0      0    1      
    insects     1       0      0    1     2
    

    or even

    doc_by_word_df %>%
      pivot_longer(-doc_id) %>%
      select(-name) %>%
      table() %>%
      crossprod() %>%
      as.dist()
    
            apple bananna carrot fish grain
    bananna     1                          
    carrot      1       1                  
    fish        0       0      0           
    grain       1       0      0    1      
    insects     1       0      0    1     2
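
    Why this works: table() turns the (document, word) pairs into a document-by-word incidence matrix, crossprod() then counts, for each pair of words, the documents in which both appear, and as.dist() keeps only the lower triangle so every unordered pair is listed once. A sketch of the intermediate objects on the example data:

    incidence <- doc_by_word_df %>%
      pivot_longer(-doc_id) %>%
      select(-name) %>%
      table()
    # incidence is roughly:
    #            apple bananna carrot fish grain insects
    #   doc1.txt     1       1      1    0     0       0
    #   doc2.txt     0       0      0    1     1       1
    #   doc3.txt     1       0      0    0     1       1

    co_occurrence <- crossprod(incidence)  # word-by-word co-occurrence counts
    as.dist(co_occurrence)                 # drop diagonal and redundant upper triangle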
    

    If you want it as a data frame, do:

    df2 <- crossprod(table(cbind(doc_by_word_df[,1],unlist(doc_by_word_df[-1]))))
    subset(data.frame(as.table(as.matrix(as.dist(df2)))), Freq > 0)
          Var1    Var2 Freq
    2  bananna   apple    1
    3   carrot   apple    1
    5    grain   apple    1
    6  insects   apple    1
    7    apple bananna    1
    9   carrot bananna    1
    13   apple  carrot    1
    14 bananna  carrot    1
    23   grain    fish    1
    24 insects    fish    1
    25   apple   grain    1
    28    fish   grain    1
    30 insects   grain    2
    31   apple insects    1
    34    fish insects    1
    35   grain insects    2
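
    And to match the output shape from the question (one row per unordered keyword_pair with a count), one possible sketch building on df2 from above:

    pairs_df <- subset(data.frame(as.table(as.matrix(as.dist(df2)))), Freq > 0)
    # keep each unordered pair once and rebuild the hyphenated label
    pairs_df <- subset(pairs_df, as.character(Var1) < as.character(Var2))
    pairs_df$keyword_pair <- paste(pairs_df$Var1, pairs_df$Var2, sep = "-")
    pairs_df[, c("keyword_pair", "Freq")]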