rpairwise

R: Create a pairwise matrix of counts of shared values between items in one column and features in another


I have a data thats's a bit like this:

items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
df <- data.frame(items, features)

Which gives a data frame like:

   items features
1      A       ab
2      A       ac
3      A       ad
4      A       ab
5      B       ab
6      B       az
7      B       ay
8      C       az
9      C       az
10     C       al
11     C       ab
12     C       ad
13     D       aa
14     D       ac

I would like to create two pairwise comparisons from the above data frame. The first is a comparison of every item with every other item that counts how many features are shared between them. The second is a comparison of every item with every other item that gives a character string of the shared features (separated by spaces, for example).

I have been able to do this using a loop with another loop within it to compare "A" with "B" and so on, and then "B" with all others, and so on, but this is a very slow process. The real data frame has ~2000 items in it and the compute time gets out of hand pretty quickly as the dataset grows.

The code I have used is like this:

item_list <- unique(df$items)
feature_count <- data.frame(matrix(ncol = length(item_list), nrow = length(item_list)))
colnames(feature_count) <- item_list
rownames(feature_count) <- item_list
feature_details <- data.frame(matrix(ncol = length(item_list), nrow = length(item_list)))
colnames(feature_details) <- item_list
rownames(feature_details) <- item_list
for (n in 1:length(item_list)){
  item <- df[df$item == item_list[n],]
  item_features <- as.list(item$features)
  for (z in 1:length(item_list)){
    comparison <- df[df$item == item_list[z],]
    comparison_features <- as.list(comparison$features)
    if (length(intersect(item_features, comparison_features)) == 0) {
      feature_count[z,n] <- length(intersect(item_features, comparison_features))
      feature_details[z,n] <- NA
    } else {
      feature_count[z,n] <- length(intersect(item_features, comparison_features))
      feature_details[z,n] <- paste(intersect(item_features, comparison_features), collapse = " ")
    }
  }
  diag(feature_count) <- 0
  diag(feature_details) <- NA
}

And returns two data frames like this:

feature_count

  A B C D
A 0 1 2 1
B 1 0 2 0
C 2 2 0 0
D 1 0 0 0

feature_details

      A     B     C    D
A  <NA>    ab ab ad   ac
B    ab  <NA> az ab <NA>
C ab ad ab az  <NA> <NA>
D    ac  <NA>  <NA> <NA>

The above seems like an inelegant and inefficient way of doing this. Could anyone offer any advice on a simpler approach to achieve the same thing that will make working with much, much larger datasets more doable?


Solution

  • Here's a solution using outer(). First, make the data:

    items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
    features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
    df <- data.frame(items, features)
    

    Split the data into groups baed on items, using split() will retain the group names as the names of the elements. Also save the names of the list elements as a vector called n_split.

    items_spl <- split(df, items)
    n_split <-names(items_spl)
    

    The idea using outer() is to generate the outer product of n_split and n_split which will make a matrix that has the list names on the row and column. The function we use here isn't the product, but we are taking the length of the intersection of the unique features from each item's list. That gives the count matrix. Pasting together the intersection of unique features gives the shared feature matrix.

    count <-outer(n_split,
          n_split, 
          Vectorize(
            function(x,y)length(intersect(unique(items_spl[[x]]$features), 
                                          unique(items_spl[[y]]$features)))))
    diag(count) <-0
    
    shared <-outer(n_split,
                  n_split, 
                  Vectorize(
                    function(x,y)paste(intersect(unique(items_spl[[x]]$features), 
                                 unique(items_spl[[y]]$features)), 
                       collapse=" ")))
    diag(shared) <- ""
    rownames(count) <- colnames(count) <- 
    rownames(shared) <- colnames(shared) <- n_split
    
    count
    #>   A B C D
    #> A 0 1 2 1
    #> B 1 0 2 0
    #> C 2 2 0 0
    #> D 1 0 0 0
    
    shared
    #>   A       B       C       D   
    #> A ""      "ab"    "ab ad" "ac"
    #> B "ab"    ""      "ab az" ""  
    #> C "ab ad" "az ab" ""      ""  
    #> D "ac"    ""      ""      ""
    

    Created on 2023-08-11 with reprex v2.0.2