I have two data frames. First, a lookup table comprising a list of vertex names:
lookup <- data.frame(Name=c("Bob","Jane"))
Then I have an edge list that looks like this:
edges <- data.frame(
  vertex1 = c("Bob", "Bill", "Bob", "Jane", "Bill", "Jane", "Bob", "Jane", "Bob", "Bill", "Bob",
              "Jane", "Bill", "Jane", "Bob", "Jane", "Jane", "Jill", "Jane", "Susan", "Susan"),
  edgeID  = c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  vertex2 = c("Bill", "Bob", "Jane", "Bob", "Jane", "Jill", "Jane", "Bob", "Bill", "Bob",
              "Jane", "Bob", "Jane", "Bill", "Jane", "Bob", "Jill", "Jane", "Susan", "Jane", "Jill")
)
For each unique vertex in the "lookup" table, I'd like to iterate through the "edges" table and label every edgeID where lookup$Name is among the vertices.
I can do that with the following script:
library(igraph)
g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
# for each lookup name, take that vertex plus its direct neighbours (distance 1)
# and keep every edge whose two endpoints both fall in that set
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)
This gives:
group vertex1 edgeID vertex2
1 Bob Bob 1 Bill
2 Bob Bill 1 Bob
3 Bob Bob 1 Jane
4 Bob Jane 1 Bob
5 Bob Bill 1 Jane
6 Bob Bob 2 Jane
7 Bob Jane 2 Bob
8 Bob Jane 1 Bill
9 Jane Bob 1 Bill
10 Jane Bill 1 Bob
11 Jane Bob 1 Jane
12 Jane Jane 1 Bob
13 Jane Bill 1 Jane
14 Jane Jane 1 Jill
15 Jane Bob 2 Jane
16 Jane Jane 2 Bob
17 Jane Jane 1 Bill
18 Jane Jane 3 Jill
19 Jane Jill 3 Jane
20 Jane Jane 3 Susan
21 Jane Susan 3 Jane
22 Jane Susan 3 Jill
The problem is that this seems inefficient for large edge lists. In my real data, "lookup" has 3,263 observations while "edges" has 167,775,170 observations. I've been running the script above on an Amazon EC2 instance with 16 cores and 100GB of RAM for two days now with no end in sight (using "future_lapply" instead of "lapply" to allow for parallel processing).
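For reference, the parallel version just swaps lapply for future_lapply from the future.apply package, roughly like this (the plan() setup shown here is a sketch rather than my exact configuration):

library(future.apply)
plan(multisession, workers = 16)  # one R worker per core; illustrative setup

do.call(
  rbind,
  c(
    make.row.names = FALSE,
    future_lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)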
This won't be the only time I need to group edges like this and I'm hoping to find a way to do it that isn't so expensive in terms of time and Amazon bills.
I think you can shrink your original data.frame edges first, so that you avoid calling unique inside lapply on every iteration. The code below may speed things up a bit, but I'm not sure how much it gains on your real data.
edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        # d < 2 keeps the vertex itself (distance 0) and its direct neighbours (distance 1)
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
Update

Since the graph is undirected, you can go further and put each vertex pair into a canonical order before calling unique, so that the same edge listed in both directions collapses into a single row (the 21-row example edge list shrinks to 8 unique rows):
edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    # both ifelse calls are evaluated against the original columns, so vertex1
    # becomes the alphabetically smaller name and vertex2 the larger one
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
which gives
> res
group vertex1 vertex2 edgeID
1 Bob Bill Bob 1
2 Bob Bob Jane 1
3 Bob Bill Jane 1
4 Bob Bob Jane 2
5 Jane Bill Bob 1
6 Jane Bob Jane 1
7 Jane Bill Jane 1
8 Jane Jane Jill 1
9 Jane Bob Jane 2
10 Jane Jane Jill 3
11 Jane Jane Susan 3
12 Jane Jill Susan 3
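If you want to check that deduplicating the edge list doesn't change which edgeIDs end up in each group, you can compare the (group, edgeID) pairs of the two results; a quick sketch, where res.orig is a placeholder name for the output of your original script:

orig.pairs <- unique(paste(res.orig$group, res.orig$edgeID))  # res.orig: original result (placeholder)
new.pairs  <- unique(paste(res$group, res$edgeID))
setequal(orig.pairs, new.pairs)  # should be TRUE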