I have two data frames. First, a lookup table comprising a list of vertex names:
lookup <- data.frame(Name=c("Bob","Jane"))
Then I have an edge list that looks like this:
edges <- data.frame(
  vertex1 = c("Bob", "Bill", "Bob", "Jane", "Bill", "Jane", "Bob", "Jane", "Bob", "Bill", "Bob",
              "Jane", "Bill", "Jane", "Bob", "Jane", "Jane", "Jill", "Jane", "Susan", "Susan"),
  edgeID  = c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  vertex2 = c("Bill", "Bob", "Jane", "Bob", "Jane", "Jill", "Jane", "Bob", "Bill", "Bob",
              "Jane", "Bob", "Jane", "Bill", "Jane", "Bob", "Jill", "Jane", "Susan", "Jane", "Jill")
)
For each unique vertex in the "lookup" table, I'd like to iterate through the "edges" table and label every edgeID where lookup$Name is among the vertices.
I can do that with the following script:
library(igraph)
g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
# for each lookup name, take that vertex plus its direct neighbours (distance 1)
# and keep every edge whose two endpoints both fall in that set
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)
This gives:
group vertex1 edgeID vertex2
1 Bob Bob 1 Bill
2 Bob Bill 1 Bob
3 Bob Bob 1 Jane
4 Bob Jane 1 Bob
5 Bob Bill 1 Jane
6 Bob Bob 2 Jane
7 Bob Jane 2 Bob
8 Bob Jane 1 Bill
9 Jane Bob 1 Bill
10 Jane Bill 1 Bob
11 Jane Bob 1 Jane
12 Jane Jane 1 Bob
13 Jane Bill 1 Jane
14 Jane Jane 1 Jill
15 Jane Bob 2 Jane
16 Jane Jane 2 Bob
17 Jane Jane 1 Bill
18 Jane Jane 3 Jill
19 Jane Jill 3 Jane
20 Jane Jane 3 Susan
21 Jane Susan 3 Jane
22 Jane Susan 3 Jill
The problem is that this seems inefficient for large edge lists. In my real data, "lookup" has 3,263 observations while "edges" has 167,775,170 observations. I've been running the script above on an Amazon EC2 instance with 16 cores and 100GB of RAM for two days now with no end in sight (using "future_lapply" instead of "lapply" to allow for parallel processing).
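For reference, the parallel version just swaps lapply for future_lapply from the future.apply package, roughly like this (the plan() setup shown here is a sketch rather than my exact configuration):

library(future.apply)
plan(multisession, workers = 16)  # one R worker per core; illustrative setup

do.call(
  rbind,
  c(
    make.row.names = FALSE,
    future_lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)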
This won't be the only time I need to group edges like this and I'm hoping to find a way to do it that isn't so expensive in terms of time and Amazon bills.
I think you can shrink your original data.frame edges first, so that you avoid calling unique inside lapply on every iteration. The code below may speed things up a bit, but I'm not sure how much it gains on your real data.
edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        # d < 2 keeps the vertex itself (distance 0) and its direct neighbours (distance 1)
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
Update

Since the graph is undirected, you can go further and put each vertex pair into a canonical order before calling unique, so that the same edge listed in both directions collapses into a single row (the 21-row example edge list shrinks to 8 unique rows):
edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    # both ifelse calls are evaluated against the original columns, so vertex1
    # becomes the alphabetically smaller name and vertex2 the larger one
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
which gives
> res
group vertex1 vertex2 edgeID
1 Bob Bill Bob 1
2 Bob Bob Jane 1
3 Bob Bill Jane 1
4 Bob Bob Jane 2
5 Jane Bill Bob 1
6 Jane Bob Jane 1
7 Jane Bill Jane 1
8 Jane Jane Jill 1
9 Jane Bob Jane 2
10 Jane Jane Jill 3
11 Jane Jane Susan 3
12 Jane Jill Susan 3
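If you want to check that deduplicating the edge list doesn't change which edgeIDs end up in each group, you can compare the (group, edgeID) pairs of the two results; a quick sketch, where res.orig is a placeholder name for the output of your original script:

orig.pairs <- unique(paste(res.orig$group, res.orig$edgeID))  # res.orig: original result (placeholder)
new.pairs  <- unique(paste(res$group, res$edgeID))
setequal(orig.pairs, new.pairs)  # should be TRUE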