I have been stuck with computing efficiently the number of classmates for each student from a course-level database.
Consider this data.frame, where each row represents a course that a student has taken during a given semester:
dat <-
data.frame(
student = c(1, 1, 2, 2, 2, 3, 4, 5),
semester = c(1, 2, 1, 2, 2, 2, 1, 2),
course = c(2, 4, 2, 3, 4, 3, 2, 4)
)
# student semester course
# 1 1 1 2
# 2 1 2 4
# 3 2 1 2
# 4 2 2 3
# 5 2 2 4
# 6 3 2 3
# 7 4 1 2
# 8 5 2 4
Students are going to courses in a given semester. Their classmates are other students attending the same course during the same semester. For instance, across both semesters, student 1 has 3 classmates (students 2, 4 and 5).
How can I get the number of unique classmates each student has combining both semesters? The desired output would be:
student n
1 1 3
2 2 4
3 3 1
4 4 2
5 5 2
where n
is the value for the number of different classmates a student has had during the academic year.
I sense that an igraph
solution could possibly work (hence the tag), but my knowledge of this package is too limited. I also feel like using joins
could help, but again, I am not sure how.
Importantly, I would like this to work for larger datasets (mine has about 17M rows). Here's an example data set:
set.seed(1)
big_dat <-
data.frame(
student = sample(1e4, 1e6, TRUE),
semester = sample(2, 1e6, TRUE),
course = sample(1e3, 1e6, TRUE)
)
First try with igraph
:
library(data.table)
library(igraph)
setDT(dat)
i <- max(dat$student)
g <- graph_from_data_frame(
dat[,.(student, class = .GRP + i), .(semester, course)][,-1:-2]
)
v <- V(g)[1:uniqueN(dat$student)]
data.frame(student = as.integer(names(v)),
n = ego_size(g, 2, v, mindist = 2))
#> student n
#> 1 1 3
#> 2 2 4
#> 3 4 2
#> 4 5 2
#> 5 3 1
Note that if student
is not integer, you'll need to create a temporary integer id with match
on the unique value and then index on the final output.
With tcrossprod
:
library(data.table)
library(Matrix)
setDT(dat)
u <- unique(dat$student)
data.frame(
student = u,
n = colSums(
tcrossprod(
dat[,id := match(student, u)][
,.(i = id, j = .GRP), .(semester, course)
][,sparseMatrix(i, j)]
)
) - 1L
)
#> student n
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 2
#> 5 5 2