So I'm working on creating an edges file for a social network analysis based on IMDb data. And I've run into a problem and I can't figure out how to fix it as I'm new to R.
Assuming I have the following dataframe:
movieID <- c('A', 'A','A', 'B','B', 'C','C', 'C')
crewID <- c('Z', 'Y', 'X', 'Z','V','V', 'X', 'Y')
rating <- c('7.3','7.3', '7.3', '2.1', '2.1', '9.0','9.0', '9.0')
df <- data.frame(movieID, crewID, rating)
movieID | CrewID | Rating |
---|---|---|
A | Z | 7.3 |
A | Y | 7.3 |
A | X | 7.3 |
B | Z | 2.1 |
B | V | 2.1 |
C | V | 9.0 |
C | X | 9.0 |
C | Y | 9.0 |
I am trying to build unique pairs of CrewIDs within a movie with a weight that equals the occurrence of that pair, meaning how often these two crew members have worked on a movie together. So basically I want a dataframe like the following as a result:
CrewID1 | CrewID2 | weight | (not a col but explanation) |
---|---|---|---|
Z | Y | 1 | together once in movie A |
Z | X | 1 | together once in movie A |
Y | X | 2 | together twice in movies A and C |
Z | V | 1 | together once in movie B |
V | X | 1 | together once in movie C |
V | Y | 1 | together once in movie C |
The pairs (Z,Y) and (Y,Z) are equal to each other as I don't care about direction.
I found the following StackOverflow thread on a similar issue: How to create pairs from a single column based on order of occurrence in R?
However in my case this skips the combination (V,Y) and (X,Z) and the count for (X,Y) is still 1 and I can't figure out how to fix it.
m <- crossprod(table(df[-3]))
m[upper.tri(m, diag = TRUE)] <-0
subset(as.data.frame.table(m), Freq > 0)
CrewID CrewID.1 Freq
2 X V 1
3 Y V 1
4 Z V 1
7 Y X 2
8 Z X 1
12 Z Y 1