I've got a dataset with distinct source and target nodes, as well as a numeric variable that's relevant to the relationship.
It looks a bit like this:
library(igraph)
library(tidygraph)
set.seed(24601)
example_data <-
  data.frame(
    source = sample(letters[1:10], 100, replace = TRUE),
    target = sample(letters[16:25], 100, replace = TRUE),
    important_variable = rnorm(100)
  )
Imagine that the members of source are individuals, members of target are different cities that they've travelled to, and I want to create a network that shows when two given cities were visited by the same person. I'd use bipartite_projection() for this, like so:
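example_data %>%
  graph_from_data_frame() %>%
  as_tbl_graph() %>%
  # type is TRUE for individuals (source) and FALSE for cities (target)
  mutate(type = name %in% letters[1:10]) %>%
  # "false" projects onto the FALSE-type vertices, i.e. the cities
  bipartite_projection(which = "false")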
However, I'd like to connect different cities only when a certain condition is met: for example, when the difference in the values of important_variable is at most 0.5 (say, I'm interested in cases where two cities have been visited by the same person in the same year). At the moment, the information from important_variable is discarded once bipartite_projection() is applied.
I can't see a means of restricting the bipartite_projection based on a third numeric variable. Is it possible to do so? Thanks in advance for any help.
Let's look at a small number of rows:
example_data %>%
  filter(source == "a") %>%
  head()
This produces the following:
source target important_variable
1 a x 0.29773720
2 a p 1.50474490
3 a y 0.01149263
4 a q 0.19391773
5 a t -0.10656946
6 a w -0.29516668
I can go straight into a bipartite projection, like so:
example_data %>%
  filter(source == "a") %>%
  head() %>%
  graph_from_data_frame() %>%
  as_tbl_graph() %>%
  mutate(type =
           ifelse(name %in% letters[1:10],
                  TRUE,
                  FALSE)) %>%
  bipartite_projection(which = "false")
which produces an igraph object with one vertex attribute (name) and one edge attribute (weight).
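To check which attributes survive the projection, something like the sketch below can be run on a tiny hand-made graph. The names edges, g and proj are made up for illustration, and the three rows are taken from the output shown earlier. On the igraph versions I'm aware of, the projected edges carry only a weight attribute (the number of shared neighbours), so important_variable is not carried over.

```r
library(igraph)

# Hypothetical toy edge list: person "a" visited cities x, p and y.
edges <- data.frame(
  source = c("a", "a", "a"),
  target = c("x", "p", "y"),
  important_variable = c(0.29773720, 1.50474490, 0.01149263)
)

# Extra columns of the edge list become edge attributes of the graph.
g <- graph_from_data_frame(edges, directed = FALSE)

# TRUE for individuals, FALSE for cities.
V(g)$type <- V(g)$name %in% edges$source

# Project onto the FALSE-type vertices (the cities).
proj <- bipartite_projection(g, which = "false")

vertex_attr_names(proj)  # vertex attributes kept by the projection
edge_attr_names(proj)    # edge attributes kept by the projection
```

The original edge attribute is visible with edge_attr_names(g) but is absent from proj, which is why the values have to be joined back in afterwards.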
However, I'd like something that looks like this (just the first four rows for simplicity):
source_projected target_projected source_att target_att
1 x p 0.2977372 1.50474490
2 x y 0.2977372 0.01149263
3 x q 0.2977372 0.19391773
4 x t 0.2977372 -0.10656946
as this would allow me to filter based on the relationship between my source_att and target_att columns (for example, filtering where the difference between source_att and target_att is less than 0.5).
@ThomasIsCoding has provided a solution that fits with my request. This has made me realise that I wasn't sufficiently detailed.
Starting again with the original data, we can see that a is linked to p twice, and a is linked to y twice. In each case, the value of important_variable is different. See below:
example_data %>%
  filter(source == "a" &
           (target == "p" |
              target == "y"))
source target important_variable
1 a p 1.50474490
2 a y 0.01149263
3 a y -2.34069094
4 a p 0.29294049
The example desired data that I posted only includes each node within target being connected once. However, because the values of important_variable differ, I'd like output that includes all configurations of those pairings, to look like so:
source_projected target_projected source_att target_att
1 p y 1.5047449 0.01149263
2 p y 1.5047449 -2.34069094
3 p y 0.2929405 0.01149263
4 p y 0.2929405 -2.34069094
Is this something that's possible to construct? Thanks!
Since you may have multiple values for a single target, I guess it would be better to use left_join and enable "many-to-many" for the relationship argument:
library(dplyr)  # left_join(), select(), rename(), join_by()
library(purrr)  # pluck()

out <- example_data %>%
  graph_from_data_frame() %>%
  set_vertex_attr(
    name = "type",
    value = names(V(.)) %in% example_data$target
  ) %>%
  bipartite_projection() %>%
  pluck("proj2") %>%
  # namespace-qualified because dplyr may mask as_data_frame()
  igraph::as_data_frame() %>%
  select(-weight) %>%
  # attach every value recorded for the "from" city
  left_join(select(example_data, -source),
    join_by(from == target),
    relationship = "many-to-many"
  ) %>%
  # attach every value recorded for the "to" city
  left_join(select(example_data, -source),
    join_by(to == target),
    relationship = "many-to-many"
  ) %>%
  rename(all_of(c(source_att = "important_variable.x", target_att = "important_variable.y")))
and you will see
> head(out)
from to source_att target_att
1 x y 0.2977372 0.50506407
2 x y 0.2977372 -1.37333412
3 x y 0.2977372 0.61981223
4 x y 0.2977372 0.43724194
5 x y 0.2977372 -1.97363488
6 x y 0.2977372 -0.02413137
> glimpse(out)
Rows: 4,462
Columns: 4
$ from <chr> "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x",…
$ to <chr> "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y",…
$ source_att <dbl> 0.2977372, 0.2977372, 0.2977372, 0.2977372, 0.2977372, 0.29…
$ target_att <dbl> 0.50506407, -1.37333412, 0.61981223, 0.43724194, -1.9736348…
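One thing to be aware of: the joins above match on target only, so each pair of cities is combined with the values from every visit to those cities, not just visits made by the same person. If the same-person condition matters (as the "same person in the same year" example in the question suggests), a self-join on source might be closer to what is wanted. A minimal sketch, assuming dplyr >= 1.1.0 for the relationship argument; visits, pairs and the _from/_to suffixes are names I made up, and the toy rows are the four a/p/y rows from the question:

```r
library(dplyr)

# Toy data: the four a/p/y rows shown in the question.
visits <- data.frame(
  source = c("a", "a", "a", "a"),
  target = c("p", "y", "y", "p"),
  important_variable = c(1.50474490, 0.01149263, -2.34069094, 0.29294049)
)

pairs <- visits %>%
  # pair every visit with every other visit made by the same person
  inner_join(visits, by = "source", suffix = c("_from", "_to"),
             relationship = "many-to-many") %>%
  # keep each unordered city pair once and drop city-to-itself pairs
  filter(target_from < target_to) %>%
  transmute(
    source_projected = target_from,
    target_projected = target_to,
    source_att = important_variable_from,
    target_att = important_variable_to
  )

pairs
```

This gives one row per combination of a p visit and a y visit by the same person, which is the "all configurations" layout asked for. It skips igraph entirely, so whether it fits depends on whether the graph object is needed before filtering.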
Probably you can try the code below
example_data %>%
  graph_from_data_frame() %>%
  set_vertex_attr(
    name = "type",
    value = names(V(.)) %in% example_data$target
  ) %>%
  bipartite_projection() %>%
  pluck("proj2") %>%
  # namespace-qualified because dplyr may mask as_data_frame()
  igraph::as_data_frame() %>%
  select(-weight) %>%
  mutate(
    # match() returns the first matching row only, so each city gets a single value
    source_att = with(example_data, important_variable[match(from, target)]),
    target_att = with(example_data, important_variable[match(to, target)])
  )
which gives
from to source_att target_att
1 x y 0.29773720 0.50506407
2 x p 0.29773720 -0.74022203
3 x u 0.29773720 -2.04969760
4 x q 0.29773720 1.36281039
5 x w 0.29773720 -0.47578690
6 x s 0.29773720 0.03233063
7 x t 0.29773720 -1.08378137
8 x r 0.29773720 -0.72029435
9 x v 0.29773720 -0.22919308
10 y p 0.50506407 -0.74022203
11 y u 0.50506407 -2.04969760
12 y q 0.50506407 1.36281039
13 y w 0.50506407 -0.47578690
14 y s 0.50506407 0.03233063
15 y t 0.50506407 -1.08378137
16 y r 0.50506407 -0.72029435
17 y v 0.50506407 -0.22919308
18 p u -0.74022203 -2.04969760
19 p q -0.74022203 1.36281039
20 p w -0.74022203 -0.47578690
21 p s -0.74022203 0.03233063
22 p t -0.74022203 -1.08378137
23 p r -0.74022203 -0.72029435
24 p v -0.74022203 -0.22919308
25 r u -0.72029435 -2.04969760
26 r q -0.72029435 1.36281039
27 r w -0.72029435 -0.47578690
28 r s -0.72029435 0.03233063
29 r t -0.72029435 -1.08378137
30 r v -0.72029435 -0.22919308
31 u q -2.04969760 1.36281039
32 u w -2.04969760 -0.47578690
33 u s -2.04969760 0.03233063
34 u t -2.04969760 -1.08378137
35 u v -2.04969760 -0.22919308
36 v s -0.22919308 0.03233063
37 v t -0.22919308 -1.08378137
38 v q -0.22919308 1.36281039
39 v w -0.22919308 -0.47578690
40 q w 1.36281039 -0.47578690
41 q s 1.36281039 0.03233063
42 q t 1.36281039 -1.08378137
43 w s -0.47578690 0.03233063
44 w t -0.47578690 -1.08378137
45 s t 0.03233063 -1.08378137
and then I guess you know how to filter the rows with a constraint on the difference between source_att and target_att.
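For completeness, the filtering step could look something like the sketch below. Here projected is a hypothetical stand-in holding the first four rows of the output above, close_edges and g are made-up names, and 0.5 is the threshold mentioned in the question. The filtered edge list is then turned back into an undirected igraph object.

```r
library(dplyr)
library(igraph)

# Hypothetical stand-in: the first four rows of the output above.
projected <- data.frame(
  from = c("x", "x", "x", "x"),
  to = c("y", "p", "u", "q"),
  source_att = c(0.29773720, 0.29773720, 0.29773720, 0.29773720),
  target_att = c(0.50506407, -0.74022203, -2.04969760, 1.36281039)
)

# Keep only the pairs whose values differ by at most 0.5.
close_edges <- projected %>%
  dplyr::filter(abs(source_att - target_att) <= 0.5)

# Rebuild a graph from the remaining edges; the two value columns
# become edge attributes.
g <- graph_from_data_frame(close_edges, directed = FALSE)
```

With these four rows only the x-y pair survives. Bear in mind that the match()-based columns hold just the first value recorded for each city, so this filter compares one value per city rather than every combination.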