I am attempting to create a text network graph. I am working with pivoted survey data, and attempting to associate words from open-ended comments with associated numeric responses. I've constructed word correlations and graphed them, but am having a devil of a time associating numeric values back into the network graph. I have experience with R, but I've not had formal training/classes and I feel confident I'm missing something pretty basic right now.
I was able to successfully create a plot using the following code, assuming graph is my data frame, containing variables x (raw numeric score from the survey data), row_number (to tie individual word used back to its initial open ended comment), word, n (# of times "word" appears in the dataset), and y (average of x per word).
graph %>%
  group_by(word) %>%
  filter(n() >= 1000) %>%
  pairwise_cor(word, row_number, upper = FALSE) %>%
  filter(correlation > .09) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  scale_color_gradient(low = "red", high = "green") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
The pairwise_cor function essentially reshapes the data frame into item1, item2, and correlation, dropping all other variables, so the variables I need for assigning color were dropped. To get them back, I created a correlated-words dataset (cor_df) and then a "final" data frame that joins the per-word average scores (y, from filt_df) onto it:
final <- cor_df %>%
  left_join(filt_df, by = join_by(item1 == word)) %>%
  left_join(filt_df, by = join_by(item2 == word))
"final" now contains item1 (word 1), item2 (word 2), correlation, n.1, y.1, n.2, and y.2 (where n is the word count and y is a somewhat unusual statistic: the average of x, the original numeric survey score associated with that word).
With the "final" data frame, I've now attempted a multitude of ways to map either y.1 or y.2 to the color of the nodes, generally something like:
as_tbl_graph(final)
ggraph(final, layout = "fr") +
  geom_node_point(aes(color = y.1), size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_color_gradient(low = "red", high = "green") +
  theme_void()
This is the error I receive:
Error in geom_node_point():
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in FUN():
! object 'y.1' not found
I'm not sure exactly where I'm going wrong, although I have been poring over the documentation for ggraph and tidygraph. I don't have a full conceptual understanding of the various layout possibilities, which I feel is likely where my issues lie (or possibly my confusion starts with the construction of the graph object itself via as_tbl_graph?), and I would really welcome any additional resources or documentation on understanding those algorithms and customizing layouts. (I've read https://cran.r-project.org/web/packages/ggraph/vignettes/Layouts.html and all of the ggraph vignettes!)
My question, boiled down, is: how can I use a numeric variable to add a color dimension to nodes in a network graph using ggraph (or more specifically, what the heck am I doing wrong)? Thanks in advance for any help!
The first issue with your code is that you are passing the data.frame final to ggraph() instead of the tbl_graph object as_tbl_graph(final).
The second issue is that, when converting to a tbl_graph, the y.1 and y.2 columns you added via left_join become columns (features) of the edges data, not the nodes data, and are thus not available to be mapped to aesthetics in the geom_node_xxx layers. To fix this second issue, convert cor_df to a tbl_graph first, then join your filt_df. This way the columns are added to the nodes data.
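To see where the columns end up, you can inspect the node and edge tables of the graph separately. The following is only a minimal sketch: the small final data frame here is a made-up stand-in for your joined edge list (the words and numbers are hypothetical), and it assumes that as_tbl_graph() treats the first two columns as the edge endpoints, as it does for cor_df.

```r
library(tidygraph)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical miniature of the joined edge list (values are invented)
final <- data.frame(
  item1 = c("good", "good", "bad"),
  item2 = c("service", "price", "price"),
  correlation = c(0.4, 0.2, 0.3),
  y.1 = c(4.1, 4.1, 1.8)
)

g <- as_tbl_graph(final)

# Column names of the node table and of the edge table
node_cols <- g |> activate(nodes) |> as_tibble() |> names()
edge_cols <- g |> activate(edges) |> as_tibble() |> names()

print(node_cols)
print(edge_cols)
```

If y.1 shows up only among the edge columns, geom_node_point() has nothing called y.1 to map, which matches the "object 'y.1' not found" error.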
Note: I do only one left_join, as a second does not make sense for the nodes data. I also renamed the column from y to value, as I encountered a warning when using y.
Using some fake data based on the highschool dataset from ggraph:
library(ggraph)
library(tidygraph)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)

# Create example data: an edge list plus one numeric value per word
cor_df <- highschool
names(cor_df) <- c("item1", "item2", "correlation")

filt_df <- data.frame(
  word = as.character(unique(cor_df$item1)),
  y = runif(length(unique(cor_df$item1)))
) |>
  rename(value = y)

# Convert the edge list to a tbl_graph first, then join onto the node data
final_graph <- as_tbl_graph(cor_df) |>
  left_join(
    filt_df,
    by = join_by(name == word)
  )

ggraph(final_graph, layout = "fr") +
  geom_node_point(aes(color = value), size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_color_gradient(low = "red", high = "green") +
  theme_void()