r, nlp, visualization, ggraph

unable to map color onto a network graph from an additional variable using ggraph


I am attempting to create a text network graph. I am working with pivoted survey data, and attempting to associate words from open-ended comments with associated numeric responses. I've constructed word correlations and graphed them, but am having a devil of a time associating numeric values back into the network graph. I have experience with R, but I've not had formal training/classes and I feel confident I'm missing something pretty basic right now.

I was able to successfully create a plot using the following code, assuming graph is my data frame, containing variables x (raw numeric score from the survey data), row_number (to tie individual word used back to its initial open ended comment), word, n (# of times "word" appears in the dataset), and y (average of x per word).

graph %>%
  group_by(word) %>%
  filter(n() >= 1000)%>%
  pairwise_cor(word, row_number, upper=FALSE) %>%
  filter(correlation > .09) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  scale_color_gradient(low = "red", high = "green") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

The pairwise_cor function essentially reshapes the dataframe into item1, item2, and correlation, dropping all other variables, meaning my relevant color-assigning variables were dropped, so I created a correlated words dataset, and then a final_df that joins individual word average scores (y) with the correlated_words dataset:

final <- cor_df %>%
  left_join(filt_df, by = join_by(item1 == word)) %>%
  left_join(filt_df, by = join_by(item2 == word))

"final" now contains item1 (word 1), item2 (word 2), correlation, n.1, y.1, n.2, and y.2 (where n is the word count and y is a weird stat: the average of x, the original numeric survey score associated with that word).

With the "final" data frame, I've now attempted a multitude of ways to map either y.1 or y.2 to the color of the nodes, generally something like:

as_tbl_graph(final)
ggraph(final, layout = "fr") +
 geom_node_point(aes(color = y.1), size = 5) +
 geom_node_text(aes(label = name), repel = TRUE) +
 scale_color_gradient(low = "red", high = "green") +
 theme_void()

This is the error I receive:

Error in geom_node_point():
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in FUN():
! object 'y.1' not found

Not sure exactly where I'm going wrong, although I have been poring through the documentation for ggraph and tidygraph. I don't have a full conceptual understanding of the various layout possibilities, which I feel is likely where my issues lie (or possibly my confusion starts in the construction of the dataframe itself via as_tbl_graph?), and would really welcome any additional resources or documentation towards understanding those algorithms/customizing layouts. (I've read https://cran.r-project.org/web/packages/ggraph/vignettes/Layouts.html and all of the ggraph vignettes!)

My question, boiled down, is: how can I use a numeric variable to add a color dimension to nodes in a network graph using ggraph (or more specifically, what the heck am I doing wrong)? Thanks in advance for any help!


Solution

  • The first issue with your code is that you are passing the data.frame final to ggraph() instead of the tbl_graph object as_tbl_graph(final).

    The second issue is that, when converting to a tbl_graph, the y.1 and y.2 columns you added via the left_join become columns of the edge data, not the node data, and are thus not available to be mapped to aesthetics in geom_node_xxx. To fix this second issue, convert cor_df to a tbl_graph first, then join your filt_df. This way the columns are added to the node data.

    Note: I use only one left_join, because the node data has a single name column to match on, so a second join does not make sense there. I also renamed the column from y to value, as I encountered a warning when using y.

    Using some fake data based on the highschool dataset from ggraph:

    library(ggraph)
    library(tidygraph)
    library(dplyr, warn.conflicts = FALSE)
    
    set.seed(123)
    
    # Create example data
    cor_df <- highschool
    names(cor_df) <- c("item1", "item2", "correlation")
    
    filt_df <- data.frame(
      word = as.character(unique(cor_df$item1)),
      y = runif(length(unique(cor_df$item1)))
    ) |> 
      rename(value = y)
    
    final_graph <- as_tbl_graph(cor_df) |>
      left_join(
        filt_df,
        by = join_by(name == word)
      )
    
    ggraph(final_graph, layout = "fr") +
      geom_node_point(aes(color = value), size = 5) +
      geom_node_text(aes(label = name), repel = TRUE) +
      scale_color_gradient(low = "red", high = "green") +
      theme_void()
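
If you want to check for yourself where the joined columns end up, here is a minimal sketch. The toy data, the values in it, and the name word_scores are made up for illustration and only stand in for your final and filt_df; the point is just to compare the column names of the node and edge tables.

```r
library(tidygraph)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data standing in for the question's `final` data frame:
# one row per word pair, with per-word averages already joined on.
final <- data.frame(
  item1 = c("good", "good", "slow"),
  item2 = c("fast", "slow", "fast"),
  correlation = c(0.30, 0.12, 0.25),
  y.1 = c(4.2, 4.2, 2.1),
  y.2 = c(3.8, 2.1, 3.8)
)

g <- as_tbl_graph(final)

# Compare the columns of the node table and the edge table.
node_cols <- g |> activate(nodes) |> as_tibble() |> names()
edge_cols <- g |> activate(edges) |> as_tibble() |> names()

print(node_cols) # only the node identifier
print(edge_cols) # from, to, plus every extra column of `final`

# Hypothetical one-row-per-word table (stand-in for filt_df).
word_scores <- data.frame(
  word = c("good", "fast", "slow"),
  value = c(4.2, 3.8, 2.1)
)

# Build the graph from the edge list only, then join onto the nodes.
g_nodes <- as_tbl_graph(final[, c("item1", "item2", "correlation")]) |>
  activate(nodes) |>
  left_join(word_scores, by = join_by(name == word))

g_nodes_cols <- g_nodes |> activate(nodes) |> as_tibble() |> names()
print(g_nodes_cols) # now includes `value`
```

After the join, value sits in the node table, which is where geom_node_point(aes(color = value)) looks for it. Any node without a match in the joined table gets NA for value.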