r sankey-diagram networkd3

How to adjust a node width in a sankey plot in R?


I am running into a problem when creating a sankey plot because of the structure of my data.

Below is a sample data frame. I am interested in how each feature's category changes across the stages, and I need the node sizes to represent the number of features per category at each stage. The problem is that at Stages 1 and 2 some of the features have two assigned categories. I need the sankey plot to show that:

df = data.frame(Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
                Stage1 = c("A", "A&B", "B", "B&C", "C"),
                Stage2 = c("D", "D&E", "F", "F&G", "G"),
                Stage3 = c("a", "b", "c", "d", "e"),
                Stage4 = c("f", "f", "f", "g", "g"))

I split the values in the nodes with more than one category:

library(tidyr)

df_sep <- df |>
  separate_rows(Stage1, sep = "&") |>
  separate_rows(Stage2, sep = "&")

Next, I tried using ggsankey, but when I make df_sep long, the split rows are duplicated, and since the node size is controlled by the frequency in the "x" and "next_x" columns, the node sizes at Stage 3 become unequal.

library(ggplot2)
library(ggsankey)

sankey_data <- df_sep |>
  ggsankey::make_long(Stage1, Stage2, Stage3, Stage4)

ggplot(sankey_data, aes(x = x, 
                        next_x = next_x, 
                        node = node, 
                        next_node = next_node,
                        fill = node,
                        label = node)) +
  geom_sankey()

[Image: wrong sankey plot]
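A quick way to see where the unequal Stage 3 nodes come from is to count how many rows each feature contributes after the split. This is only a diagnostic sketch on the sample data above; the names `rows_per_feature` and `rows_per_stage3` are made up for illustration.

```r
library(tidyr)

df <- data.frame(
  Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
  Stage1 = c("A", "A&B", "B", "B&C", "C"),
  Stage2 = c("D", "D&E", "F", "F&G", "G"),
  Stage3 = c("a", "b", "c", "d", "e"),
  Stage4 = c("f", "f", "f", "g", "g")
)

df_sep <- df |>
  separate_rows(Stage1, sep = "&") |>
  separate_rows(Stage2, sep = "&")

# Number of rows each feature contributes after both splits
rows_per_feature <- table(df_sep$Features)

# Number of rows behind each Stage3 node, which is what sets its height
rows_per_stage3 <- table(df_sep$Stage3)

print(nrow(df_sep))  # 11 rows for 5 features
print(rows_per_feature)
print(rows_per_stage3)
```

Feature2 and Feature4 have two categories at both Stage 1 and Stage 2, so each expands to four rows (2 x 2), while the other features stay at one row. Their Stage 3 nodes ("b" and "d") are therefore drawn four times as tall as the rest.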

I tried looking into networkD3, but the code below gives me a blank white sheet, and I'm not sure what's happening there...

library(tidyr)
library(dplyr)
library(networkD3)

links <- df_sep |> 
  pivot_longer(cols = c(Stage1, Stage2, Stage3, Stage4), 
               names_to = "source", 
               values_to = "target") |> 
  group_by(source, target) |>
  summarise(value = length(unique((Features)))) 

nodes <- data.frame(name = unique(c(links$source, links$target)))

sankeyNetwork(Links = links, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 12, nodeWidth = 30)

Any suggestions on how I can control the size of the nodes in ggsankey? Or what is the problem with networkD3? Or is it better to use another package altogether in this case? Or maybe some ideas on how to manipulate the data differently to fix this?


Solution

  • The sankey plot you've posted looks correct to me given the data you're using, so I'm not sure which lines you think are duplicated.

    The networkD3 code does not work because you must refer to the nodes in your links data frame with the 0-based index of the nodes in your nodes data frame, which you can achieve like this...

    df = data.frame(Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
                    Stage1 = c("A", "A&B", "B", "B&C", "C"),
                    Stage2 = c("D", "D&E", "F", "F&G", "G"),
                    Stage3 = c("a", "b", "c", "d", "e"),
                    Stage4 = c("f", "f", "f", "g", "g"))
    
    df_sep <-
      df |>
      tidyr::separate_rows(Stage1, sep = "&") |>
      tidyr::separate_rows(Stage2, sep = "&")
    
    links <-
      df_sep |>
      dplyr::mutate(row = dplyr::row_number()) |>
      tidyr::pivot_longer(
        cols = dplyr::starts_with("Stage"),
        names_to = "col",
        names_pattern = "Stage(.*)",
        names_transform = as.integer,
        values_to = "source"
      ) |>
      dplyr::mutate(target = dplyr::lead(source), .by = "row") |>
      dplyr::filter(!is.na(target)) |>
      dplyr::summarise(value = dplyr::n(), .by = c(source, target))
    
    nodes <- data.frame(name = unique(c(links$source, links$target)))
    
    # use the 0-based index of nodes in the nodes data frame as the ID
    links$source_id <- match(links$source, nodes$name) - 1
    links$target_id <- match(links$target, nodes$name) - 1
    
    networkD3::sankeyNetwork(
      Links = links,
      Nodes = nodes,
      Source = "source_id",
      Target = "target_id",
      Value = "value",
      NodeID = "name",
      fontSize = 12,
      nodeWidth = 30
    )
    #> Links is a tbl_df. Converting to a plain data frame.
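If the goal is for every feature to count exactly once at each stage, one possible variation (not part of the answer above, and resting on the assumption that a feature with several categories should be shared equally among them) is to give each split row a fractional weight and sum those weights instead of counting rows. The `weight` column and the name `links_weighted` are made up for this sketch.

```r
df <- data.frame(
  Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
  Stage1 = c("A", "A&B", "B", "B&C", "C"),
  Stage2 = c("D", "D&E", "F", "F&G", "G"),
  Stage3 = c("a", "b", "c", "d", "e"),
  Stage4 = c("f", "f", "f", "g", "g")
)

df_sep <- df |>
  tidyr::separate_rows(Stage1, sep = "&") |>
  tidyr::separate_rows(Stage2, sep = "&") |>
  # Assumption: each feature has a total weight of 1, split evenly over its rows
  dplyr::mutate(weight = 1 / dplyr::n(), .by = Features)

links_weighted <- df_sep |>
  dplyr::mutate(row = dplyr::row_number()) |>
  tidyr::pivot_longer(
    cols = dplyr::starts_with("Stage"),
    names_to = "stage",
    values_to = "source"
  ) |>
  # Pair each stage's value with the next stage's value within the same row
  dplyr::mutate(target = dplyr::lead(source), .by = row) |>
  dplyr::filter(!is.na(target)) |>
  # Sum the fractional weights instead of counting rows
  dplyr::summarise(value = sum(weight), .by = c(source, target)) |>
  as.data.frame()

nodes <- data.frame(name = unique(c(links_weighted$source, links_weighted$target)))

# networkD3 expects 0-based node indices in the links data frame
links_weighted$source_id <- match(links_weighted$source, nodes$name) - 1
links_weighted$target_id <- match(links_weighted$target, nodes$name) - 1

p <- networkD3::sankeyNetwork(
  Links = links_weighted,
  Nodes = nodes,
  Source = "source_id",
  Target = "target_id",
  Value = "value",
  NodeID = "name",
  fontSize = 12,
  nodeWidth = 30
)
p
```

With this weighting the flow into each Stage 3 node sums to 1, so those nodes come out equal in height, while Stage 1 and Stage 2 nodes show fractional feature counts (for example, "A" receives all of Feature1 and half of Feature2). Whether fractional counts are the right reading of "number of features per category" depends on what the plot is meant to communicate.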