I am running into a problem when creating a sankey plot because of the structure of my data.
Below is a sample data frame. I am interested in the transformations of each feature across different stages and need the node sizes to represent the number of features per category at each stage. The problem is that at Stages 1 and 2 some of the features have two assigned categories. I need the sankey plot to show that:
df = data.frame(Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
Stage1 = c("A", "A&B", "B", "B&C", "C"),
Stage2 = c("D", "D&E", "F", "F&G", "G"),
Stage3 = c("a", "b", "c", "d", "e"),
Stage4 = c("f", "f", "f", "g", "g"))
I split the values in the nodes that have more than one category:
library(tidyr)
df_sep <- df|>
separate_rows(Stage1, sep = "&")|>
separate_rows(Stage2, sep = "&")
Next, I tried using ggsankey, but when I make df_sep long, the split rows are duplicated, and since the node size is controlled by the frequency in the "x" and "next_x" columns, the node sizes at Stage 3 become unequal.
library(ggplot2)
library(ggsankey)
sankey_data <- df_sep |>
ggsankey::make_long(Stage1, Stage2, Stage3, Stage4)
ggplot(sankey_data, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = node,
label = node)) +
geom_sankey()
I also tried networkD3, but the code below gives me a blank white page, and I'm not sure what's happening there:
library(tidyr)
library(dplyr)
library(networkD3)
links <- df_sep |>
pivot_longer(cols = c(Stage1, Stage2, Stage3, Stage4),
names_to = "source",
values_to = "target") |>
group_by(source, target) |>
summarise(value = length(unique((Features))))
nodes <- data.frame(name = unique(c(links$source, links$target)))
sankeyNetwork(Links = links, Nodes = nodes,
Source = "source", Target = "target",
Value = "value", NodeID = "name",
fontSize = 12, nodeWidth = 30)
Any suggestions on how I can control the size of the nodes in ggsankey? Or what the problem is with networkD3? Would it be better to use another package altogether in this case? Or are there any ideas on how to manipulate the data differently to fix this?
The sankey plot you've posted looks correct to me given the data you're using, so I'm not sure which lines you think are duplicated.
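For reference, here is a minimal check of what the plot is built from, assuming the same `df` and `separate_rows()` steps as in the question (the names `rows_per_feature` and `rows_per_stage3` are only illustrative). It counts how many rows each feature, and each Stage 3 category, contributes after splitting:

```r
# Rebuild the sample data from the question
df <- data.frame(
  Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
  Stage1 = c("A", "A&B", "B", "B&C", "C"),
  Stage2 = c("D", "D&E", "F", "F&G", "G"),
  Stage3 = c("a", "b", "c", "d", "e"),
  Stage4 = c("f", "f", "f", "g", "g")
)

# Splitting two columns in sequence yields every Stage1 x Stage2 combination
df_sep <- df |>
  tidyr::separate_rows(Stage1, sep = "&") |>
  tidyr::separate_rows(Stage2, sep = "&")

# Rows contributed by each feature and by each Stage 3 category
rows_per_feature <- table(df_sep$Features)
rows_per_stage3 <- table(df_sep$Stage3)

print(rows_per_feature)
print(rows_per_stage3)
```

Features with two categories at both Stage 1 and Stage 2 (Feature2 and Feature4) end up with four rows each, while the others have one, so any plot that sizes nodes by row count will draw "b" and "d" larger than "a", "c" and "e".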
The networkD3 code does not work because you must refer to the nodes in your links data frame with the 0-based index of the nodes in your nodes data frame, which you can achieve like this:
df = data.frame(Features = c("Feature1", "Feature2", "Feature3", "Feature4", "Feature5"),
Stage1 = c("A", "A&B", "B", "B&C", "C"),
Stage2 = c("D", "D&E", "F", "F&G", "G"),
Stage3 = c("a", "b", "c", "d", "e"),
Stage4 = c("f", "f", "f", "g", "g"))
df_sep <-
df |>
tidyr::separate_rows(Stage1, sep = "&") |>
tidyr::separate_rows(Stage2, sep = "&")
links <-
df_sep |>
dplyr::mutate(row = dplyr::row_number()) |>
tidyr::pivot_longer(
cols = dplyr::starts_with("Stage"),
names_to = "col",
names_pattern = "Stage(.*)",
names_transform = as.integer,
values_to = "source"
) |>
dplyr::mutate(target = dplyr::lead(source), .by = "row") |>
dplyr::filter(!is.na(target)) |>
dplyr::summarise(value = dplyr::n(), .by = c(source, target))
nodes <- data.frame(name = unique(c(links$source, links$target)))
# use the 0-based index of nodes in the nodes data frame as the ID
links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1
networkD3::sankeyNetwork(
Links = links,
Nodes = nodes,
Source = "source_id",
Target = "target_id",
Value = "value",
NodeID = "name",
fontSize = 12,
nodeWidth = 30
)
#> Links is a tbl_df. Converting to a plain data frame.