I have the following data frame, where each patient is a row (I am showing only a sample of it):
df = structure(list(firstY = c("N/A", "1", "3a", "3a", "3b", "1",
"2", "1", "5", "3b"), secondY = c("N/A", "1", "2", "3a", "4",
"1", "N/A", "1", "5", "3b"), ThirdY = c("N/A", "1", "N/A", "3b",
"4", "1", "N/A", "1", "N/A", "3b"), FourthY = c("N/A", "1", "N/A",
"3a", "4", "1", "N/A", "1", "N/A", "3a"), FifthY = c("N/A", "1",
"N/A", "2", "5", "1", "N/A", "N/A", "N/A", "3b")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to plot a Sankey diagram, which shows the trajectory over time of each patient, and I know that I have to create nodes and links, but I'm having problems transforming the data to the format necessary to accomplish this. Specifically, the most problematic issue is to count how many patients belong to each trajectory, for example, how many patients went in the first year from stage 1 to 2, and all other combinations.
Any help with the data preparation would be appreciated.
The alluvial package, although simple to understand, does not cope well with large amounts of data.
It's not entirely clear what you'd like to achieve, since you don't mention which package you'd like to use, but looking at your data, the alluvial package could help, if you are able to use it:
library(alluvial) # sankey plots
library(dplyr) # data manipulation
The alluvial() function can use data in wide form like yours, but it needs a frequency column, so we create that first and then plot:
dats_all <- df %>% # data
group_by( firstY, secondY, ThirdY, FourthY, FifthY) %>% # group them
summarise(Freq = n()) # add frequencies
# now plot it
alluvial( dats_all[,1:5], freq=dats_all$Freq, border=NA )
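If only the counts between two consecutive years are needed (for example, how many patients went from stage 1 in the first year to stage 2 in the second), a plain cross-tabulation may be enough. The following is a minimal base R sketch; it copies the firstY and secondY values from the sample data in the question, and the names `two_years`, `tab` and `transitions` are only illustrative.

```r
# First two columns of the sample data from the question
two_years <- data.frame(
  firstY  = c("N/A", "1", "3a", "3a", "3b", "1", "2", "1", "5", "3b"),
  secondY = c("N/A", "1", "2", "3a", "4", "1", "N/A", "1", "5", "3b"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are the stage in year 1, columns the stage in year 2
tab <- table(from = two_years$firstY, to = two_years$secondY)

# Long form, keeping only the transitions that actually occur
transitions <- subset(as.data.frame(tab), Freq > 0)
print(transitions)
```

The same idea can be repeated for each pair of consecutive years; "N/A" is treated here as just another stage, which may or may not be what you want.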
On the other hand, if you'd like to use a specific package, you should say which one.
EDIT
Using networkD3 is a bit trickier, but you can get a nice result with it. You need links and nodes, and they have to be matched to each other, so first we create the links:
# put your df in two columns, and preserve the ordering in many levels (columns) with paste0
links <- data.frame(source = c(paste0(df$firstY,'_1'),paste0(df$secondY,'_2'),paste0(df$ThirdY,'_3'),paste0(df$FourthY,'_4')),
target = c(paste0(df$secondY,'_2'),paste0(df$ThirdY,'_3'),paste0(df$FourthY,'_4'),paste0(df$FifthY,'_5')))
# now convert as character
links$source <- as.character(links$source)
links$target<- as.character(links$target)
Now the nodes are simply the unique() elements that appear in the links:
nodes <- data.frame(name = unique(c(links$source, links$target)))
Now each link has to refer to its nodes by position, so we match the names against the nodes and convert them to numbers. Note the -1 at the end: networkD3 is 0-indexed, meaning the indexes start from 0.
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1 # add also a value
Now you should be ready to plot your Sankey diagram:
library(networkD3) # provides sankeyNetwork()
sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
Target = 'target', Value = 'value', NodeID = 'name')
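Since the question is mainly about counting how many patients follow each transition, one possible variation is to collapse the duplicated links and sum their values, so that each source/target pair appears once with its patient count. This is only a sketch using base R `aggregate()`; the data frame `toy` holds made-up values with the same column layout as `df`, and `link_counts` is an illustrative name.

```r
# Hypothetical toy data: 4 patients, 3 years (same layout as df)
toy <- data.frame(
  firstY  = c("1", "1", "1", "2"),
  secondY = c("1", "1", "2", "2"),
  ThirdY  = c("1", "2", "2", "N/A"),
  stringsAsFactors = FALSE
)

# One source/target row per patient and per pair of consecutive years;
# the "_<year>" suffix keeps the same stage in different years apart
years <- names(toy)
links <- do.call(rbind, lapply(seq_len(length(years) - 1), function(i) {
  data.frame(
    source = paste0(toy[[years[i]]], "_", i),
    target = paste0(toy[[years[i + 1]]], "_", i + 1),
    stringsAsFactors = FALSE
  )
}))

# Count the patients on each transition by summing a column of ones
links$value <- 1
link_counts <- aggregate(value ~ source + target, data = links, FUN = sum)

# Nodes and 0-based indexes, as in the steps above
nodes <- data.frame(
  name = unique(c(as.character(link_counts$source),
                  as.character(link_counts$target))),
  stringsAsFactors = FALSE
)
links <- data.frame(
  source = match(link_counts$source, nodes$name) - 1,
  target = match(link_counts$target, nodes$name) - 1,
  value  = link_counts$value
)
```

The resulting `links` and `nodes` can be passed to sankeyNetwork() with the same arguments as above; the width of each band then reflects the number of patients on that transition.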