
Convert a data.frame to a tree structure object such as dendrogram


I have a data.frame object. For a simple example:

> data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
  x  y   z
1 A Ab Abb
2 A Ac Acc
3 B Ba Bad
4 B Ba Bae
5 B Bd Bdd

There are many more rows and columns in the actual data. How could I create a nested tree structure object, such as a dendrogram, like this:

         |---Ab---Abb
     A---|
     |   |---Ac---Acc
   --|                 /--Bad 
     |   |---Ba-------|
     B---|             \--Bae
         |---Bd---Bdd

Solution

  • data.frame to Newick

    I did my PhD in computational phylogenetics, and somewhere along the way I wrote this code, which I used once or twice when I got data in this format (nonstandard in the phylogenetic sense). The script traverses the data frame as if it were a tree and pastes the labels it meets along the way into a Newick string, a standard format that can then be converted into any kind of tree object.

    I guess the script could be optimized (I used it so rarely that more work on it would have reduced the overall efficiency), but it is better to share it than to let it collect dust lying around on my hard drive.

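        ## recursion function
        ## NOTE: reads the data frame from a global variable named `df`
        traverse <- function(a,i,innerl){
            if(i < (ncol(df))){
                ## distinct labels in column i+1 among the rows where column i equals a
                alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
                desc <- NULL
                ## a single child: skip this inner node and descend directly
                if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
                else {
                    for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
                    il <- NULL; if(innerl==TRUE) il <- a
                    (newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
                }
            }
            ## last column: the label is a leaf
            else { (newickout <- a) }
        }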
    
        ## data.frame to newick function
        df2newick <- function(df, innerlabel=FALSE){
            alevel <- as.character(unique(df[,1]))
            newick <- NULL
            for(x in alevel) newick <- c(newick,traverse(x,1,innerlabel))
            (newick <- paste("(",paste(newick,collapse=","),");",sep=""))
        }
    

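    The main function df2newick() takes two arguments: df, the data frame to convert, and innerlabel, a logical flag (default FALSE) that controls whether inner node labels are written into the Newick string.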

    To demonstrate it on your example:

        df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
        myNewick <- df2newick(df)
        #[1] "((Abb,Acc),((Bad,Bae),Bdd));"
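
    One thing worth checking before converting: traverse() selects rows only by the label in the current column, so a label that occurs under two different parents would have its children pooled together. A minimal sketch of such a check in base R follows; ambiguous_labels() is a hypothetical helper, not part of the original script.

    ```r
    # Hypothetical helper (not part of the original script): for each pair of
    # adjacent columns, list the labels that occur under more than one parent.
    ambiguous_labels <- function(df) {
        out <- list()
        for (i in seq_len(ncol(df) - 1)) {
            parents <- as.character(df[[i]])
            kids    <- as.character(df[[i + 1]])
            # number of distinct parents seen for each child label
            n_par <- tapply(parents, kids, function(p) length(unique(p)))
            bad <- names(n_par)[n_par > 1]
            if (length(bad) > 0) out[[names(df)[i + 1]]] <- bad
        }
        out
    }

    df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    ambiguous_labels(df)    # empty list: every label has a single parent

    # made-up data where the label 'X' sits under both 'A' and 'B'
    bad <- data.frame(x=c('A','B'), y=c('X','X'), z=c('p','q'))
    ambiguous_labels(bad)   # reports 'X' for column y
    ```

    An empty list means the labels are safe to use as they are; otherwise the flagged labels could be made unique first, for example by prefixing them with their parent's name.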
    

    Now you can read it into an object of class phylo with read.tree() from ape:

        library(ape)
        mytree <- read.tree(text=myNewick)
        plot(mytree)
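
    Since the question asks for a dendrogram: one possible route from the phylo object, sketched under the assumption that the tree is rooted and strictly binary (as it is in this example), is to give it ultrametric branch lengths with compute.brlen() and then convert through hclust. The Newick string carries no branch lengths, so the heights assigned here are arbitrary and only the topology is meaningful. Trees with multifurcations would need extra handling that is not shown.

    ```r
    library(ape)

    myNewick <- "((Abb,Acc),((Bad,Bae),Bdd));"   # string produced for the example data
    mytree <- read.tree(text = myNewick)

    # The string has no branch lengths; compute.brlen() assigns arbitrary
    # ones (Grafen's method by default) that make the tree ultrametric.
    mytree <- compute.brlen(mytree)

    # as.hclust() on a phylo object needs a rooted, binary, ultrametric tree.
    stopifnot(is.rooted(mytree), is.binary(mytree), is.ultrametric(mytree))

    mydend <- as.dendrogram(as.hclust(mytree))
    ```

    plot(mydend) then draws it with the base dendrogram method. Inner node labels are not carried over by this conversion.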
    

    If you want to add inner node labels to the Newick string, you can use this:

        myNewick <- df2newick(df, TRUE)
        #[1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
    

    Hope this is useful (and maybe my PhD wasn't a complete waste of time ;-)


    Additional note for your dataframe format:

    As you can observe, the df2newick function ignores inner nodes with only one child (which is the best choice for most phylogenetic methods anyway, and was all that mattered to me). The df objects that I originally got and used with this script were of this format:

        df <- data.frame(x=c('A','A','B','B','B'), y=c('Abb','Acc','Ba', 'Ba','Bdd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    

    This is very similar to yours, but there the inner single-child nodes simply had the same name as their children. In your data these nodes have their own names, and those names get ignored. This might not be relevant, but if it is, you can disable part of the recursion function, like this:

        traverse <- function(a,i,innerl){
            if(i < (ncol(df))){
                alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
                desc <- NULL
                ##if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
                ##else {
                    for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
                    il <- NULL; if(innerl==TRUE) il <- a
                    (newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
                ##}
            }
            else { (newickout <- a) }
        }
    

    and you would get something like this:

        [1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"
    

    This really looks odd to me, but I add it just in case, because it now includes all the information from your original data frame.
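
    Both versions of traverse() above read the data frame from a global variable that has to be called df, whatever is passed to df2newick(). As a sketch of how that could be avoided, the variant below passes the data frame down the recursion explicitly and turns the single-child behaviour into a flag. The names traverse_df, df2newick2 and keep_single are made up for this illustration and are not part of the original script.

    ```r
    # Sketch: same traversal, with the data frame passed explicitly.
    # traverse_df, df2newick2 and keep_single are hypothetical names.
    traverse_df <- function(df, a, i, innerl, keep_single = FALSE) {
        if (i >= ncol(df)) return(a)   # last column: the label is a leaf
        # distinct labels in column i+1 among the rows where column i equals a
        kids <- as.character(unique(df[[i + 1]][as.character(df[[i]]) == a]))
        # single child: skip this inner node unless asked to keep it
        if (length(kids) == 1 && !keep_single)
            return(traverse_df(df, kids, i + 1, innerl, keep_single))
        desc <- vapply(kids,
                       function(b) traverse_df(df, b, i + 1, innerl, keep_single),
                       character(1))
        paste0("(", paste(desc, collapse = ","), ")", if (innerl) a else "")
    }

    df2newick2 <- function(df, innerlabel = FALSE, keep_single = FALSE) {
        top <- as.character(unique(df[[1]]))
        parts <- vapply(top,
                        function(x) traverse_df(df, x, 1, innerlabel, keep_single),
                        character(1))
        paste0("(", paste(parts, collapse = ","), ");")
    }

    # the data frame no longer has to be called df
    mydata <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    df2newick2(mydata)
    #[1] "((Abb,Acc),((Bad,Bae),Bdd));"
    df2newick2(mydata, TRUE, keep_single = TRUE)
    #[1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"
    ```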