python algorithm hierarchical-clustering phylogeny

Python: build a Newick string from a dataframe of 0s and 1s


I have a dataframe like this:

      a b c d e f g h i j k l m
mut1  0 0 0 0 0 1 1 1 1 1 1 1 1 
mut2  0 0 0 0 0 1 1 1 1 1 0 0 0 
mut3  0 0 0 0 0 1 1 0 0 0 0 0 0 
mut4  0 0 0 0 0 1 0 0 0 0 0 0 0 
mut5  0 0 0 0 0 0 0 1 1 0 0 0 0 
mut6  0 0 0 0 0 0 0 1 0 0 0 0 0 
mut7  0 0 0 0 0 0 0 0 0 1 0 0 0 
mut8  0 0 0 0 0 0 0 0 0 0 1 1 1 
mut9  0 0 0 0 0 0 0 0 0 0 1 1 0 
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1 
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0 
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0 
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0 
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0 
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0 
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0 

and the original corresponding string:

(a:0,b:0,c:0,d:0,e:0,f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0

The algorithm I have in mind works like this.

In row mut1, we can see that f,g,h,i,j,k,l,m have the same features. So the string can be modified into

(a:0,b:0,c:0,d:0,e:0,(f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0):0

In row mut2, we can see that f,g,h,i,j have the same features. So the string can be modified into

(a:0,b:0,c:0,d:0,e:0,((f:0,g:0,h:0,i:0,j:0):0,k:0,l:0,m:0):0):0

Through mut10, the process keeps clustering the samples within f,g,h,i,j,k,l,m.

And the output will be

(a:0,b:0,c:0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0

(A row with only one "1" is skipped.)

From mut11, it starts to cluster samples a,b,c,d,e

and similarly, the final output will be

(((a:0,b:0):0,c:0):0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0

So the algorithm is

  1. Cluster the samples with the same features.
  2. After clustering, add ":0" behind the closing parenthesis.
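
A minimal Python sketch of these two steps, assuming that any two rows mark groups of samples that are either nested or disjoint (as in the table above). The names `to_newick`, `columns` and `rows` are made up for illustration; for a pandas DataFrame `df`, the inputs could be taken as `list(df.columns)` and `df.values.tolist()`.

```python
def to_newick(columns, rows):
    """Nest samples that share a mutation and return a ':0'-annotated string.

    columns: sample names in display order.
    rows: one list of 0/1 values per mutation, aligned with columns.
    Assumes any two groups are either nested or disjoint.
    """
    order = {name: i for i, name in enumerate(columns)}
    # Step 1: each row with two or more 1s defines a group of samples.
    groups = {frozenset(c for c, v in zip(columns, row) if v == 1) for row in rows}
    groups = {g for g in groups if len(g) >= 2}

    def render(members, candidates):
        # Groups strictly inside the current group.
        inner = [g for g in candidates if g < members]
        # Keep only the largest ones; smaller groups are handled by recursion.
        top = [g for g in inner if not any(g < h for h in inner)]
        covered = set()
        items = []
        for g in top:
            if covered & g:
                raise ValueError("groups overlap without being nested")
            covered |= g
            items.append((min(order[s] for s in g), render(g, inner)))
        # Samples that belong to no subgroup stay as plain leaves.
        items += [(order[s], f"{s}:0") for s in members - covered]
        # Step 2: close the group and append ":0".
        return "(" + ",".join(text for _, text in sorted(items)) + "):0"

    return render(frozenset(columns), groups)


columns = list("abcdefghijklm")
table = """
0000011111111 0000011111000 0000011000000 0000010000000
0000000110000 0000000100000 0000000001000 0000000000111
0000000000110 0000000000001 1111100000000 1110000000000
1100000000000 1000000000000 0001000000000 0000100000000
"""
# One 0/1 list per mutation, mut1 to mut16 in reading order.
rows = [[int(ch) for ch in line] for line in table.split()]
print(to_newick(columns, rows))
```

Applied literally, row mut11 marks a,b,c,d,e together, so this sketch should wrap those five samples in one extra pair of parentheses compared with the final string shown above. Within each group, entries are ordered by column position, which does not change the tree itself.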

Any suggestions on this process?

*P.S. I have posted a similar question, "Creating a newick format from dataframe with 0 and 1", but this one is more detailed.


Solution

  • Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.

    What your question describes is a matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, the Newick format is useful. Newick differs from the tree representation in your question by ending with a semicolon.

    First, prepare a starting tree in phylo format.

    library(phangorn)
    tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")
    

    Second, convert your data.frame to a phyDat object, where the rows represent samples and the columns features (hence the transpose with t() below). The phyDat object also needs to know which levels are present in the data, which are 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.

    dat0 <- read.table(text = "      a b c d e f g h i j k l m
        mut1  0 0 0 0 0 1 1 1 1 1 1 1 1 
        mut2  0 0 0 0 0 1 1 1 1 1 0 0 0 
        mut3  0 0 0 0 0 1 1 0 0 0 0 0 0 
        mut4  0 0 0 0 0 1 0 0 0 0 0 0 0 
        mut5  0 0 0 0 0 0 0 1 1 0 0 0 0 
        mut6  0 0 0 0 0 0 0 1 0 0 0 0 0 
        mut7  0 0 0 0 0 0 0 0 0 1 0 0 0 
        mut8  0 0 0 0 0 0 0 0 0 0 1 1 1 
        mut9  0 0 0 0 0 0 0 0 0 0 1 1 0 
        mut10 0 0 0 0 0 0 0 0 0 0 0 0 1 
        mut11 1 1 1 1 1 0 0 0 0 0 0 0 0 
        mut12 1 1 1 0 0 0 0 0 0 0 0 0 0 
        mut13 1 1 0 0 0 0 0 0 0 0 0 0 0 
        mut14 1 0 0 0 0 0 0 0 0 0 0 0 0 
        mut15 0 0 0 1 0 0 0 0 0 0 0 0 0 
        mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
    
    dat1 <- phyDat(data = t(dat0), 
        type = "USER",
        levels = c(0, 1))
    
    tree1 <- optim.parsimony(tree = tree0, data = dat1)
    plot(tree1)
    

    The tree is now a cladogram with no branch lengths. Class phylo is effectively a list, so the zero branch lengths can be added as an extra element.

    tree2 <- tree1
    tree2$edge.length <- rep(0, nrow(tree2$edge))
    

    Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.

    tree3 <- write.tree(tree2)
    tree3 <- sub(";", "", tree3)
    tree3
    # [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"