
python ete2 delete node from tree


I'm trying to remove a set of nodes from a given tree file.

Input: a tree file, plus a text file containing the names of the only nodes that should remain in the tree.

This is my code:

def find_id(file,id):

    f= open(file) 
    s = open(file).read()
    if re.search(id,s,re.MULTILINE):
        found= True
    else: 
        found= False
    return found

def remove_nodes(treeFile,idFile):  
        t= Tree(treeFile,format=8)
        removed=[]
        for node in t: 
              #print node.name
              if not find_id(idFile,'^'+node.name+'\s') and node.is_leaf():
                   n= node.delete()
                   removed.append(n)
        print removed
        t.write(format=1, outfile="newtree.nw")

remove_nodes('arthropods.nw','taxidMap.txt')

arthropods.nw is a Newick tree file; this is an extract:

((260574)58772(874683,874682,874681,874680,874685,874684,1096898,874676,874677,874678,874679)89902(((((61988,390855,109756,62003,374072,244964,146864,251422,388540,438507,681530)61987,(244997,1068629,485196,681527,126872,111303,58784,134582,89817,231264)58783)109754,((289475,390856,118505)118504)118506)61986(((((756952,756950,756951,171369,1053728,231396)171368,(980235)980234,(118484)118483,(126927)126926,(1147029,863609,89974,1255757...

taxidMap.txt:

135631 NC_015190
29137 NC_003314
29139 NC_003322
...

The problem is that when I print the list "removed", it is a list of None values, and I notice that the number of nodes in the tree is still greater than the number of names in the input file. Any suggestions? Thanks in advance.


Solution

  • I am not sure whether the rest of the code works without examples of the input files, but I found this line, which should be changed:

    - removed.append(n)  
    + removed.append(node)
    

    n is actually equal to None, as the delete function does not return anything.
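
To see why the list ends up full of None, here is a small stand-in that does not use ete2 at all. `FakeNode` is a hypothetical class, written only for illustration, whose `delete()` modifies the structure in place and returns nothing, which is the behaviour described in the point above.

```python
class FakeNode(object):
    """Hypothetical stand-in for a tree node (this is NOT the ete2 class)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def delete(self):
        # Detach this node from its parent. There is no return statement,
        # so calling delete() evaluates to None.
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None


root = FakeNode("root")
leaf_a = FakeNode("a", root)
leaf_b = FakeNode("b", root)

wrong = [leaf_a.delete()]  # stores the return value of delete()
right = [leaf_b]           # stores the node itself
leaf_b.delete()

print(wrong)                    # [None]
print([n.name for n in right])  # ['b']
```

Appending the node object rather than the result of the call is what keeps the names available for printing afterwards.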

    PS: by the way, for @houdini, the Tree class used is documented here: http://pythonhosted.org/ete2/reference/reference_tree.html

    EDIT:

    OK, given your input files, I would change your code like this:

    from ete2 import Tree
    import re
    
    def find_id(file,id):
        # Return True if the regex pattern `id` matches anywhere in the file.
        # The file is opened once and closed automatically.
        with open(file) as f:
            s = f.read()
        return bool(re.search(id,s,re.MULTILINE))
    
    def remove_nodes(treeFile,idFile):  
    
        t= Tree(treeFile,format=8)
        print t.get_ascii()
        removed=[]
        for node in t.iter_descendants():
            # print node.name
            if not find_id(idFile,'^'+node.name+'\s'):
                node.delete(prevent_nondicotomic=False)
                removed.append(node)
    
        print [n.name for n in removed]
        print t.get_ascii()
        t.write(format=1, outfile="newtree.nw")
    
    remove_nodes('arthropods.nw','taxidMap.txt')
    

    My tree file is:

    (58772,89902,((61988,390855)29139,((62003,374072)244964,146864,251422)388540,29137)61987);
    

    And my file with the list of IDs:

    29137 NC_003314
    29139 NC_003322
    62003 NC_004444
    

    And here the output:

          /-58772
         |
         |--89902
         |
         |          /-61988
    -NoName    /29139
         |    |     \-390855
         |    |
         |    |            /-62003
         |    |      /244964
          \61987    |      \-374072
              |-388540
              |     |--146864
              |     |
              |      \-251422
              |
               \-29137
    ['58772', '89902', '61987', '388540', '61988', '390855', '244964', '146864', '251422', '374072']
    
          /-29139
         |
    -NoName-29137
         |
          \-62003
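
As a side note, the `find_id` helper re-reads and regex-scans the whole ID file once for every node. A possible alternative, sketched here, is to load the IDs into a set once and test membership directly. `load_ids` is a hypothetical helper name, and the sketch assumes that the ID is the first whitespace-separated field on each line, as in the sample files shown.

```python
def load_ids(path):
    """Return the set of IDs found in the first column of the file.

    Assumes one record per line, with the ID as the first
    whitespace-separated field (as in the sample taxidMap.txt).
    """
    ids = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                ids.add(fields[0])
    return ids
```

With this, `remove_nodes` could call `keep = load_ids(idFile)` once before the loop and replace the `find_id(...)` test with `node.name not in keep`. Besides being faster on a large tree, an exact string comparison avoids having to build a regular expression from the node name.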
    

    EDIT2:

    To remove only leaves, just drop the iter_descendants() call and iterate over the tree directly, as you were doing:

    def remove_nodes(treeFile,idFile):
    
        t= Tree(treeFile,format=8)
        print t
        removed=[]
        for node in t:
            # print node.name
            if not find_id(idFile,'^'+node.name+'\s'):
                node.delete(prevent_nondicotomic=False)
                removed.append(node)
    
        print [n.name for n in removed]
        print t
        t.write(format=1, outfile="newtree.nw")
    

    However, in the example I am using, the result is quite ugly. Perhaps with more nodes to keep it would look nicer.

       /-58772
      |
      |--89902
      |
      |      /-61988
    --|   /-|
      |  |   \-390855
      |  |
      |  |      /-62003
      |  |   /-|
       \-|  |   \-374072
         |--|
         |  |--146864
         |  |
         |   \-251422
         |
          \-29137
    ['58772', '89902', '61988', '390855', '374072', '146864', '251422']
    
          /-29139
         |
    -- /-|-- /- /-62003
         |
          \-29137