pythonphylogenyete3

Python / ete3: Target most closely related leaf to a specific species in a phylogenetic tree


I am using the Python package ete3. I have trees such as:

((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);

I would like to see the most closely related leaf to a specific node in the tree (here the tree is Species1_order1). In the example, the most closely related leaves are Species2_order2/ Species3_order2, and Species4_order3/Species5_order5.

Code:

tree = ete3.Tree('((Species1_order1, \
                    (Species2_order2, Species3_order2)), \
                   Species4_order3, Species5_order5);')

New example :

    tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')

The result I get is :

         A    B    C    D    E    F    G    H    I
    A  0.0  2.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    B  2.0  0.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    C  3.0  3.0  0.0  3.0  5.0  5.0  5.0  7.0  7.0
    D  4.0  4.0  3.0  0.0  4.0  4.0  4.0  6.0  6.0
    E  6.0  6.0  5.0  4.0  0.0  2.0  4.0  6.0  6.0
    F  6.0  6.0  5.0  4.0  2.0  0.0  4.0  6.0  6.0
    G  6.0  6.0  5.0  4.0  4.0  4.0  0.0  4.0  4.0
    H  8.0  8.0  7.0  6.0  6.0  6.0  4.0  0.0  2.0
    I  8.0  8.0  7.0  6.0  6.0  6.0  4.0  2.0  0.0

But for instance E and F have an equaly distance to A,B,C and D in the tree and in the result they appear to be clother to D.

A good matrix result should rather be :


        A   B   C   D   E   F   G   H   I
    A   0   1   2   3   4   4   5   6   6
    B   1   0   2   3   4   4   5   6   6
    C   2   2   0   3   4   4   5   6   6
    D   3   3   3   0   4   4   5   6   6
    E   4   4   4   4   0   1   5   6   6
    F   4   4   4   4   1   0   5   6   6
    G   5   5   5   5   5   5   0   6   6
    H   6   6   6   6   6   6   6   0   1
    I   6   6   6   6   6   6   6   1   0

is not it ?


Solution

  • As discussed in the comments, ete3 gives us a function called Tree.get_closest_leaf, but it's output is not what is expected (and I am not sure what this value even represents here):

    >>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
    >>> t.get_closest_leaf('Species2_order2')
    (Tree node 'Species4_order3' (0x115b2f29), 0.0)
    

    Instead, you can get the node distance like this:

    import ete3
    import pandas as pd
    
    def make_matrix(tree):
        def get_root_path(node):
            root_path = [node]
            if node.up:
                root_path.extend(get_root_path(node.up))
            return root_path
        leaves = tree.get_leaves()
        leaf_ct = len(leaves)
        paths = {node.name: set(get_root_path(node)) for node in leaves}
        col_lbls = [leaf.name for leaf in leaves]
        dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
        df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
        for node1_name, col in df.iteritems():
            for node2_name in col.keys():
                path = paths[node2_name].symmetric_difference(paths[node1_name])
                dist = sum(node.dist for node in path)
                df.at[node1_name, node2_name] = dist
                df.at[node2_name, node1_name] = dist
        return df
    

    Note: This is a suboptimal solution for several reasons, but this question is not asking for the most most efficient solution. see this link for much more information about phylogenetic distance matrix methods.

    This solution also uses pandas which is overkill, since it is really just for the convenience of row/column labels. It would not be difficult to remove the pandas dependencies and do it with native lists instead.

    Here is the output:

    >>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
    >>> make_matrix(tree)
                     Species1_order1  Species2_order2  Species3_order2  Species4_order3  Species5_order5
    Species1_order1              0.0              3.0              3.0              3.0              3.0
    Species2_order2              3.0              0.0              2.0              4.0              4.0
    Species3_order2              3.0              2.0              0.0              4.0              4.0
    Species4_order3              3.0              4.0              4.0              0.0              2.0
    Species5_order5              3.0              4.0              4.0              2.0              0.0
    

    For the updates posted, I am not seeing anything wrong. It appears to give correct results. Here is the tree as rendered by ete3 (I highlighted the 4 hops that are counted in the distance from Interest_sequence to Rhopalosiphum_maidis_Hemiptera):

    and here is the matrix column for Interest_sequence that corresponds to it:

    >>> m['Interest_sequence']
    Rhopalosiphum_maidis__Hemiptera            4.0
    Drosophila_novamexicana__Hemiptera         5.0
    Drosophila_arizonae__Hemiptera             6.0
    Drosophila_navojoa__Hemiptera              6.0
    Interest_sequence                          0.0
    Heliothis_virescens_droso_3a__nan          5.0
    Mythimna_separata_droso__nan               6.0
    Heliothis_virescens_droso_3i__nan          6.0
    Scaptodrosophila_lebanonensis__Diptera     5.0
    Mythimna_unipuncta_droso_A__nan            6.0
    Xestia_c-nigrum_droso__nan                 8.0
    Helicoverpa_armigera_droso__nan            8.0
    Mocis_latipes_droso__nan                   7.0
    Drosophila_busckii__Diptera                4.0
    Drosophila_bipectinata__Diptera            5.0
    Drosophila_mojavensis__Diptera             7.0
    Drosophila_yakuba__Diptera                 7.0
    Drosophila_hydei__Diptera                  7.0
    Drosophila_serrata__Diptera                8.0
    Drosophila_takahashii__Diptera             9.0
    Drosophila_eugracilis__Diptera            11.0
    Drosophila_ficusphila__Diptera            11.0
    Drosophila_erecta__Diptera                12.0
    Drosophila_melanogaster__Diptera          13.0
    Sequence_A_nan__nan                       14.0
    Drosophila_sechellia__Diptera             15.0
    Drosophila_simulans__Diptera              15.0
    Drosophila_suzukii__Diptera               12.0
    Drosophila_biarmipes__Diptera             12.0
    Name: Interest_sequence, dtype: float64