I am using the Python package ete3. I have trees such as:
((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);
I would like to find the leaves most closely related to a specific node in the tree (here the node is Species1_order1).
In the example, the most closely related leaves are Species2_order2/Species3_order2 and Species4_order3/Species5_order5.
Code:
tree = ete3.Tree('((Species1_order1, \
(Species2_order2, Species3_order2)), \
Species4_order3, Species5_order5);')
New example:
tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')
The result I get is:
     A    B    C    D    E    F    G    H    I
A  0.0  2.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
B  2.0  0.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
C  3.0  3.0  0.0  3.0  5.0  5.0  5.0  7.0  7.0
D  4.0  4.0  3.0  0.0  4.0  4.0  4.0  6.0  6.0
E  6.0  6.0  5.0  4.0  0.0  2.0  4.0  6.0  6.0
F  6.0  6.0  5.0  4.0  2.0  0.0  4.0  6.0  6.0
G  6.0  6.0  5.0  4.0  4.0  4.0  0.0  4.0  4.0
H  8.0  8.0  7.0  6.0  6.0  6.0  4.0  0.0  2.0
I  8.0  8.0  7.0  6.0  6.0  6.0  4.0  2.0  0.0
But, for instance, E and F are equally distant from A, B, C and D in the tree, whereas in the result they appear to be closer to D.
I think a correct matrix should instead be:
   A  B  C  D  E  F  G  H  I
A  0  1  2  3  4  4  5  6  6
B  1  0  2  3  4  4  5  6  6
C  2  2  0  3  4  4  5  6  6
D  3  3  3  0  4  4  5  6  6
E  4  4  4  4  0  1  5  6  6
F  4  4  4  4  1  0  5  6  6
G  5  5  5  5  5  5  0  6  6
H  6  6  6  6  6  6  6  0  1
I  6  6  6  6  6  6  6  1  0
Shouldn't it?
As discussed in the comments, ete3 gives us a function called Tree.get_closest_leaf, but its output is not what is expected (and I am not sure what this value even represents here):
>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)
Instead, you can get the node distance like this:
import ete3
import numpy as np
import pandas as pd

def make_matrix(tree):
    def get_root_path(node):
        # Collect the node and all of its ancestors up to the root.
        root_path = [node]
        if node.up:
            root_path.extend(get_root_path(node.up))
        return root_path

    leaves = tree.get_leaves()
    leaf_ct = len(leaves)
    # Map each leaf name to the set of nodes on its path to the root.
    paths = {node.name: set(get_root_path(node)) for node in leaves}
    col_lbls = [leaf.name for leaf in leaves]
    dist_matrix = np.zeros((leaf_ct, leaf_ct))
    df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
    for node1_name, col in df.items():
        for node2_name in col.keys():
            # Nodes on exactly one of the two root paths lie between the leaves.
            path = paths[node2_name].symmetric_difference(paths[node1_name])
            dist = sum(node.dist for node in path)
            df.at[node1_name, node2_name] = dist
            df.at[node2_name, node1_name] = dist
    return df
Note: This is a suboptimal solution for several reasons, but this question is not asking for the most efficient solution. See this link for much more information about phylogenetic distance matrix methods.
This solution also uses pandas, which is overkill, since it is really just for the convenience of row/column labels. It would not be difficult to remove the pandas dependency and do it with native lists instead.
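As a rough illustration of the native-list variant, here is a minimal sketch. It assumes only that nodes expose .name, .dist, .up and .children (which ete3 tree nodes do); the Node class, iter_leaves, root_path and make_matrix_plain are hypothetical names introduced for this example so that it runs without ete3 or pandas, and the tree is built by hand rather than parsed from Newick.

```python
class Node:
    """Hypothetical stand-in for a tree node (not part of ete3).

    It exposes the attributes the functions below rely on:
    .name, .dist, .up and .children.
    """

    def __init__(self, name="", children=(), dist=1.0):
        self.name = name
        self.dist = dist
        self.up = None
        self.children = list(children)
        for child in self.children:
            child.up = self


def iter_leaves(node):
    """Yield the leaves below node, left to right."""
    if not node.children:
        yield node
    for child in node.children:
        yield from iter_leaves(child)


def root_path(node):
    """Return the set of nodes from node up to the root, inclusive."""
    path = set()
    while node is not None:
        path.add(node)
        node = node.up
    return path


def make_matrix_plain(tree):
    """Return (labels, matrix); matrix[i][j] is the leaf-to-leaf path length."""
    leaves = list(iter_leaves(tree))
    labels = [leaf.name for leaf in leaves]
    paths = [root_path(leaf) for leaf in leaves]
    matrix = [[0.0] * len(leaves) for _ in leaves]
    for i in range(len(leaves)):
        for j in range(i + 1, len(leaves)):
            # Nodes on exactly one root path lie between the two leaves.
            between = paths[i] ^ paths[j]
            matrix[i][j] = matrix[j][i] = sum(n.dist for n in between)
    return labels, matrix


# The first example tree from the question, built by hand.
tree = Node(children=[
    Node(children=[
        Node("Species1_order1"),
        Node(children=[Node("Species2_order2"), Node("Species3_order2")]),
    ]),
    Node("Species4_order3"),
    Node("Species5_order5"),
])
labels, matrix = make_matrix_plain(tree)
for label, row in zip(labels, matrix):
    print(label, row)
```

The row and column order follows the left-to-right leaf order, so labels plays the role of the DataFrame index.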
Here is the output of make_matrix:
>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
                 Species1_order1  Species2_order2  Species3_order2  Species4_order3  Species5_order5
Species1_order1              0.0              3.0              3.0              3.0              3.0
Species2_order2              3.0              0.0              2.0              4.0              4.0
Species3_order2              3.0              2.0              0.0              4.0              4.0
Species4_order3              3.0              4.0              4.0              0.0              2.0
Species5_order5              3.0              4.0              4.0              2.0              0.0
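Regarding the second example in the question (the A to I tree): the first matrix there is the leaf-to-leaf path length, counting every edge between the two leaves. The "expected" matrix is consistent with a different quantity, namely the height of the two leaves' most recent common ancestor (the number of edges from that ancestor down to its deepest leaf). That is only one possible reading of what is wanted. The sketch below computes it under that assumption. Node, build, iter_leaves, height, common_ancestor and lca_height_matrix are hypothetical helpers written for this example, not ete3 functions.

```python
class Node:
    """Hypothetical stand-in for a tree node (not part of ete3).

    Only .name, .up and .children are used below.
    """

    def __init__(self, name="", children=()):
        self.name = name
        self.up = None
        self.children = list(children)
        for child in self.children:
            child.up = self


def build(spec):
    """Build a tree from nested tuples of leaf names (illustrative helper)."""
    if isinstance(spec, str):
        return Node(spec)
    return Node(children=[build(part) for part in spec])


def iter_leaves(node):
    """Yield the leaves below node, left to right."""
    if not node.children:
        yield node
    for child in node.children:
        yield from iter_leaves(child)


def height(node):
    """Number of edges from node down to its deepest leaf."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def common_ancestor(a, b):
    """Return the deepest node that is an ancestor of both a and b.

    Assumes a and b belong to the same tree.
    """
    seen = set()
    while a is not None:
        seen.add(a)
        a = a.up
    while b not in seen:
        b = b.up
    return b


def lca_height_matrix(tree):
    """Return (labels, matrix) of common-ancestor heights for all leaf pairs."""
    leaves = list(iter_leaves(tree))
    labels = [leaf.name for leaf in leaves]
    matrix = [[height(common_ancestor(a, b)) for b in leaves] for a in leaves]
    return labels, matrix


# Same topology as '((((((A,B),C),D),(E,F)),G),(H,I));'
spec = (((((("A", "B"), "C"), "D"), ("E", "F")), "G"), ("H", "I"))
labels, matrix = lca_height_matrix(build(spec))
for label, row in zip(labels, matrix):
    print(label, *row)
```

With this measure, E and F are equally far from A, B, C and D, because all of those pairs share the same common ancestor. With the path-length measure, D simply has fewer edges between it and that ancestor than A, B or C do.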
For the updates posted, I am not seeing anything wrong. It appears to give correct results. Here is the tree as rendered by ete3 (I highlighted the 4 hops that are counted in the distance from Interest_sequence to Rhopalosiphum_maidis_Hemiptera):
and here is the matrix column for Interest_sequence that corresponds to it:
>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera 4.0
Drosophila_novamexicana__Hemiptera 5.0
Drosophila_arizonae__Hemiptera 6.0
Drosophila_navojoa__Hemiptera 6.0
Interest_sequence 0.0
Heliothis_virescens_droso_3a__nan 5.0
Mythimna_separata_droso__nan 6.0
Heliothis_virescens_droso_3i__nan 6.0
Scaptodrosophila_lebanonensis__Diptera 5.0
Mythimna_unipuncta_droso_A__nan 6.0
Xestia_c-nigrum_droso__nan 8.0
Helicoverpa_armigera_droso__nan 8.0
Mocis_latipes_droso__nan 7.0
Drosophila_busckii__Diptera 4.0
Drosophila_bipectinata__Diptera 5.0
Drosophila_mojavensis__Diptera 7.0
Drosophila_yakuba__Diptera 7.0
Drosophila_hydei__Diptera 7.0
Drosophila_serrata__Diptera 8.0
Drosophila_takahashii__Diptera 9.0
Drosophila_eugracilis__Diptera 11.0
Drosophila_ficusphila__Diptera 11.0
Drosophila_erecta__Diptera 12.0
Drosophila_melanogaster__Diptera 13.0
Sequence_A_nan__nan 14.0
Drosophila_sechellia__Diptera 15.0
Drosophila_simulans__Diptera 15.0
Drosophila_suzukii__Diptera 12.0
Drosophila_biarmipes__Diptera 12.0
Name: Interest_sequence, dtype: float64