bioinformatics taxonomy phylogeny ncbi etetoolkit

How to get the taxonomic IDs for kingdom, phylum, class, order, family, genus and species from a taxid?


I have a list of taxids that looks like this:

1204725
2162
1300163
420247

For each of the taxids above, I would like to get a file with the taxonomic IDs at the following ranks, in this order:

kingdom_id      phylum_id       class_id        order_id        family_id       genus_id        species_id   

I am using the "ete3" package. Its ete-ncbiquery tool reports the lineage for each of the IDs above. I run it on my Linux laptop with the command below:

ete3 ncbiquery --search 1204725 2162 13000163 420247 --info 

The result looks like this:

# Taxid Sci.Name    Rank    Named Lineage   Taxid Lineage
2162    Methanobacterium formicicum species root,cellular organisms,Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanobacterium,Methanobacterium formicicum   1,131567,2157,28890,183925,2158,2159,2160,2162
1204725 Methanobacterium formicicum DSM 3637    no rank root,cellular organisms,Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanobacterium,Methanobacterium formicicum,Methanobacterium formicicum DSM 3637  1,131567,2157,28890,183925,2158,2159,2160,2162,1204725
420247  Methanobrevibacter smithii ATCC 35061   no rank root,cellular organisms,Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanobrevibacter,Methanobrevibacter smithii,Methanobrevibacter smithii ATCC 35061  1,131567,2157,28890,183925,2158,2159,2172,2173,420247

I cannot tell which of these IDs, if any, correspond to the ranks I am looking for.


Solution

  • The following code:

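    import csv
    from ete3 import NCBITaxa

    # Handle to ete3's local copy of the NCBI taxonomy database.
    ncbi = NCBITaxa()

    def get_desired_ranks(taxid, desired_ranks):
        # Taxids on the path from the root down to the query taxid.
        lineage = ncbi.get_lineage(taxid)
        # Maps each taxid in the lineage to its rank name.
        lineage2ranks = ncbi.get_rank(lineage)
        # Invert the mapping so that a taxid can be looked up by rank name.
        ranks2lineage = {rank: taxid for (taxid, rank) in lineage2ranks.items()}
        # Ranks that do not occur in the lineage get a placeholder.
        return {'{}_id'.format(rank): ranks2lineage.get(rank, '<not present>') for rank in desired_ranks}

    def main(taxids, desired_ranks, path):
        # newline='' (Python 3) lets the csv module control the line endings.
        with open(path, 'w', newline='') as csvfile:
            fieldnames = ['{}_id'.format(rank) for rank in desired_ranks]
            # Tab-delimited output: one column per desired rank, one row per taxid.
            writer = csv.DictWriter(csvfile, delimiter='\t', fieldnames=fieldnames)
            writer.writeheader()
            for taxid in taxids:
                writer.writerow(get_desired_ranks(taxid, desired_ranks))

    if __name__ == '__main__':
        taxids = [1204725, 2162, 1300163, 420247]
        desired_ranks = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
        path = 'taxids.csv'
        main(taxids, desired_ranks, path)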
    

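    This produces a tab-separated file that looks like the one below. A value of <not present> means that no taxon in the lineage carries that rank, which is the case for "kingdom" in all four lineages here: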

    kingdom_id  phylum_id   class_id    order_id    family_id   genus_id    species_id
    <not present>   28890   183925  2158    2159    2160    2162
    <not present>   28890   183925  2158    2159    2160    2162
    <not present>   28890   183925  2158    2159    2160    2162
    <not present>   28890   183925  2158    2159    2172    2173
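
The rank lookup can also be tried without the NCBI database, which may help to see why kingdom_id comes out empty. The sketch below is only an illustration: pick_ranks and its aliases parameter are hypothetical names, and taxid2rank is a hand-written stand-in for what ncbi.get_rank(lineage) would return. The ranks for phylum through species match the output above; the 'superkingdom' rank for Archaea (2157) and the 'no rank' entries for 1 and 131567 are assumptions about the NCBI data and may differ between database versions.

```python
def pick_ranks(lineage, taxid2rank, desired_ranks, aliases=None,
               missing='<not present>'):
    """Return {'<rank>_id': taxid} for each desired rank found in the lineage."""
    aliases = aliases or {}
    # Invert taxid -> rank into rank -> taxid, walking from root to leaf.
    # Several taxa share 'no rank'; they overwrite each other here, which is
    # harmless as long as 'no rank' is not one of the desired ranks.
    rank2taxid = {}
    for taxid in lineage:
        rank2taxid[taxid2rank[taxid]] = taxid
    row = {}
    for rank in desired_ranks:
        # Try the rank itself first, then any alternative names given for it.
        candidates = [rank] + list(aliases.get(rank, []))
        row['{}_id'.format(rank)] = next(
            (rank2taxid[c] for c in candidates if c in rank2taxid), missing)
    return row


# Taxid lineage of 1204725, copied from the ncbiquery output in the question.
lineage = [1, 131567, 2157, 28890, 183925, 2158, 2159, 2160, 2162, 1204725]

# Hand-written stand-in for ncbi.get_rank(lineage); see the assumptions above.
taxid2rank = {
    1: 'no rank', 131567: 'no rank', 2157: 'superkingdom',
    28890: 'phylum', 183925: 'class', 2158: 'order',
    2159: 'family', 2160: 'genus', 2162: 'species', 1204725: 'no rank',
}

desired_ranks = ['kingdom', 'phylum', 'class', 'order',
                 'family', 'genus', 'species']

# Let 'kingdom' fall back to 'superkingdom' when no kingdom is in the lineage.
row = pick_ranks(lineage, taxid2rank, desired_ranks,
                 aliases={'kingdom': ['superkingdom']})
print(row)
```

Under these assumptions, the fallback fills kingdom_id with 2157 (Archaea) while the other columns stay as in the file above. With the real database, the same idea would apply to the dictionaries returned by ncbi.get_lineage and ncbi.get_rank.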