python bioinformatics biopython ncbi

Gene Protein Sequence Database


I am wondering if there is a way to download or retrieve the protein sequences of genes from NCBI. I have a large number of GeneIDs that I would like to iterate over, retrieving the protein sequence for each.

Is there a package I can use for this, or a link to the protein sequences of genes from NCBI?


Solution

  • If I understand correctly what you want, you can download the data directly from the NCBI website. A search for 'protein sequences of genes' returns 45260 records, which you can download by clicking "Send to" (top-right corner) and saving them as a file. Check here. After downloading, you can simply load the data from the file.

    If you were asking about downloading the data programmatically, you can use this FTP: download the latest data, unpack it, and filter by GeneID to find what you are looking for. Most of these files are updated daily. You can read more here and, based on that, choose which file contains the data you need. As far as I can tell, you would need either gene2accession.gz or gene2refseq.gz.
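
The filtering step could look roughly like the sketch below. It assumes you have already downloaded gene2refseq.gz, that the file is tab-separated with a header line starting with '#', and that it has columns named GeneID and protein_accession.version, with '-' marking a missing value. Check these assumptions against the header of the file you actually download. The function name protein_accessions_by_gene is made up for this illustration.

```python
import gzip
from collections import defaultdict


def protein_accessions_by_gene(path, gene_ids):
    """Map each requested GeneID to the protein accessions listed for it.

    Assumes a gzip-compressed, tab-separated file whose first line is a
    header containing 'GeneID' and 'protein_accession.version' columns.
    """
    wanted = {str(gene_id) for gene_id in gene_ids}
    found = defaultdict(set)
    with gzip.open(path, "rt") as handle:
        # The header line starts with '#', e.g. '#tax_id'.
        header = handle.readline().rstrip("\n").lstrip("#").split("\t")
        gene_col = header.index("GeneID")
        protein_col = header.index("protein_accession.version")
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip("\n").split("\t")
            gene_id = fields[gene_col]
            if gene_id not in wanted:
                continue
            accession = fields[protein_col]
            if accession != "-":  # '-' is assumed to mean "no value"
                found[gene_id].add(accession)
    return dict(found)
```

Calling protein_accessions_by_gene("gene2refseq.gz", [1, 2]) would return a dictionary keyed by GeneID (as strings), each holding a set of protein accessions. The file is read line by line because the uncompressed table is large, so loading it whole may not fit in memory. GeneIDs with no protein accession in the file are simply absent from the result.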