python biopython genetics

Extracting the gene start location from a .fasta file in Python using Biopython


I have a .fasta file with multiple genes in it. They all have a similar description such as this:

>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]

I am trying to extract the gene start location for all of these genes (i.e., "1" from the example above). I have tried the following code, but it doesn't work.

from Bio import SeqIO
genes = fasta_file.fasta
records = SeqIO.parse(open(genes), 'fasta')
record = next(records)
parts = record.description.split("..")
print(parts[0])

Any help or resources would be appreciated!


Solution

  • This worked for me. Hope this helps.

    import re
    from Bio import SeqIO
    
    genes = "fasta_file.fasta"
    records = SeqIO.parse(genes, 'fasta')
    
    # For this example, fasta_file.fasta contains only this header line:
    # >lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
    

    You can get the records with SeqIO.parse(filename, "fasta"). To check this,

    for record in SeqIO.parse(genes, 'fasta'):
        print(record)
    

    prints the output below. The full header string is available in record.description.

    ID: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
    Name: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
    Description: lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
    Number of features: 0
    Seq('', SingleLetterAlphabet())

    Get the number after "location=" with a regex.

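    # Raw string so that \d and \. reach the regex engine unchanged.
    ma = re.search(r"location=(\d+)\.\.\d+", record.description)
    ma.group(1)  # '1' (a string; wrap it in int() if you need a number)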
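
Since the question asks for the start of every gene, the same idea can be applied to each record in a loop. The following is a minimal sketch, not a definitive implementation: the helper names start_from_description and starts_by_id are made up for illustration, and the handling of complement(...), join(...) and the "<" / ">" partial markers is an assumption about what other headers in the file may contain (the example header only shows the plain 1..1356 form).

```python
import re

# Matches the first coordinate inside "[location=...]". The optional
# complement(/join( wrappers and "<" / ">" markers are assumptions about
# how other headers in the file might look.
LOCATION_START = re.compile(r"\[location=(?:complement\(|join\()*[<>]?(\d+)")


def start_from_description(description):
    """Return the first coordinate of the location field as an int, or None."""
    match = LOCATION_START.search(description)
    return int(match.group(1)) if match else None


def starts_by_id(fasta_path):
    """Map each record ID in a FASTA file to its first location coordinate."""
    # Imported here so the helper above can be used without Biopython.
    from Bio import SeqIO

    return {
        record.id: start_from_description(record.description)
        for record in SeqIO.parse(fasta_path, "fasta")
    }
```

With the one-record file from above, starts_by_id("fasta_file.fasta") should map the record ID to 1. Note that for a complement(...) location the first coordinate is the lowest position on the reference, which may not be the biological start of the gene.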