I have a .fasta file with multiple genes in it. They all have a similar description such as this:
>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
I am trying to extract the gene start location for all of these genes (i.e., "1" from the example above). I have tried the following code, but it doesn't work.
from Bio import SeqIO
genes = fasta_file.fasta
records = SeqIO.parse(open(genes), 'fasta')
record = next(records)
parts = record.description.split("..")
print(parts[0])
Any help or resources would be appreciated!
This worked for me. Hope this helps.
import re
from Bio import SeqIO
genes = "fasta_file.fasta"
records = SeqIO.parse(genes, 'fasta')
# For this test, fasta_file.fasta contains only this header line:
>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
You can get the records with SeqIO.parse(filename, "fasta").
To check this,
for record in SeqIO.parse(genes, 'fasta'):
    print(record)
This prints the record below. record.description holds the full header string.
ID: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Name: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Description: lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
Number of features: 0
Seq('', SingleLetterAlphabet())
Get number after "location=" with regex.
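# Use a raw string so the backslashes reach the regex engine unchanged.
ma = re.search(r"location=(\d+)\.\.\d+", record.description)
ma.group(1)  # '1' (a string; wrap in int() if you need a number)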
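To get the start for every gene rather than a single record, the same regex idea can be wrapped in a small helper. This is only a sketch: the names `gene_start` and `gene_starts` are made up for illustration, and the pattern assumes that some location tags may be wrapped, for example as `complement(...)` or `join(...)`, or begin with a `<` marker. In those cases it returns the first coordinate that appears in the tag, which may not be the biological start on the reverse strand.

```python
import re

# First run of digits inside [location=...]; any non-digit prefix such as
# "complement(", "join(" or "<" is skipped (assumed variations of the tag).
LOCATION_START = re.compile(r"\[location=[^\]\d]*(\d+)")


def gene_start(description):
    """Return the first coordinate of the location tag as an int, or None."""
    match = LOCATION_START.search(description)
    return int(match.group(1)) if match else None


def gene_starts(fasta_path):
    """Map each record ID in a FASTA file to its start coordinate."""
    # Imported here so gene_start() can be used without Biopython installed.
    from Bio import SeqIO
    return {
        record.id: gene_start(record.description)
        for record in SeqIO.parse(fasta_path, "fasta")
    }


# Header from the question, abbreviated to the relevant tags.
header = "lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [location=1..1356] [gbkey=CDS]"
print(gene_start(header))  # 1
```

Calling `gene_starts("fasta_file.fasta")` would then give a dictionary keyed by record ID. Records without a location tag map to `None`, so they are easy to spot.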