Hello I'm trying to extract the coding sequences from a fasta file using a gff file with the help of biopython (https://biopython.org/wiki/GFF_Parsing)
I have tried doing what this tutorial describes but there is something I just don't seem to get right for some reason: when I iterate over the features of the sequence record only 'gff_type':'gene' is recognized.
Here is an example of what my gff file looks like:
As you can see, my file clearly contains gff_type='CDS' entries
But when I run a simple script like this:
for rec in GFF.parse(in_handle):
for feature in rec.features:
if feature.type == 'CDS':
print(feature)
No output is returned whatsoever. And when I run something like this:
for rec in GFF.parse(in_handle):
print(rec.features)
break
SeqFeature(FeatureLocation(ExactPosition(2951), ExactPosition(4284), strand=-1), type='gene', id='Csa1G000010'),
SeqFeature(FeatureLocation(ExactPosition(5473), ExactPosition(14089), strand=-1), type='gene', id='Csa1G000020'),
SeqFeature(FeatureLocation(ExactPosition(18683), ExactPosition(21806), strand=1), type='gene', id='Csa1G000030'),
...
There are 2 things I don't understand:
gff_type: {('CDS',):117359, ('exon',):120675, ('five_prime_utr',):16038, ('gene',):24274, ('mRNA',):24274, ('three_prime_utr',):15588}
I feel like the solution for me would be to iterate within each gene and extract the CDS that way but there is no SeqRecord.feature.feature attribute
There is however a SeqRecord.feature.sub_feature attribute and resulted in iteration over the mRNA in the gene (of which there is also only 1) but there is no sub_sub_feature (I checked with dir)
And so Im stuck with this problem that seems so incredibly simple. I know I could just iterate over the gff file as txt making use of the tab seperation but I am trying to get more acquainted with Biopython but to no avail (the Biopython documentation is also unfortunately quite bad). I hope someone knows how to set me on my way.
Kind Regards
The structure is gene -> mRNA -> exon | CDS | other sequences like 3'UTR. The exon or CDS are sub_features of mRNA, not sub_sub_features of gene.
See Examining your GFF file here: https://biopython.org/wiki/GFF_Parsing