Tags: python, bioinformatics, biopython, gff

Bcbio-gff File creation issue


When I create a file with GFF.write(), I get an extra line with "annotation" as the source and "remark" as the type, followed by a percent-encoded (URL-encoded) list of sequence regions:

##gff-version 3
##sequence-region NC_011594.1 1 16779
NC_011594.1 annotation  remark  1   16779   .   .   .   gff-version=3;sequence-region=%28%27NC_011594.1%27%2C 0%2C 16971%29,%28%27NC_042493.1%27%2C 0%2C 132544852%29, (continues on and on)
NC_011594.1 RefSeq  gene    1   1531    .   +   .   Dbxref=GeneID:7055888;ID=gene-COX1;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding

Any idea why it's there, what it's for, and how I can avoid it? I fear it might cause problems in third-party software.

I imported only the bcbio-gff package, but I believe it's part of the Biopython project: https://biopython.org/wiki/GFF_Parsing


Solution

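  • To your first question - "Why is it there?"

    GFF.write() serializes the record-level annotations (the annotations dictionary of each SeqRecord) as a single "annotation remark" feature spanning the whole sequence. In your output that dictionary appears to hold the gff-version and sequence-region values, which is why they show up percent-encoded in the attributes column.

    To your next question - "How can I avoid it?"

    Empty the record's annotations before writing, so there is nothing to serialize.

    Example: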

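    from Bio import SeqIO
    from BCBio import GFF
    
    # Read a single GenBank record as a SeqRecord
    g = SeqIO.read('NC_003888.3.gb', 'gb')
    
    # Clear the record-level annotations so GFF.write() emits no "annotation remark" line
    g.annotations = {}
    
    with open('t2.gff', 'w') as f:
        GFF.write([g], f)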
    

    Head of the output file (no "annotation remark" line):

    head t2.gff 
    ##gff-version 3
    ##sequence-region NC_003888.3 1 8667507
    NC_003888.3 feature source  1   8667507 ... removed for clarity ....
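
If the GFF file has already been written and re-running the export is inconvenient, the extra line can also be filtered out afterwards. The following is only a sketch using the standard library: the function name strip_annotation_remarks is hypothetical, and it assumes the file is tab-separated with "annotation" in the source column and "remark" in the type column, as in the output shown in the question.

```python
import io


def strip_annotation_remarks(in_handle, out_handle):
    """Copy GFF3 lines, skipping 'annotation remark' rows.

    Hypothetical helper, not part of bcbio-gff. Returns the number of
    rows that were dropped.
    """
    dropped = 0
    for line in in_handle:
        # Directive and comment lines (starting with '#') are kept untouched.
        if not line.startswith("#"):
            cols = line.rstrip("\n").split("\t")
            # Column 2 is the source, column 3 is the feature type.
            if len(cols) >= 3 and cols[1] == "annotation" and cols[2] == "remark":
                dropped += 1
                continue
        out_handle.write(line)
    return dropped


# Small in-memory sample modelled on the output in the question
# (attribute values shortened; real files would be opened with open()).
sample = (
    "##gff-version 3\n"
    "##sequence-region NC_011594.1 1 16779\n"
    "NC_011594.1\tannotation\tremark\t1\t16779\t.\t.\t.\tgff-version=3\n"
    "NC_011594.1\tRefSeq\tgene\t1\t1531\t.\t+\t.\tID=gene-COX1;Name=COX1\n"
)

cleaned = io.StringIO()
n_dropped = strip_annotation_remarks(io.StringIO(sample), cleaned)
print(cleaned.getvalue(), end="")
```

Clearing the annotations before GFF.write(), as in the example above, is the cleaner fix; the filter is only a fallback for files that already exist.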