awkbioinformaticstrimfastadna-sequence

Trim first and last bases in fasta file


I have a circular mtDNA reference. After the alignment I would like to cut first and last 15 bases in my fasta files. How can I do that?

For example, this is my sequence and I need to take out first and last 15 letters. The first 15 characters will not always be the same for all the samples.

>ref|NC_005038.2| mitochondrion_circulized, complete genome
TTTACATGGTAAGTGGTTGATGTAGCTTAAACTTAAAGCAAGGCACTGAAAATGCCTAGATGAGTGTACC
AACTCCATAAACACATAGGTTTGGTCCCAGCCTTCCTGTTAACTCTCAACAGACTTACACATGCAAGCAT
CCACGCCCCGGTGAGTAACGCCCTCCAAATCAATAAGACTAAGAGGAGCAGGTATCAAGCACACATCTTG
TAGCTTATAACGCCTCGCTTAACCACACCCCTACGGGAGACAGCAGTGACAAAAATTAAGCCATAAACGA

The output I expect would be:

>ref|NC_005038.2| mitochondrion_circulized, complete genome
GTTGATGTAGCTTAAACTTAAAGCAAGGCACTGAAAATGCCTAGATGAGTGTACCAACTCCATAAACACA
TAGGTTTGGTCCCAGCCTTCCTGTTAACTCTCAACAGACTTACACATGCAAGCATCCACGCCCCGGTGAG
TAACGCCCTCCAAATCAATAAGACTAAGAGGAGCAGGTATCAAGCACACATCTTGTAGCTTATAACGCCT
CGCTTAACCACACCCCTACGGGAGACAGCAGTGACAAAAA

Solution

  • Here is one way to do it using Python's Biopython module:

    from Bio import SeqIO
    import textwrap #use it to wrap it to certain column size
    
    input_file   = open('input.fasta','r')
    output_file  = open('output.fasta', "w")
    
    for seq_record in SeqIO.parse(input_file, "fasta"):
        sequence = str(seq_record.seq)
        trimmed_sequence = sequence[15:-15]
        wrapped_sequene  = textwrap.fill(trimmed_sequence, width = 60)
        fasta_record = f'>{seq_record.id}\n{wrapped_sequene}'
        output_file.write(fasta_record)
        print(fasta_record) # will print on the screen
    
    input_file.close()
    output_file.close()