I have a GFF file that looks like this:
contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;
contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci exon 452592 453069 . - . ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci exon 452153 452543 . - . ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci exon 452592 452691 . - . ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci exon 452729 453069 . - . ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
I wish to rename the ID names, starting from 0001, such that for the above gene the entry is:
contig1 loci gene 452050 453069 15 - . ID=dd_0001;
contig1 loci mRNA 452050 453069 14 - . ID=dd_0001.1;Parent=dd_0001
contig1 loci exon 452050 452543 . - . ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci exon 452592 453069 . - . ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_0001.2;Parent=dd_0001
contig1 loci exon 452153 452543 . - . ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci exon 452592 452691 . - . ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci exon 452729 453069 . - . ID=dd_0001.2.exon3;Parent=dd_0001.2
The above example is simply for one gene entry, but I wish to rename all genes, and their corresponding mRNA/exon, consecutively starting from ID = dd_0001. Any hints on how to do this would be much appreciated.
The file needs to be opened, and the ID then replaced line by line.
See the Python docs for file I/O and str.replace().
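gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'
lines = []
# Read every line, replacing the old ID wherever it appears
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)
# Write the modified lines back to the same file
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)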
Tested in Windows 10, Python 3.5.1, this works.
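To see why a plain str.replace() is enough for a single gene, here is a minimal sketch that runs the same replacement in memory on three of the sample lines from the question. The helper name rename_gene is hypothetical and only used for illustration; the point is that the mRNA and exon IDs, and their Parent attributes, all share the gene ID as a prefix, so one replacement updates them together.

```python
def rename_gene(lines, old_id, new_id):
    """Return a copy of lines with every occurrence of old_id replaced by new_id."""
    return [line.replace(old_id, new_id) for line in lines]


# Sample lines copied from the question
sample = [
    "contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;\n",
    "contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
    "contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1\n",
]

renamed = rename_gene(sample, "dd_g4_1G94", "dd_0001")
print("".join(renamed), end="")
```

This only handles one ID at a time, so it does not scale to a whole file of genes without knowing every ID in advance.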
To search for ids, you should use regex.
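import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
# Capture the gene-level part of each ID: everything up to the first ';' or '.'
re_pattern = r'ID=(.*?)[;.]'

ids = []    # unique gene IDs, in order of first appearance
lines = []  # rewritten lines to write back

with open(gff_filename, 'r') as gff_file:
    file_lines = [line for line in gff_file]

# First pass: collect the unique gene IDs
for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

# Second pass: replace each old ID with its numbered name
for line in file_lines:
    for ID in ids:
        if ID in line:
            # index() is zero-based; add 1 so numbering starts at 0001
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)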
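There are other ways of doing this, but this handles any number of genes, provided no gene ID is a substring of another (the replacement is a plain substring match).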
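One of those other ways, as a rough sketch rather than a definitive solution: do the renaming in a single pass with re.sub() and a dictionary, matching only the value that follows ID= or Parent=. This avoids the substring issue and the repeated ids.index() lookups. The function name renumber_ids and its prefix and width parameters are my own, and the sketch assumes gene IDs contain no '.', ';' or whitespace. The second gene in the sample is made up, only to show the counter advancing.

```python
import re

# Matches the gene-level part of an ID= or Parent= value: everything after
# '=' up to the first '.', ';' or whitespace character.
ID_PATTERN = re.compile(r'\b(ID|Parent)=([^.;\s]+)')


def renumber_ids(lines, prefix='dd_', width=4):
    """Rename gene IDs consecutively (prefix + 0001, 0002, ...) in one pass."""
    mapping = {}  # old gene ID -> new gene ID, in order of first appearance

    def substitute(match):
        key, old_id = match.group(1), match.group(2)
        if old_id not in mapping:
            mapping[old_id] = prefix + str(len(mapping) + 1).zfill(width)
        return key + '=' + mapping[old_id]

    return [ID_PATTERN.sub(substitute, line) for line in lines]


sample = [
    "contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;\n",
    "contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
    "contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1\n",
    # Hypothetical second gene, not from the question
    "contig1 loci gene 460000 461000 15 + . ID=dd_g4_1G95;\n",
]

result = renumber_ids(sample)
print("".join(result), end="")
```

To apply it to a file, read the lines with open(...) as in the code above, pass them to renumber_ids(), and write the result with writelines(). Writing to a new file rather than overwriting the input is safer while testing.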