Renaming IDs in a GFF file

I have a GFF file that looks like this:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_g4_1G94;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci    exon    452050  452543  .   -   .   ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###

I wish to rename the ID names, starting from 0001, such that for the above gene the entry is:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_0001;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_0001.1;Parent=dd_0001
contig1 loci    exon    452050  452543  .   -   .   ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_0001.2;Parent=dd_0001
contig1 loci    exon    452153  452543  .   -   .   ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_0001.2.exon3;Parent=dd_0001.2

The above example is simply for one gene entry, but I wish to rename all genes, and their corresponding mRNA/exon, consecutively starting from ID = dd_0001. Any hints on how to do this would be much appreciated.

Solution

Open the file, replace the ID line by line, then write the result back.
See the Python documentation for file I/O and str.replace().

gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'

# Read every line, replacing the old ID wherever it occurs
lines = []
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

# Write the updated lines back to the same file (this overwrites it)
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

Tested on Windows 10 with Python 3.5.1.
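
As a quick check of what the replacement does, here is a minimal sketch that applies the same str.replace() call to two lines from the question held in memory rather than in a file. The helper name rename_in_lines is made up for illustration, and the sample assumes the columns are tab-separated, as GFF normally is. Because str.replace() rewrites every occurrence on the line, both the ID and the Parent attributes are updated.

```python
def rename_in_lines(lines, old_id, new_id):
    """Return a new list with every occurrence of old_id replaced by new_id.

    Hypothetical helper for illustration; not part of any GFF library.
    """
    return [line.replace(old_id, new_id) for line in lines]


# Two lines from the question (tab-separated columns are an assumption)
sample = [
    "contig1\tloci\tgene\t452050\t453069\t15\t-\t.\tID=dd_g4_1G94;\n",
    "contig1\tloci\tmRNA\t452050\t453069\t14\t-\t.\tID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
]

renamed = rename_in_lines(sample, "dd_g4_1G94", "dd_0001")
for line in renamed:
    print(line, end="")
```

The mRNA line ends in ID=dd_0001.1;Parent=dd_0001, which is the form asked for in the question.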

To find the IDs automatically instead of hard-coding each one, use a regular expression.

import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
# Capture the gene-level ID: the text after "ID=" up to the first ';' or '.'
re_pattern = r'ID=(.*?)[;.]'

ids = []
lines = []
with open(gff_filename, 'r') as gff_file:
    file_lines = [line for line in gff_file]

# First pass: collect the unique gene IDs in order of first appearance
for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

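# Second pass: replace each old ID with its new zero-padded number
for line in file_lines:
    for ID in ids:
        if ID in line:
            # Add 1 so numbering starts at 0001 rather than 0000
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))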
    lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

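There are other ways of doing this, but this works for IDs like those in the example. Because it relies on plain substring replacement, it can misfire if one gene ID is a prefix of another (for example dd_g4_1G9 and dd_g4_1G94).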
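
One such alternative, sketched here under stated assumptions, does the renaming in a single pass with re.sub() and a replacement function. It only touches text that directly follows ID= or Parent=, and it stores the old-to-new mapping in a dict. The function name renumber_gene_ids and the second gene in the sample (dd_g4_1G9) are made up for illustration. The sketch assumes that child IDs always have the form <gene>.<n>..., as in the question, and that no other attributes carry the gene ID.

```python
import re

# Matches the gene-level part of an ID= or Parent= value: everything after
# the '=' up to the first '.', ';' or whitespace character.
GENE_ID = re.compile(r"\b(ID|Parent)=([^.;\s]+)")


def renumber_gene_ids(lines, template="dd_{:04d}"):
    """Renumber gene IDs consecutively from 1, in order of first appearance.

    Hypothetical helper; assumes child IDs look like <gene>.<n>[.exon<m>].
    Returns the rewritten lines and the old-to-new ID mapping.
    """
    mapping = {}  # old gene ID -> new gene ID

    def substitute(match):
        old_id = match.group(2)
        if old_id not in mapping:
            # len(mapping) + 1 gives 1 for the first gene, 2 for the next, ...
            mapping[old_id] = template.format(len(mapping) + 1)
        return "{}={}".format(match.group(1), mapping[old_id])

    return [GENE_ID.sub(substitute, line) for line in lines], mapping


sample = [
    "contig1\tloci\tgene\t452050\t453069\t15\t-\t.\tID=dd_g4_1G94;\n",
    "contig1\tloci\tmRNA\t452050\t453069\t14\t-\t.\tID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
    "contig1\tloci\texon\t452050\t452543\t.\t-\t.\tID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1\n",
    # Made-up second gene whose ID is a prefix of the first one
    "contig2\tloci\tgene\t100\t200\t9\t+\t.\tID=dd_g4_1G9;\n",
]

renamed, mapping = renumber_gene_ids(sample)
for line in renamed:
    print(line, end="")
```

Since the pattern matches the whole gene ID rather than a substring, the prefix case (dd_g4_1G9 versus dd_g4_1G94) maps to two distinct new IDs. To apply it to a file, read the lines with open(), pass them to the function, and write the returned list back with writelines(), as in the code above.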