here's a tough one.
Problem Introduction:
I'm working with two different files: a GFF3, which is basically a "9 columns" TSV, and a FASTA, which is a text file.
Now, importing the GFF3 file with gffpandas
it looks like this:
seq_id source type start end score strand phase attributes
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005
Dropping the seq_id
column I got the following "values":
ctg.s1.000000F_arrow
ctg.s1.000001F_arrow
ctg.s1.000003F_arrow
ctg.s2.000000F_arrow
Now let's step to the FASTA file, which looks like this:
>ctg.s1.000000F_arrow
CCGGAACATCGTCCTCATCCGCAAAGTCGAGCTCTGCCTCGATCATTGCACGCGAATGGGTCAGCCGTCGGGCCCAACCG
GCATAGAGTGCGGACTGTCCGCCACCGGACTGCTCTATGGCGAGACGACGCTGCATTTCCGTTTCTGCCGCAATCAGGTC
>ctg.s1.000001F_arrow
ACGCCGGCTGCAACTTTGAGAAGATGTGGCGATGTCGACCGCTGCATCCCGCCCTTCTCTGCAGAATTTTCCAGCTGTCC
GAGGACATTGGCAAAAAGGCCCTTGGAATCCTTGCGGCTCATTCTCTCCCCCATGCCTTCCAGAAGAGGCCCTCGAGTTC
>ctg.s1.000003F_arrow
GGCGCTGGTTTTCCCCGACACCTCGCCGCGCGGCGAGGGCGTGGCTGACGACGAGGCTTATGATCTCGGTCAGGGTGCGG
GCTTCTATGTCAATGCGACGCAGAAGCCCTGGTCGCCGCACTATCGCATGTATGATTATATCGTCACCGAATTGCCCGCC
>ctg.s2.000000F_arrow
GCGCTCGACGGCATGCCCGTACGCGGCCGATCCTGCGCCGCTTCCTTAACCTTAGCTGCGGATGGAAAGTCGTCCTCGGA
GTTCGGCTCGCAAACGCTTTCGAGCGCGCAATTGACGACGATGTCGTACCCAACTTAGATCGCCGAACGCCATGAGGTCG
Assuming that the uppercased text part is much longer than two lines, as you can see, the text part characterised by ">" symbol presents the same values of seq_id
GFF3 column.
As a matter of fact I wrote few line to assign to the FASTA file a dictionary in which the "key" is the text part characterised by ">" symbol, the "item" is the uppercased part.
Problem processing
For each attributes
value inside the dataframe there's a corresponding start
and end
value which is an interval of the corresponding seq_id
. I'd like to extract from the FASTA file that exact interval but with respect to the seq_id
value which refers to. I mean the the interval 56-244 must be searched only for the FASTA sequence of ctg.s1.000000F_arrow
, as well as 902-2167.
The final goal is to obtain a dataframe which has an additional 10th column (es. 'sequence') that contains the corresponding FASTA sequence of the interval, like this:
seq_id source type start end score strand phase attributes sequence
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001 CCGGAACATCGTCCTCATCCG
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002 CAAGGACATCGTGATCAATTC
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003 TCGCCGCGCGGCGAGTGATTA
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004 TCGAGCGCGCAATTGACGACG
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005 AGATCGCCGAACGCCATATTT
N.B. The sequences in sequence
have been randomly typed of the same length but will differ proportionally to the end
- start
dimension for each attributes
value.
I hope I was clear in the explanation. Thank you everybody for the help.
Assuming df
the DataFrame and dic
the dictionary and the sequence indexing to be starting at 1 (not 0 like python indexing):
df['sequence'] = [dic[k][int(i-1):int(j)] for k, i, j in
zip(df['seq_id'], df['start'], df['end'])]