pandasdataframenumpyfastagff

Iterating through Pandas dataframe and dictionary items


here's a tough one.

Problem Introduction:

I'm working with two different files: a GFF3, which is basically a "9 columns" TSV, and a FASTA, which is a text file. Now, importing the GFF3 file with gffpandas it looks like this:

                     seq_id  source  type     start       end score strand phase      attributes
1      ctg.s1.000000F_arrow  prokka  gene      56.0     244.0     .      +     .  NHDIEHOJ_00001
3      ctg.s1.000000F_arrow  prokka  gene     902.0    2167.0     .      -     .  NHDIEHOJ_00002
5      ctg.s1.000001F_arrow  prokka  gene    2363.0    2905.0     .      -     .  NHDIEHOJ_00003
7      ctg.s1.000003F_arrow  prokka  gene    2916.0    3947.0     .      -     .  NHDIEHOJ_00004
9      ctg.s2.000000F_arrow  prokka  gene    4353.0    5174.0     .      +     .  NHDIEHOJ_00005

Dropping the seq_id column I got the following "values":

ctg.s1.000000F_arrow
ctg.s1.000001F_arrow
ctg.s1.000003F_arrow
ctg.s2.000000F_arrow

Now let's step to the FASTA file, which looks like this:

>ctg.s1.000000F_arrow
CCGGAACATCGTCCTCATCCGCAAAGTCGAGCTCTGCCTCGATCATTGCACGCGAATGGGTCAGCCGTCGGGCCCAACCG
GCATAGAGTGCGGACTGTCCGCCACCGGACTGCTCTATGGCGAGACGACGCTGCATTTCCGTTTCTGCCGCAATCAGGTC
>ctg.s1.000001F_arrow
ACGCCGGCTGCAACTTTGAGAAGATGTGGCGATGTCGACCGCTGCATCCCGCCCTTCTCTGCAGAATTTTCCAGCTGTCC
GAGGACATTGGCAAAAAGGCCCTTGGAATCCTTGCGGCTCATTCTCTCCCCCATGCCTTCCAGAAGAGGCCCTCGAGTTC
>ctg.s1.000003F_arrow
GGCGCTGGTTTTCCCCGACACCTCGCCGCGCGGCGAGGGCGTGGCTGACGACGAGGCTTATGATCTCGGTCAGGGTGCGG
GCTTCTATGTCAATGCGACGCAGAAGCCCTGGTCGCCGCACTATCGCATGTATGATTATATCGTCACCGAATTGCCCGCC
>ctg.s2.000000F_arrow
GCGCTCGACGGCATGCCCGTACGCGGCCGATCCTGCGCCGCTTCCTTAACCTTAGCTGCGGATGGAAAGTCGTCCTCGGA
GTTCGGCTCGCAAACGCTTTCGAGCGCGCAATTGACGACGATGTCGTACCCAACTTAGATCGCCGAACGCCATGAGGTCG

Assuming that the uppercased text part is much longer than two lines, as you can see, the text part characterised by ">" symbol presents the same values of seq_id GFF3 column. As a matter of fact I wrote few line to assign to the FASTA file a dictionary in which the "key" is the text part characterised by ">" symbol, the "item" is the uppercased part.

Problem processing

For each attributes value inside the dataframe there's a corresponding start and end value which is an interval of the corresponding seq_id. I'd like to extract from the FASTA file that exact interval but with respect to the seq_id value which refers to. I mean the the interval 56-244 must be searched only for the FASTA sequence of ctg.s1.000000F_arrow, as well as 902-2167. The final goal is to obtain a dataframe which has an additional 10th column (es. 'sequence') that contains the corresponding FASTA sequence of the interval, like this:

                     seq_id  source  type     start       end score strand phase      attributes    sequence
1      ctg.s1.000000F_arrow  prokka  gene      56.0     244.0     .      +     .  NHDIEHOJ_00001    CCGGAACATCGTCCTCATCCG
3      ctg.s1.000000F_arrow  prokka  gene     902.0    2167.0     .      -     .  NHDIEHOJ_00002    CAAGGACATCGTGATCAATTC
5      ctg.s1.000001F_arrow  prokka  gene    2363.0    2905.0     .      -     .  NHDIEHOJ_00003    TCGCCGCGCGGCGAGTGATTA
7      ctg.s1.000003F_arrow  prokka  gene    2916.0    3947.0     .      -     .  NHDIEHOJ_00004    TCGAGCGCGCAATTGACGACG
9      ctg.s2.000000F_arrow  prokka  gene    4353.0    5174.0     .      +     .  NHDIEHOJ_00005    AGATCGCCGAACGCCATATTT

N.B. The sequences in sequence have been randomly typed of the same length but will differ proportionally to the end - start dimension for each attributes value.

I hope I was clear in the explanation. Thank you everybody for the help.


Solution

  • Assuming df the DataFrame and dic the dictionary and the sequence indexing to be starting at 1 (not 0 like python indexing):

    df['sequence'] = [dic[k][int(i-1):int(j)] for k, i, j in
                      zip(df['seq_id'], df['start'], df['end'])]