I have the following pandas dataframe (it's a gff file):
df = pd.DataFrame.from_dict({'scaffold name': {0: 'Tname16C00001.1',
1: 'Tname16C00001.1',
2: 'Tname16C00001.1',
3: 'Tname16C00001.1',
4: 'Tname16C00001.1',
5: 'Tname16C00001.1',
6: 'Tname16C00001.1',
7: 'Tname16C00001.1',
8: 'Tname16C00001.1',
9: 'Tname16C00001.1'},
'source': {0: 'annotation',
1: 'feature',
2: 'feature',
3: 'feature',
4: 'feature',
5: 'feature',
6: 'feature',
7: 'feature',
8: 'feature',
9: 'feature'},
'type': {0: 'remark',
1: 'region',
2: 'gene',
3: 'CDS',
4: 'gene',
5: 'CDS',
6: 'gene',
7: 'CDS',
8: 'gene',
9: 'CDS'},
'start': {0: 1,
1: 1,
2: 943,
3: 943,
4: 1964,
5: 1964,
6: 2386,
7: 2386,
8: 3998,
9: 3998},
'stop': {0: 12018,
1: 12018,
2: 1992,
3: 1992,
4: 2179,
5: 2179,
6: 3897,
7: 3897,
8: 5152,
9: 5152},
'score': {0: '.',
1: '.',
2: '.',
3: '.',
4: '.',
5: '.',
6: '.',
7: '.',
8: '.',
9: '.'},
'strand': {0: '.',
1: '+',
2: '+',
3: '+',
4: '-',
5: '-',
6: '+',
7: '+',
8: '+',
9: '+'},
'phase': {0: '.',
1: '.',
2: '.',
3: '0',
4: '.',
5: '0',
6: '.',
7: '0',
8: '.',
9: '0'},
'attr': {0: 'accession=Tname16C00001.1;comment=Annotations were generated from the MicroScope annotation platform. Additional results are available at http://www.genoscope.cns.fr/agc/microscope . This file is not suitable for direct databank submission. To contact us: mage%40genoscope.cns.fr .%0AMicroscope genomic region coordinates: 1..12018;data_file_division=BCT;date=05-NOV-2019;organism=Genus Species Strain;source=Genus Species Strain;topology=linear',
1: 'Is_circular=false;Note=whole genome shotgun linear WGS contig 1;db_xref=taxon:1907535,MaGe/Organism_id:12155,MaGe/Species_code:Tname16,MaGe/Sequence_id:16744,MaGe/Scaffold_id:1,MaGe/Contig_id:1,MaGe/Contig_label:NNNNNODE_1_length_11870_cov_199.017943;mol_type=genomic DNA;organism=Candidatus Thiosymbion hypermnestrae;strain=Strain',
2: 'locus_tag=Tname16_v1_10001',
3: 'ID=71429338;db_xref=MaGe:71429338;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10001;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=MG',
4: 'locus_tag=Tname16_v1_10002',
5: 'ID=71429339;db_xref=MaGe:71429339;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10002;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=MI',
6: 'gene=wcaJ;locus_tag=Tname16_v1_10003',
7: 'ID=71429340;db_xref=MaGe:71429340;ec_number=2.7.8.31;gene=wcaJ;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10003;product=UDP-glucose:undecaprenyl-phosphate glucose-1-phosphate transferase;transl_table=11;translation=MY',
8: 'gene=rffE;locus_tag=Tname16_v1_10004',
9: 'ID=71429341;db_xref=MaGe:71429341;ec_number=5.1.3.14;function=1.6.4 : Enterobacterial common antigen %28surface glycolipid%29,6.1 : Membrane,6.3 : Surface antigens %28ECA%2C O antigen of LPS%29,7.1 : Cytoplasm;gene=rffE;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10004;note=Evidence 2a : Function from experimental evidences in other organisms%3B PubMedId 11106477%2C 7559340%2C 8170390%2C 8226648%3B Product type e : enzyme;product=UDP-N-acetyl glucosamine-2-epimerase;transl_table=11;translation=MT'}})
The values in the column attr
are actually additional columns, but the file format gff
doesn't allow that. I want to split the text in this column into multiple columns. The values are in a broader sense dictionaries, meaning that each key has a value, separated by =
(e.g. accession=Tname16C00001.1
), and each key-value pairs are separated by ;
.
I started by splitting each key-value pair into two columns per df
row:
s = df['attr'].str.split(';').apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'attr'
del df['attr']
df.join(s)
df.join(s.apply(lambda x: pd.Series(x.split('='))))
This gives me the following df
, with duplicated row indeces:
scaffold name source type start stop score strand phase 0 1
0 Tname16C00001.1 annotation remark 1 12018 . . . accession Tname16C00001.1
0 Tname16C00001.1 annotation remark 1 12018 . . . comment Annotations were generated from the MicroScope...
0 Tname16C00001.1 annotation remark 1 12018 . . . data_file_division BCT
0 Tname16C00001.1 annotation remark 1 12018 . . . date 05-NOV-2019
0 Tname16C00001.1 annotation remark 1 12018 . . . organism Genus Species Strain
0 Tname16C00001.1 annotation remark 1 12018 . . . source Genus Species Strain
0 Tname16C00001.1 annotation remark 1 12018 . . . topology linear
1 Tname16C00001.1 feature region 1 12018 . + . Is_circular false
1 Tname16C00001.1 feature region 1 12018 . + . Note whole genome shotgun linear WGS contig 1
1 Tname16C00001.1 feature region 1 12018 . + . db_xref taxon:1907535,MaGe/Organism_id:12155,MaGe/Spec...
1 Tname16C00001.1 feature region 1 12018 . + . mol_type genomic DNA
1 Tname16C00001.1 feature region 1 12018 . + . organism Candidatus Thiosymbion hypermnestrae
1 Tname16C00001.1 feature region 1 12018 . + . strain Strain
2 Tname16C00001.1 feature gene 943 1992 . + . locus_tag Tname16_v1_10001
...
Now, how do I transpose and summarize each "key-value" pair in column 0
and 1
, respectively, for each index? All empty cells can have NaN (that will be many).
My desired output should be:
scaffold name source type start stop score strand phase accession comment data_file_division date organism source topology Is_circular Note db_xref mol_type organism strain locus_tag
0 Tname16C00001.1 annotation remark 1 12018 . . . Tname16C00001.1 Annotations were generated from the MicroScope... BCT 05.Nov.19 Genus Species Strain Genus Species Strain linear NaN NaN NaN NaN NaN NaN NaN
1 Tname16C00001.1 feature region 1 12018 . + . NaN NaN NaN NaN NaN NaN NaN FALSE whole genome shotgun linear WGS contig 1 taxon:1907535,MaGe/Organism_id:12155,MaGe/Spec... genomic DNA Candidatus Thiosymbion hypermnestrae Strain NaN
2 Tname16C00001.1 feature gene 943 1992 . + . NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Tname16_v1_10001
...
You can create list of dictionaries in list comprehension, then new DataFrame
with changed columns names by DataFrame.add_prefix
for avoid duplicated columns names if added to original by DataFrame.join
:
L = [dict([y.split('=') for y in x.split(';')]) for x in df.pop('attr')]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('attr.'))
print (df)
scaffold name source type start stop score strand phase \
0 Tname16C00001.1 annotation remark 1 12018 . . .
1 Tname16C00001.1 feature region 1 12018 . + .
2 Tname16C00001.1 feature gene 943 1992 . + .
3 Tname16C00001.1 feature CDS 943 1992 . + 0
4 Tname16C00001.1 feature gene 1964 2179 . - .
5 Tname16C00001.1 feature CDS 1964 2179 . - 0
6 Tname16C00001.1 feature gene 2386 3897 . + .
7 Tname16C00001.1 feature CDS 2386 3897 . + 0
8 Tname16C00001.1 feature gene 3998 5152 . + .
9 Tname16C00001.1 feature CDS 3998 5152 . + 0
attr.accession attr.comment ... \
0 Tname16C00001.1 Annotations were generated from the MicroScope... ...
1 NaN NaN ...
2 NaN NaN ...
3 NaN NaN ...
4 NaN NaN ...
5 NaN NaN ...
6 NaN NaN ...
7 NaN NaN ...
8 NaN NaN ...
9 NaN NaN ...
attr.locus_tag attr.ID attr.inference \
0 NaN NaN NaN
1 NaN NaN NaN
2 Tname16_v1_10001 NaN NaN
3 Tname16_v1_10001 71429338 ab initio prediction:AMIGene:2.0
4 Tname16_v1_10002 NaN NaN
5 Tname16_v1_10002 71429339 ab initio prediction:AMIGene:2.0
6 Tname16_v1_10003 NaN NaN
7 Tname16_v1_10003 71429340 ab initio prediction:AMIGene:2.0
8 Tname16_v1_10004 NaN NaN
9 Tname16_v1_10004 71429341 ab initio prediction:AMIGene:2.0
attr.note \
0 NaN
1 NaN
2 NaN
3 Evidence 5 : Unknown function
4 NaN
5 Evidence 5 : Unknown function
6 NaN
7 NaN
8 NaN
9 Evidence 2a : Function from experimental evide...
attr.product attr.transl_table \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 protein of unknown function 11
4 NaN NaN
5 protein of unknown function 11
6 NaN NaN
7 UDP-glucose:undecaprenyl-phosphate glucose-1-p... 11
8 NaN NaN
9 UDP-N-acetyl glucosamine-2-epimerase 11
attr.translation attr.gene attr.ec_number \
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 MG NaN NaN
4 NaN NaN NaN
5 MI NaN NaN
6 NaN wcaJ NaN
7 MY wcaJ 2.7.8.31
8 NaN rffE NaN
9 MT rffE 5.1.3.14
attr.function
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 1.6.4 : Enterobacterial common antigen %28surf...
[10 rows x 30 columns]