I have a pandas dataframe that consists of an id and an associated count of different encoded words. For instance:
Original = pd.DataFrame(data=[[1,'1:2,2:3,3:1'],[2,'2:2,4:3']], columns=['id','words'])
I have a dictionary that has the mapping to the actual words, for instance:
WordDict = {1:'A',2:'B',3:'C',4:'D'}
What I would like to do is create a new dataframe that maps the counts to columns for all possible words, so it would look something like:
Final =pd.DataFrame(data=[[1,2,3,1,0],[2,0,2,0,3]], columns=['id','A','B','C','D']).set_index('id')
I know I can split the 'words' column of the original into separate columns, and can create a dataframe from WordDict so that it has all possible columns, but could not figure out how to create the mapping.
You can use a regex, a list comprehension, and the DataFrame
constructor:
import re
Final = pd.DataFrame([{WordDict.get(int(k), None): v
for k,v in re.findall('([^:,]+):([^:,]+)', s)}
for s in Original['words']], index=Original['id']
).fillna(0).astype(int)
Or with split
:
Final = pd.DataFrame([{WordDict.get(int(k), None): v
for x in s.split(',')
for k,v in [x.split(':')]}
for s in Original['words']], index=Original['id']
).fillna(0).astype(int)
Or ast.literal_eval
:
from ast import literal_eval
Final = pd.DataFrame([{WordDict.get(k, None): v for k,v in literal_eval(f'{{{s}}}').items()}
for s in Original['words']], index=Original['id']
).fillna(0, downcast='infer')
Output:
A B C D
id
1 2 3 1 0
2 0 2 0 3