I have a DataFrame which contains the following data (only 3 sample rows are shown here):
import pandas as pd

data = {'Department': ['D1', 'D2', 'D3'],
        'TopWords': [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                     [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                     [('manage', 2), ('ir', 2), ('mines', 2), ('implementation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}

# Create DataFrame
df = pd.DataFrame(data)
Basically, each row contains a list of tuples with the top 10 words and their frequencies for each Department.
I want to create a DataFrame where the department name is repeated and each row contains the word from the tuple in one column and its frequency count in another, so that it looks something like this:
Department Word Counts
D1 cat 6
D1 project 6
D1 dog 6
D1 develop 4
D1 smooth 4
D1 efficient 4
D1 administrative 4
D1 procedure 4
D1 establishment 3
D1 matter 3
D2 management 21
D2 satisfaction 12
D2 within 9
D2 budget 9
D2 township 9
Is there any way to do this type of conversion?
First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign + Series.str:
res = (
    df.explode('TopWords', ignore_index=True)          # one row per (word, count) tuple
      .assign(Word=lambda df: df['TopWords'].str[0],    # first tuple element -> word
              Counts=lambda df: df['TopWords'].str[1])  # second tuple element -> count
      .drop(columns='TopWords')
)
Output:
>>> res
Department Word Counts
0 D1 cat 6
1 D1 project 6
2 D1 dog 6
3 D1 develop 4
4 D1 smooth 4
5 D1 efficient 4
6 D1 administrative 4
7 D1 procedure 4
8 D1 establishment 3
9 D1 matter 3
10 D2 management 21
11 D2 satisfaction 12
12 D2 within 9
13 D2 budget 9
14 D2 township 9
15 D2 site 9
16 D2 periodic 9
17 D2 admin 9
18 D2 maintain 9
19 D2 guest 6
20 D3 manage 2
21 D3 ir 2
22 D3 mines 2
23 D3 implementation 2
24 D3 clrg 2
25 D3 act 2
26 D3 implementations 2
27 D3 office 2
28 D3 maintenance 2
29 D3 administration 2
As @sammywemmy suggested, if you are dealing with a considerable amount of data, it will be faster to wrangle it before loading it into a DataFrame.
Another way of doing it is with a nested loop that builds the records before constructing the DataFrame:
data = {'Department': ['D1', 'D2', 'D3'],
        'TopWords': [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                     [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                     [('manage', 2), ('ir', 2), ('mines', 2), ('implementation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
records = []
for idx, top_words_list in enumerate(data['TopWords']):
    # flatten each (word, count) tuple into its own record
    for word, count in top_words_list:
        rec = {
            'Department': data['Department'][idx],
            'Word': word,
            'Count': count
        }
        records.append(rec)

res = pd.DataFrame(records)
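For what it's worth, the same pre-wrangling step can also be written as a flat comprehension over zip(data['Department'], data['TopWords']); this is just a more compact sketch of the loop above, not a different technique:

records = [
    {'Department': dept, 'Word': word, 'Count': count}
    for dept, top_words in zip(data['Department'], data['TopWords'])
    for word, count in top_words
]
res = pd.DataFrame(records)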