[SOLVED] extract pos_tag_sents from pandas series

Following the advice from the thread "How to apply pos_tag_sents() to pandas dataframe efficiently", I ran the code to identify the part-of-speech (POS) tags for the text in one of my variables.

Now that I have managed to create the column of interest, sub['POS'], how do I extract the relevant information (all the NN words) and create a column for each of them?

print(sub['POS'])

5     [(e-mail, JJ), (new, JJ), (delhi, NN), ((, (),...
4     [(bangladesh, JJ), (garment, NN), (unions, NNS...
41    [(listen, VB), (blaze, NN), (wrecks, NNS), (te...
10    [(11:49, CD), (am, VBP), (,, ,), (september, V...
17    [(listen, JJ), (two, CD), (events, NNS), (plan...

As output, I would like a new column (here 'NN') that contains all the NN words for each row.

import numpy as np
import pandas as pd

df = pd.DataFrame(["delhi",
                   "garment",
                   "blaze",
                   np.nan], columns=['NN'])

Solution

I am assuming you have one column in the dataframe where each row is a list of (word, tag) tuples; please correct me if I am wrong. From that column you want to create a new column for each tag. Would the following achieve what you want?

import pandas as pd
import numpy as np

df = pd.DataFrame({"line":[[('e-mail', 'JJ'), ('new', 'JJ'), ('delhi', 'NN')]]})

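def extract_pos(line, pos):
    # Keep the word from every (word, tag) tuple whose tag equals pos
    return [word for word, tag in line if tag == pos]

# One list-valued column per tag of interest
df['NN'] = [extract_pos(line, 'NN') for line in df['line']]
df['JJ'] = [extract_pos(line, 'JJ') for line in df['line']]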

This way you can add as many columns as you want. For the example row above, the NN column contains ['delhi'] and the JJ column contains ['e-mail', 'new'].
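If you need several tags, or a single noun per row with NaN where there is none (as in the desired output in the question), here is a minimal sketch of one way to do it. The toy rows, the tag list and the column name first_NN are assumptions made for illustration; swap in your own sub['POS'] column.

```python
import numpy as np
import pandas as pd

# Toy data shaped like sub['POS']: each row is a list of (word, tag) tuples.
# These rows are illustrative, not the asker's full data.
df = pd.DataFrame({
    "line": [
        [("e-mail", "JJ"), ("new", "JJ"), ("delhi", "NN")],
        [("bangladesh", "JJ"), ("garment", "NN"), ("unions", "NNS")],
        [("11:49", "CD"), ("am", "VBP")],
    ]
})

def extract_pos(line, pos):
    """Return every word in `line` whose tag equals `pos`."""
    return [word for word, tag in line if tag == pos]

# Build one list-valued column per tag in a loop instead of one line per tag.
for tag in ["NN", "JJ", "NNS"]:
    df[tag] = [extract_pos(line, tag) for line in df["line"]]

# Hypothetical extra column: the first NN of each row, NaN when the row has none.
df["first_NN"] = [words[0] if words else np.nan for words in df["NN"]]

print(df[["NN", "JJ", "first_NN"]])
```

Rows without a matching tag get an empty list in the per-tag columns, so the NaN only shows up in first_NN. If you would rather keep every noun, stay with the list columns.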

Hope this helps, Cheers