python dataframe nlp pos-tagger

Count the occurrences of a POS tagging pattern


So I've applied POS tagging to one of the columns in my dataframe. For each sentence, I want to count the occurrences of this pattern: NNP, MD, VB.

For example, I have the following sentence: communications between the Principal and the Contractor shall be in the English language

The POS tagging would be: (communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN).

Notice that in the POS tagging result, the pattern (NNP, MD, VB) exists and occurs 1 time. I'd like to create a new column in the df for this number of occurrences.

Any ideas how I can do this?

Thanks in advance


Solution

  • A simple counter function does what you want: pull the tags out of the string, then slide a three-tag window over them and count the matches.

    Input:

    import pandas as pd

    df = pd.DataFrame({'POS':['(communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)', '(Contractor, NNP), (shall, MD), (be,VB), (communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)', '(and, CC), (the, DT)']})
    

    Function:

    def counter(pos):
        # Pull the tag out of each "(word, tag)" chunk of the string
        tags = []
        for item in pos.split('), ('):
            temp = item.strip(' )(')
            # The tag is whatever follows the last comma
            tags.append(temp.split(',')[-1].strip())
        # Slide a 3-tag window over the sequence and count exact matches
        pattern = ['NNP', 'MD', 'VB']
        count = 0
        for idx in range(len(tags) - 2):
            if tags[idx:idx+3] == pattern:
                count += 1
        return count
    

    Output:

    df['occ'] = df['POS'].apply(counter)
    df
    
                                                     POS  occ
    0  (communications, NNS), (between,IN), (the, DT)...    1
    1  (Contractor, NNP), (shall, MD), (be,VB), (comm...    2
    2                               (and, CC), (the, DT)    0
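
If the column still holds the tagger's raw output as a list of (word, tag) tuples (which is what `nltk.pos_tag` returns, for example) rather than a string, the parsing step can probably be skipped. Below is a minimal sketch under that assumption. The names `count_tag_pattern`, `df_tuples`, and the `tagged` column are made up for illustration, and the sample rows just reuse words from the question.

```python
import pandas as pd


def count_tag_pattern(tagged, pattern=("NNP", "MD", "VB")):
    """Count occurrences of `pattern` as consecutive tags in `tagged`.

    `tagged` is assumed to be a list of (word, tag) tuples.
    """
    tags = [tag for _, tag in tagged]
    target = list(pattern)
    n = len(target)
    # Compare every window of n consecutive tags against the pattern;
    # the range is empty when the sentence is shorter than the pattern.
    return sum(tags[i:i + n] == target for i in range(len(tags) - n + 1))


# Hypothetical frame whose cells are lists of (word, tag) tuples
df_tuples = pd.DataFrame({
    "tagged": [
        [("the", "DT"), ("Contractor", "NNP"), ("shall", "MD"), ("be", "VB")],
        [("and", "CC"), ("the", "DT")],
    ]
})
df_tuples["occ"] = df_tuples["tagged"].apply(count_tag_pattern)
```

Passing the pattern as an argument means the same function can count other tag sequences, e.g. `df_tuples["tagged"].apply(count_tag_pattern, pattern=("DT", "NNP"))`.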