
Create a vocabulary with POS tags


I would like to create a list of semantic entities (nouns, verbs, punctuation, etc.) using POS tagging. I am currently running the following code:

import spacy
import pandas as pd
    
nlp = spacy.load('en_core_web_sm',disable=['ner','textcat'])

def fun(text):
    doc = nlp(text)
    pos = ""
    for token in doc:
        pos += token.pos_ + " "
    return pos

df['S']= df.Text.apply(fun)

to capture the structure of each sentence. For example, given the column Text (see below), this code generates the column S, which holds the sequence of POS tags for each text:

Text                                                S
0   “I will meet quite a few people, it’s well...   PUNCT NOUN VERB VERB DET DET ADJ NOUN PUNCT PR...
1   Says “Cristiano Ronaldo’s family still owns”... VERB PUNCT PROPN PROPN PART NOUN ADV VERB PUNC...
2   Joe Biden plagiarized Donald Trump in his... PROPN PROPN VERB PROPN PROPN ADP DET PROP...

I am wondering whether I can create a vocabulary of nouns, verbs, determiners, adjectives, and so on by editing the code above, or whether I need a different approach. To collect all the entities (nouns, verbs, ...) in the dataframe, I would keep only the unique values, in order to create one list for each tag.

Example of output (it can also be in lists rather than in a dataframe):

PUNCT      NOUN        VERB         ....
“          I           will
,          people      meet
”          family      says
                       owns
                       plagiarized

Solution

  • You can try:

    import spacy
    import pandas as pd
    from pprint import pprint  # needed for the pprint(d) calls below
    nlp = spacy.load('en_core_web_sm',disable=['ner','textcat'])
    
    texts = ['"I will meet quite a few people, it\'s well', 
             'Says "Cristiano Ronaldo\'s family still owns"',
             'Joe Biden plagiarized Donald Trump in his...']
    
    df = pd.DataFrame({"Text":texts})
    
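    d = dict()  # maps each POS tag to the list of tokens seen with that tag
    def func(text):
        doc = nlp(text)
        for tok in doc:
            if tok.pos_ not in d:
                d[tok.pos_] = [tok.text]
            else:
                d[tok.pos_].append(tok.text)
                
    df.Text.apply(func)  # called only for its side effect on d; the result is ignored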
    
    pprint(d)
    

    {'ADJ': ['few'],
     'ADP': ['in'],
     'ADV': ['well', 'still'],
     'AUX': ["'s"],
     'DET': ['quite', 'a', 'his'],
     'NOUN': ['people', 'family'],
     'PART': ["'s"],
     'PRON': ['I', 'it'],
     'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
     'PUNCT': ['"', ',', '"', '"', '...'],
     'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
    

    Note that you don't need the pandas dependency at all:

    docs = nlp.pipe(texts)
    d = dict()
    for doc in docs:
        for tok in doc:
            if tok.pos_ not in d:
                d[tok.pos_] = [tok.text]
            else:
                d[tok.pos_].append(tok.text)
    pprint(d)
    

    {'ADJ': ['few'],
     'ADP': ['in'],
     'ADV': ['well', 'still'],
     'AUX': ["'s"],
     'DET': ['quite', 'a', 'his'],
     'NOUN': ['people', 'family'],
     'PART': ["'s"],
     'PRON': ['I', 'it'],
     'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
     'PUNCT': ['"', ',', '"', '"', '...'],
     'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
    

    Both versions collect every token under its POS tag, duplicates included.
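The grouping step can also be written with `collections.defaultdict`, which removes the `if ... not in d` branch. The following is a minimal sketch, not part of the original answer: the helper name `group_by_pos` and the hand-written `sample` pairs are hypothetical, and the pairs stand in for spaCy output so the snippet runs without a model installed.

```python
from collections import defaultdict

def group_by_pos(tagged_tokens):
    """Group token texts by POS tag, keeping duplicates and input order."""
    groups = defaultdict(list)  # a missing tag starts out as an empty list
    for text, pos in tagged_tokens:
        groups[pos].append(text)
    return dict(groups)

# Hypothetical (text, pos) pairs standing in for spaCy tokens.
# With spaCy, the same pairs could be produced by:
#   ((tok.text, tok.pos_) for doc in nlp.pipe(texts) for tok in doc)
sample = [("Joe", "PROPN"), ("Biden", "PROPN"), ("plagiarized", "VERB"), ("in", "ADP")]
grouped = group_by_pos(sample)
print(grouped)
# {'PROPN': ['Joe', 'Biden'], 'VERB': ['plagiarized'], 'ADP': ['in']}
```

Taking plain `(text, pos)` pairs keeps the grouping logic independent of the tagger, so it can be tested without loading a model.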

    If you only need a list of unique tokens:

    texts = ['"I will will meet quite a few people, it\'s well', 
             'Says "Cristiano Ronaldo\'s family still owns"',
             'Joe Biden plagiarized Donald Trump in his...']
    
    docs = nlp.pipe(texts)
    d = dict()
    for doc in docs:
        for tok in doc:
            if tok.pos_ not in d:
                d[tok.pos_] = [tok.text]
            elif tok.text not in d[tok.pos_]:
                d[tok.pos_].append(tok.text)
    pprint(d)
    

    {'ADJ': ['few'],
     'ADP': ['in'],
     'ADV': ['well', 'still'],
     'AUX': ["'s"],
     'DET': ['quite', 'a', 'his'],
     'NOUN': ['people', 'family'],
     'PART': ["'s"],
     'PRON': ['I', 'it'],
     'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
     'PUNCT': ['"', ',', '...'],
     'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
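The `tok.text not in d[tok.pos_]` check scans a list each time, which can get slow on a large corpus. The sketch below is one possible alternative, not the original answer's method: it deduplicates through dictionary keys, which preserve insertion order, and lays the result out in columns like the output requested in the question. The names `unique_by_pos`, `as_rows`, and the `lowercase` flag are hypothetical, and the `sample` pairs are hand-written stand-ins for spaCy tokens.

```python
from collections import defaultdict
from itertools import zip_longest

def unique_by_pos(tagged_tokens, lowercase=False):
    """Collect unique token texts per POS tag, in first-seen order."""
    seen = defaultdict(dict)  # dict keys act as an insertion-ordered set
    for text, pos in tagged_tokens:
        key = text.lower() if lowercase else text
        seen[pos].setdefault(key, None)
    return {pos: list(words) for pos, words in seen.items()}

def as_rows(vocab, fill=""):
    """Return a header row followed by rows padded to equal length."""
    header = list(vocab)
    rows = zip_longest(*vocab.values(), fillvalue=fill)
    return [header] + [list(row) for row in rows]

# Hypothetical (text, pos) pairs; with spaCy they could come from
#   ((tok.text, tok.pos_) for doc in nlp.pipe(texts) for tok in doc)
sample = [("will", "VERB"), ("will", "VERB"), ("meet", "VERB"),
          ("people", "NOUN"), ('"', "PUNCT"), ('"', "PUNCT"), ("Says", "VERB")]
vocab = unique_by_pos(sample)
table = as_rows(vocab)
for row in table:
    print("\t".join(row))
```

Passing `lowercase=True` would merge `Says` and `says` into one entry, as in the question's example output. If a dataframe is preferred, `pd.DataFrame({k: pd.Series(v) for k, v in vocab.items()})` should build one with the shorter columns padded with NaN.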