Tags: python, pandas, module, sentiment-analysis, vader

How to create a function that scores ngrams before unigrams in Python?


Let's assume I would like to score text with a dictionary called dictionary:

import pandas as pd

text = "I would like to reduce carbon emissions"

dictionary = pd.DataFrame({'text': ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"], 'score': [1, -1, -1, -1, 1]})

I would like to write a function that adds up the score of every term in dictionary that appears in text. There is one nuance, however: ngrams must take priority over unigrams.

Concretely, if I sum up the unigrams in dictionary that are in text, I get 1 + (-1) + (-1) + (-1) = -2, since like = 1, reduce = -1, carbon = -1 and emissions = -1. This is not what I want. The function must do the following:

  1. consider ngrams first (reduce carbon emissions in the example): if the set of matching ngrams is not empty, attribute the corresponding value to each of them; otherwise, if the set of ngrams is empty, consider unigrams;
  2. if the set of ngrams is non-empty, ignore the single words (unigrams) that are part of the selected ngrams (e.g. ignore "reduce", "carbon" and "emissions", which are already in "reduce carbon emissions").

Such a function should give me the output +2, since like = 1 and reduce carbon emissions = 1.

I am pretty new to Python and I am stuck. Can anyone help me with this?

Thanks!


Solution

  • I would sort the keywords by length in descending order, so that re is guaranteed to try ngrams before unigrams:

    import re
    
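    # Escape each term so that any regex metacharacters in the dictionary are matched literally
    pat = '|'.join(map(re.escape, sorted(dictionary.text, key=len, reverse=True)))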
    
    found = re.findall(fr'\b({pat})\b', text)
    

    Output:

    ['like', 'reduce carbon emissions']
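
The sort matters because Python's re tries the alternatives of a pattern from left to right and keeps the first one that matches, not the longest one. A minimal sketch of the difference, where find_terms is a hypothetical helper introduced only for this illustration:

```python
import re

text = "I would like to reduce carbon emissions"
terms = ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"]

def find_terms(terms, text):
    # Hypothetical helper: joins the terms in the order given.
    # re tries the alternatives left to right at each position.
    pat = "|".join(map(re.escape, terms))
    return re.findall(rf"\b(?:{pat})\b", text)

# Dictionary order: "reduce" is tried before "reduce carbon emissions",
# so the ngram never gets a chance to match.
unsorted_hits = find_terms(terms, text)

# Longest first: the ngram is tried first and consumes its three words.
sorted_hits = find_terms(sorted(terms, key=len, reverse=True), text)

print(unsorted_hits)  # ['like', 'reduce', 'carbon', 'emissions']
print(sorted_hits)    # ['like', 'reduce carbon emissions']
```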
    

    To get the expected output:

    scores = dictionary.set_index('text')['score']
    
    scores.reindex(found).sum()  # 2
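
Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is only a sketch: score_text is a name made up here, matching is case-sensitive, and a term that occurs several times in the text is counted once per occurrence, which may or may not be what you want.

```python
import re
import pandas as pd

def score_text(text, dictionary):
    """Sum the scores of dictionary terms found in text, longest terms first."""
    # Longest terms first, so an ngram is tried before the words it contains.
    terms = sorted(dictionary["text"], key=len, reverse=True)
    if not terms:
        return 0  # an empty alternation would match empty strings everywhere
    pat = "|".join(map(re.escape, terms))
    # findall scans left to right without overlap, so words consumed by an
    # ngram cannot be matched again as unigrams.
    found = re.findall(rf"\b(?:{pat})\b", text)
    scores = dictionary.set_index("text")["score"]
    return scores.reindex(found).sum()

text = "I would like to reduce carbon emissions"
dictionary = pd.DataFrame({
    "text": ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"],
    "score": [1, -1, -1, -1, 1],
})

print(score_text(text, dictionary))  # 2
```

Note that set_index('text') followed by reindex assumes each term appears only once in the dictionary; duplicated terms would need to be deduplicated or aggregated first.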