python, pandas, porter-stemmer

Stemming words to compute a frequency plot


I need to plot word frequencies:

                2333
appartamento    321
casa            314
cè               54 
case             43
                ... 

However, some words share the same stem (and therefore have a similar meaning). In the example above, casa and case have the same meaning (the first is singular, the second is plural, like house and houses). I read that this issue can be fixed by using nltk.stem, so I tried the following:

from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

train_df = (df['Frasi'].str.replace(r'[^\w\s]+', '', regex=True)
            .str.split(' ').explode().value_counts())

porter = PorterStemmer()
lancaster=LancasterStemmer()

Now I should run a loop over each word in the list above, using porter and lancaster, but I do not know how to apply the stemmers to that list. To give some context: the list above comes from phrases/sentences saved in a dataframe. My dataframe has many columns, including a column Frasi, where those words come from. Some examples of the phrases in that column:

Frasi
Ho comprato un appartamento in centro
Il tuo appartamento è stupendo
Quanti vani ha la tua casa?
Il mercato immobiliare è in crisi
.... 

What I have tried is cleaning the sentences by removing punctuation and stop words (but it seems empty strings are still counted, as shown in the word list above). Now I need to use the word-frequency information to plot the top 10-20 words used, while merging words with a similar meaning or the same stem. Should I specify all the suffixes myself, or is there something I can use to automate the process?

Any help on this would be great.


Solution

  • Using NLTK

    Code

    import nltk                                 
    from nltk.tokenize import word_tokenize        # https://www.tutorialspoint.com/python_data_science/python_word_tokenization.htm
    from nltk.stem.snowball import SnowballStemmer # https://www.nltk.org/howto/stem.html
    from nltk.probability import FreqDist          # http://www.nltk.org/api/nltk.html?highlight=freqdist
    from nltk.corpus import stopwords              # https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
    
    def freq_dist(s, language):
        " Frequency count based upon language"
        # Language based stops words and stemmer
        fr_stopwords = stopwords.words(language) 
        fr_stemmer = SnowballStemmer(language) 
    
        # Language based tokenization
        words = word_tokenize(s, language = language)
    
        return FreqDist(fr_stemmer.stem(w) for w in words if w.isalnum() and not w in fr_stopwords)
    

    Explanation

    The initial data is in a Pandas DataFrame. Obtain the French column as a single string.

    s = '\n'.join(df['French'].tolist())
    

    The freq_dist function above performs the following steps on its input string.

    Tokenize string based upon language

    words = word_tokenize(s, language='french')
    

    Remove punctuation (e.g. " ? , . etc.)

    punctuation_removed = [w for w in words if w.isalnum()]
    

    Get French stopwords

    french_stopwords = set(stopwords.words('french')) # make set for faster lookup
    

    Remove stopwords

    without_stopwords = [w for w in punctuation_removed if w not in french_stopwords]
    

    Stem words (stemming also lowercases them)

    Get the French stemmer

    french_stemmer = SnowballStemmer('french')
    

    Apply the stemmer

    stemmed_words = [french_stemmer.stem(w) for w in without_stopwords]
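
    The stemmer lowercases while it strips suffixes, which is why the output below is all lower case. For instance, Quanti from the sample data:

    french_stemmer.stem('Quanti')  # -> 'quant': lowercased, suffix stripped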
    

    Frequency distribution using FreqDist

    fDist = FreqDist(stemmed_words)
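
    For the top 10-20 words the question asks about, FreqDist is a subclass of collections.Counter, so most_common gives them directly:

    top_20 = fDist.most_common(20)  # list of (stem, count) pairs, most frequent first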
    

    Example

    DataFrame:

                                          French
    0               Ho comprato un appartamento in centro
    1                      Il tuo appartamento è stupendo
    2                         Quanti vani ha la tua casa?
    3                   Il mercato immobiliare è in crisi
    4                                     Qui vivra verra
    5                        L’habit ne fait pas le moine
    6                         Chacun voit midi à sa porte
    7                      Mieux vaut prévenir que guérir
    8                Petit a petit, l’oiseau fait son nid
    9   Qui court deux lievres a la fois, n’en prend a...
    10                           Qui n’avance pas, recule
    11  Quand on a pas ce que l’on aime, il faut aimer...
    12  Il n’y a pas plus sourd que celui qui ne veut ...
    

    Generate string

    sentences = '\n'.join(df['French'].tolist())
    

    Generate Word Count

    counts = freq_dist(sentences, 'french')
    

    Show in alphabetical order

    results = sorted(counts.most_common(), 
                     key=lambda x: x[0])
    for k, v in results:
        print(k, v)
    
    a 5
    aim 2
    appartamento 2
    aucun 1
    avanc 1
    cas 1
    celui 1
    centro 1
    chacun 1
    comprato 1
    court 1
    cris 1
    deux 1
    entendr 1
    fait 2
    faut 1
    fois 1
    guer 1
    ha 1
    hab 1
    ho 1
    il 3
    immobiliar 1
    in 2
    l 1
    lievr 1
    mercato 1
    mid 1
    mieux 1
    moin 1
    nid 1
    oiseau 1
    pet 2
    plus 1
    port 1
    prend 1
    préven 1
    quand 1
    quant 1
    qui 3
    recul 1
    sourd 1
    stupendo 1
    tu 1
    tuo 1
    van 1
    vaut 1
    verr 1
    veut 1
    vivr 1
    voit 1
    è 2
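

    Plot

    Since the goal is a frequency plot, FreqDist comes with a built-in plot method (it renders through matplotlib, which must be installed). A minimal sketch:

    import matplotlib.pyplot as plt

    counts.plot(20, cumulative=False)  # line plot of the 20 most frequent stems
    plt.show()

    Alternatively, staying in pandas as in the question, the stems can be applied directly to the exploded word list and re-aggregated. A sketch, assuming the df and Frasi column from the question and the Italian Snowball stemmer (the sample phrases are Italian):

    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer('italian')
    stop = set(stopwords.words('italian'))

    words = (df['Frasi']
             .str.lower()                                # lowercase so stopword matching catches 'Il', 'La', ...
             .str.replace(r'[^\w\s]+', '', regex=True)   # drop punctuation
             .str.split()                                # no separator: whitespace runs produce no empty strings
             .explode()                                  # one word per row
             .dropna())
    words = words[~words.isin(stop)]

    counts_by_stem = words.map(stemmer.stem).value_counts()
    counts_by_stem.head(20).plot.barh(title='Top 20 stems')  # the requested top 10-20 words
    plt.show()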