python-2.7nlpnltkstemminglemmatization

Lemmatizing Italian sentences for frequency counting


I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.

I am preferring lemmatizing than stemming because I could extract the word meaning from the context in the sentence (e.g. distinguish between a verb and a noun) and obtain words that exist in the language, rather than roots of those words that don't usually have a meaning.

I found out this library called pattern (pip2 install pattern) that should complement nltk in order to perform lemmatization of the Italian language, however I am not sure the approach below is correct because each word is lemmatized by itself, not in the context of a sentence.

Probably I should give pattern the responsibility to tokenize a sentence (so also annotating each word with the metadata regarding verbs/nouns/adjectives etc), then retrieving the lemmatized word, but I am not able to do this and I am not even sure it is possible at the moment?

Also: in Italian some articles are rendered with an apostrophe so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I am not able to find a way to split these 2 words with a combination of nltk and pattern so then I am not able to count the frequency of the words in the correct way.

import nltk
import string
import pattern

# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function is just to get the lemma
# out of the original input word (but right now
# it may be loosing the context about the sentence
# from where the word is coming from i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word#.decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word, 
        tokenize=False,  
        tag=False,  
        chunk=False,  
        lemmata=True 
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."

# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))

# 2nd remove punctuation and everything lower case
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))

# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))

# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))

# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))

# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)

Gives this output:

1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)

Solution

  • I'll try to answer your question, knowing that I don't know a lot about italian!

    1) As far as I know, the main responsibility for removing apostrophe is the tokenizer, and as such the nltk italian tokenizer seems to have failed.

    3) A simple thing you can do about it is call the replace method (although you probably will have to use the re package for more complicated pattern), an example:

    word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
    word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
    

    It yields:

    ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
    

    2) An alternative to pattern would be treetagger, granted it is not the easiest install of all (you need the python package and the tool itself, however after this part it works on windows and Linux).

    A simple example with your example above:

    import treetaggerwrapper 
    from pprint import pprint
    
    it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
    tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
    tags = tagger.tag_text(it_string)
    pprint(treetaggerwrapper.make_tags(tags))
    

    The pprint yields:

    [Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
     Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
     Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
     Tag(word=u'in', pos=u'PRE', lemma=u'in'),
     Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
     Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
     Tag(word=u'.', pos=u'SENT', lemma=u'.'),
     Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
     Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
     Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
     Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
     Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
     Tag(word=u'.', pos=u'SENT', lemma=u'.'),
     Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
     Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
     Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
     Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
     Tag(word=u'con', pos=u'PRE', lemma=u'con'),
     Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
     Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
     Tag(word=u'.', pos=u'SENT', lemma=u'.')]
    

    It also tokenized pretty nicely the all'ippodromo to al and ippodromo (which is hopefully correct) under the hood before lemmatizing. Now we just need to apply the removal of stop words and punctuation and it will be fine.

    The doc for installing the TreeTaggerWrapper library for python