Tags: python, text, nlp, nltk, text-analysis

Tokenization and lemmatization for TF-IDF over a set of .txt files using NLTK


I'm doing text analysis of Italian texts (tokenization, lemmatization) for later use of TF-IDF techniques and for building clusters on top of that. For preprocessing I use NLTK, and for a single text file everything works fine:

import string
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

it_stop_words = nltk.corpus.stopwords.words('italian')

lmtzr = WordNetLemmatizer()

with open('3003.txt', 'r' , encoding="latin-1") as myfile:
    data = myfile.read()

word_tokenized_list = nltk.tokenize.word_tokenize(data)

word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]

word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]

word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]

word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]

But I need to perform the same steps on a whole bunch of .txt files in a folder. For that I'm trying to use PlaintextCorpusReader():

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpusdir = 'reports/'

newcorpus = PlaintextCorpusReader(corpusdir, r'.*\.txt')

Basically I can't just feed newcorpus into the previous functions because it's a corpus object, not a string. So my questions are:

  1. What should the functions look like (or how should I change the existing single-file ones) to perform tokenization and lemmatization on a corpus of files (using PlaintextCorpusReader())?
  2. What would the TF-IDF approach (the standard sklearn vectorizer = TfidfVectorizer()) look like with PlaintextCorpusReader()?

Many Thanks!


Solution

  • I think your question can be answered by reading this question, this other one and the TfidfVectorizer docs. For completeness, I've wrapped up the answers below:


    First, you want to get the file ids; following the first question, you can get them as follows:

    ids = newcorpus.fileids()
    

    Then, based on the second question, you can retrieve the documents' words, sentences or paragraphs:

    doc_words = []
    doc_sents = []
    doc_paras = []
    for id_ in ids:
        # Get words
        doc_words.append(newcorpus.words(id_))
        # Get sentences
        doc_sents.append(newcorpus.sents(id_))
        # Get paragraphs
        doc_paras.append(newcorpus.paras(id_))
    

    Now, at the i-th position of doc_words, doc_sents and doc_paras you have all the words, sentences and paragraphs (respectively) of every document in the corpus.

    For tf-idf you probably just want the words. Since TfidfVectorizer.fit expects an iterable that yields str, unicode or file objects, you need to either join each document (an array of tokenized words) back into a single string, or use an approach similar to this one, which passes a dummy tokenizer to deal directly with arrays of words; a sketch of both options follows.
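
    For example, a minimal sketch of both options, assuming doc_words was built as above (the helper name dummy and the variable names here are illustrative, not from the original answer):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Option 1: join each document's tokens back into a single string
    docs_as_strings = [' '.join(words) for words in doc_words]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(docs_as_strings)

    # Option 2: a dummy tokenizer/preprocessor, so TfidfVectorizer accepts
    # the already-tokenized documents (lists of words) directly
    def dummy(doc):
        return doc

    vectorizer = TfidfVectorizer(tokenizer=dummy, preprocessor=dummy, lowercase=False)
    tfidf_matrix = vectorizer.fit_transform(doc_words)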

    You can also pass your own tokenizer to TfidfVectorizer and use PlaintextCorpusReader simply for file reading; a sketch of that variant follows.
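
    A minimal sketch of that variant, reusing the question's single-file pipeline as the tokenizer (the function name preprocess is mine; also note that WordNetLemmatizer is based on the English WordNet, so for Italian text you may want to swap in a different lemmatizer):

    import string
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    it_stop_words = nltk.corpus.stopwords.words('italian')
    lmtzr = WordNetLemmatizer()

    def preprocess(text):
        # Same steps as the single-file version, wrapped as a tokenizer
        tokens = nltk.tokenize.word_tokenize(text)
        tokens = [t.lower() for t in tokens if t not in string.punctuation]
        tokens = [t for t in tokens if t not in it_stop_words]
        tokens = [part for t in tokens for part in t.split("'")]
        return [lmtzr.lemmatize(t) for t in tokens]

    # PlaintextCorpusReader is used only to locate and read the raw files
    raw_docs = [newcorpus.raw(id_) for id_ in newcorpus.fileids()]
    vectorizer = TfidfVectorizer(tokenizer=preprocess, lowercase=False)
    tfidf_matrix = vectorizer.fit_transform(raw_docs)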