python, nltk, nltk-trainer

How to add a custom corpus to the local machine in nltk


I have created a custom corpus from data that I need to classify. The dataset is in the same format as the movie_reviews corpus. Following the nltk documentation, I use the code below to access the movie_reviews corpus. Is there any way to add a custom corpus to the nltk_data/corpora directory and access it the same way we access the existing corpora?

    import nltk
    from nltk.corpus import movie_reviews

    documents = [(list(movie_reviews.words(fileid)), category)
         for category in movie_reviews.categories()
         for fileid in movie_reviews.fileids(category)]

Solution

  • While you could hack the nltk to make your corpus look like a built-in nltk corpus, that's the wrong way to go about it. The nltk provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them to the nltk_data directory or hacking the nltk source. The nltk's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of equivalent built-in corpora.

    Let's see how the movie_reviews corpus is defined in nltk/corpus/__init__.py:

    movie_reviews = LazyCorpusLoader(
        'movie_reviews', CategorizedPlaintextCorpusReader,
        r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
        encoding='ascii')
    

    You can ignore the LazyCorpusLoader part; it only delays loading, so that the many built-in corpora your program never touches are not loaded at import time. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories by being in the subdirectories pos and neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing values as needed):

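    import nltk

    mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
        r"/home/user/path/to/my_corpus",   # corpus root; contains pos/ and neg/
        r'(?!\.).*\.txt',                  # every .txt file that is not hidden
        cat_pattern=r'(neg|pos)/.*',       # category = name of the subdirectory
        encoding="ascii")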
    

    That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), etc., just as if this was a corpus provided by the nltk.
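
To try this end to end without touching your real data, here is a minimal sketch that builds a tiny throwaway corpus with the same pos/neg layout and then runs the loop from the question against it. The sample texts, the file names and the temporary directory are made up for illustration; only the reader class and its arguments come from the answer above. The sketch sticks to words(), which uses a plain regular-expression tokenizer; sents() additionally needs the punkt tokenizer data to be downloaded.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical sample data: category subdirectory -> {file name: text}.
SAMPLE_REVIEWS = {
    "pos": {"p1.txt": "A warm, funny film.", "p2.txt": "Great acting."},
    "neg": {"n1.txt": "A dull, lifeless film."},
}

# Build the pos/ and neg/ layout in a throwaway directory
# (stand-in for /home/user/path/to/my_corpus).
corpus_root = tempfile.mkdtemp()
for category, files in SAMPLE_REVIEWS.items():
    os.makedirs(os.path.join(corpus_root, category))
    for name, text in files.items():
        path = os.path.join(corpus_root, category, name)
        with open(path, "w", encoding="ascii") as f:
            f.write(text)

# Same arguments as in the answer, pointed at the temporary root.
mycorpus = CategorizedPlaintextCorpusReader(
    corpus_root,
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding="ascii")

# The loop from the question, with movie_reviews swapped for mycorpus.
documents = [(list(mycorpus.words(fileid)), category)
             for category in mycorpus.categories()
             for fileid in mycorpus.fileids(category)]

print(mycorpus.categories())
print(documents[0])
```

The file ids returned by fileids() are paths relative to the corpus root (for example neg/n1.txt), which is what cat_pattern is matched against. Remove the temporary directory yourself (for example with shutil.rmtree) once you are done experimenting.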