
NLTK custom categorized corpus not reading files


I have created my own corpus, similar to the movie_reviews corpus in NLTK (categorized into neg and pos).

Within the neg and pos folders are txt files.

Code:

from nltk.corpus import CategorizedPlaintextCorpusReader

# Raw string so the backslash in the Windows path is not read as an escape
mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

When I try to read or interact with one of these files, I am unable to.

For example, len(mr.categories()) runs without error but produces no output:

>>>

I have read multiple documents and questions on here about custom categorized corpora, but I am still unable to use mine.

Full code:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

len(mr.categories())

I eventually want to run a Naive Bayes classifier on my data, but I am unable to read the content.

Paths: C:\mycorpus\pos

C:\mycorpus\neg

The pos folder contains 'cv.txt' and the neg folder contains 'example.txt'.


Solution

  • I am using Linux, and the following modification to your code (with toy corpus files) works correctly for me:

    import os

    from nltk.corpus import CategorizedPlaintextCorpusReader

    # Corpus root holding the neg/ and pos/ subfolders
    mr = CategorizedPlaintextCorpusReader(
        '/home/ely/programming/nltk-test/mycorpus',
        r'(?!\.).*\.txt',
        cat_pattern=os.path.join(r'(neg|pos)', '.*')
    )

    # Print the count explicitly so it shows up when run as a script
    print(len(mr.categories()))
    

    This suggests it is a problem with the cat_pattern string using / as a file system delimiter when you're on a Windows system.

    Using os.path.join as in my example, or pathlib if you are on Python 3, would be a good way to solve it: the code stays OS-agnostic and you don't trip over regular expression escape slashes mixed with file system delimiters.

    In fact you may want to use this approach for every file system delimiter in your argument strings. It is generally a good habit to get into for making code portable and avoiding strange string-munging tech debt.
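If it is still unclear which categories a given cat_pattern produces, it can help to test the pattern against file IDs outside NLTK. The sketch below uses only the standard library. It assumes the reader applies the pattern with re.match to each file ID and takes the first group as the category, which is my understanding of NLTK's behaviour rather than something verified here. The `categorize` helper and the file IDs are hypothetical, for illustration only.

```python
import ntpath
import posixpath
import re


def categorize(file_ids, cat_pattern):
    """Map each file ID to the first regex group of cat_pattern, when it matches."""
    categories = {}
    for file_id in file_ids:
        match = re.match(cat_pattern, file_id)
        if match:
            categories[file_id] = match.group(1)
    return categories


# Hypothetical file IDs, relative to the corpus root, written with forward slashes.
file_ids = ["neg/example.txt", "pos/cv.txt"]

# posixpath and ntpath are the platform-specific modules behind os.path,
# so these show what os.path.join builds on Linux and on Windows.
posix_pattern = posixpath.join(r"(neg|pos)", ".*")
windows_pattern = ntpath.join(r"(neg|pos)", ".*")

print(repr(posix_pattern))
print(repr(windows_pattern))
print(categorize(file_ids, posix_pattern))
```

Note that the pattern joined with the Windows separator contains a backslash, which the regex engine reads as an escape character, so it is worth printing the final pattern and checking it against your real file IDs (for example the output of mr.fileids()) before relying on it.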