python, nltk, text-analysis, text-processing

Splitting words using the nltk module in Python


I am trying to find a way to split words in Python using the nltk module. I am unsure how to reach my goal given the raw data I have, which is a list of tokenized words, e.g.

['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']

As you can see, many words are stuck together (e.g. 'to' and 'produce' are fused into the single string 'toproduce'). This is an artifact of scraping the data from a PDF file, and I would like to find a way, using the nltk module in Python, to split the stuck-together words (e.g. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').

I appreciate any help!


Solution

  • I believe you will want word segmentation here, and I am not aware of any word-segmentation feature in NLTK that handles English text written without spaces. You could use pyenchant instead. I offer the following code only by way of example: it works for a modest number of relatively short strings, such as those in your example list, but it would be highly inefficient for longer or more numerous strings (a memoized sketch that scales better appears at the end of this answer). It would need modification, and it will not successfully segment every string in any case.

    import enchant  # pip install pyenchant
    eng_dict = enchant.Dict("en_US")
    
    def segment_str(chars, exclude=None):
        """
        Segment a string of chars using the pyenchant vocabulary.
        Keeps longest possible words that account for all characters,
        and returns list of segmented words.
    
        :param chars: (str) The character string to segment.
        :param exclude: (set) A set of strings to exclude from consideration.
                        (These have been found previously to lead to dead ends.)
                        If an excluded word occurs later in the string, this
                        function will fail.
        """
        words = []
    
        if not chars.isalpha():  # don't check punctuation etc.; needs more work
            return [chars]
    
        if not exclude:
            exclude = set()
    
        working_chars = chars
        while working_chars:
            # try segments of the chars from longest to shortest
            # (note that single-character segments are never tried)
            for i in range(len(working_chars), 1, -1):
                segment = working_chars[:i]
                if eng_dict.check(segment) and segment not in exclude:
                    words.append(segment)
                    working_chars = working_chars[i:]
                    break
            else:  # no matching segments were found
                if words:
                    exclude.add(words[-1])
                    return segment_str(chars, exclude=exclude)
                # let the user know a word was missing from the dictionary,
                # but keep the word
                print('"{chars}" not in dictionary (so just keeping as one segment)!'
                      .format(chars=chars))
                return [chars]
        # return a list of words based on the segmentation
        return words
    

    As you can see, this approach mis-segments only one of your strings ('Updatesampletrackingsystemsandprocess' comes out as 'Updates', 'ample', ... rather than 'Update', 'sample', ...):

    >>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
    >>> [segment_str(chars) for chars in t]
    "genotypes" not in dictionary (so just keeping as one segment)!
    [['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]
    

    You can then use itertools.chain to flatten this list of lists:

    >>> from itertools import chain
    >>> list(chain.from_iterable(segment_str(chars) for chars in t))
    "genotypes" not in dictionary (so just keeping as one segment)!
    ['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
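
    The exclude-and-retry recursion above can also blow up on longer strings, since one bad early choice forces the whole string to be re-segmented from scratch. A memoized word-break search keeps the same longest-match-first preference while capping the work at roughly n^2 dictionary lookups for a string of length n. The sketch below reuses eng_dict from above; segment_dp and its keep-the-word-whole fallback are my own choices for illustration, not an established API, and it is untested beyond strings like yours:

    from functools import lru_cache

    def segment_dp(chars):
        """Memoized word-break with the same longest-first preference as segment_str."""
        if not chars.isalpha():  # leave punctuation etc. untouched, as above
            return [chars]
        n = len(chars)

        @lru_cache(maxsize=None)
        def best(i):
            # return a segmentation of chars[i:] as a tuple of words, or None
            if i == n:
                return ()
            # longest candidate first; like segment_str, skip 1-char segments
            for j in range(n, i + 1, -1):
                if eng_dict.check(chars[i:j]):
                    rest = best(j)
                    if rest is not None:
                        return (chars[i:j],) + rest
            return None

        result = best(0)
        return list(result) if result is not None else [chars]  # keep unknown words whole

    As a further aside, the third-party wordsegment package (pip install wordsegment; not part of NLTK) does frequency-based segmentation and may handle cases like 'Updatesampletrackingsystemsandprocess' more gracefully, though note that it lowercases its output.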