NLTK reconstruct sentence from tokens


I have used NLTK to tokenise a sentence, but I would now like to reconstruct the sentence into a string. I've looked over the docs but can't see an obvious way to do this. Is this possible at all?

    tokens = [token.lower() for token in tokensCorrect]

Solution

  • When this answer was first written, NLTK provided no such function (see the edit at the end). Whitespace is thrown away during tokenization, so there is no way to get back exactly what you started with: the original may have contained newlines and runs of multiple spaces, and these cannot be recovered from the tokens. The best you can do is to join the tokens into a string that looks like a normal sentence. A simple " ".join(tokens) will put a space before and after all punctuation, which looks odd:

    >>> print(" ".join(tokens))
    This is a sentence .

    So you need to remove the space before most punctuation marks, except for a select few such as ( and `` (the Treebank-style opening quote), which should instead have the space after them removed. Even then it's sometimes guesswork, since the apostrophe ' can appear between words, before a word, or after one. ("Nuthin' doin', y'all!")

    My recommendation is to hold on to the original strings from which you tokenized the sentence, and go back to those. You don't show where your sentences come from, so there's not much more to say.
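One way to "hold on to the original" is to record each token's character offsets when you tokenize, so that any run of tokens can be mapped back to the exact source text. The sketch below is only an illustration: `tokenize_with_spans` is a hypothetical helper built on a plain regex, not an NLTK function, and its token boundaries will differ from NLTK's. Some NLTK tokenizers offer a `span_tokenize` method with a similar purpose; check the documentation for your version before relying on it.

```python
import re

def tokenize_with_spans(text):
    """Return (token, start, end) triples. Hypothetical stand-in for a real tokenizer."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

original = "Hello,   world!\nIt's  me."
spans = tokenize_with_spans(original)

# Process the tokens however you like (here: lowercase them).
tokens = [tok.lower() for tok, _, _ in spans]

# Map the first four tokens back to the original text, whitespace included.
start = spans[0][1]   # start offset of the first token
end = spans[3][2]     # end offset of the fourth token
recovered = original[start:end]
print(repr(recovered))
```

Because the offsets index into the untouched source string, the recovered slice keeps the original spacing and capitalisation even though the tokens themselves were lowercased.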

    Edit (April 2023):

    NLTK has since added a detokenizer that turns a list of tokens into a normally punctuated sentence. Note that you still can't count on getting back exactly what you started with:

    >>> from nltk.tokenize.treebank import TreebankWordDetokenizer
    >>> example = ['Here', 'I', 'have', 'a', 'sentence', ',', 'do', "n't", 'I', '?']
    >>> TreebankWordDetokenizer().detokenize(example)
    "Here I have a sentence, don't I?"
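To make the punctuation rules described earlier concrete, here is a hand-rolled sketch using only the standard library. It is not how NLTK does it: `naive_detokenize` is a hypothetical name, and the punctuation and clitic lists are assumptions chosen for illustration. It does not handle quotation marks (including `` and '') or free-standing apostrophes, which is exactly where the guesswork lies.

```python
import re

def naive_detokenize(tokens):
    """Join tokens with spaces, then tidy spacing around common punctuation.

    Hypothetical helper for illustration; quotes and stray apostrophes
    are deliberately left untouched.
    """
    text = " ".join(tokens)
    # Remove the space before closing punctuation: "sentence ." -> "sentence."
    text = re.sub(r"\s+([.,!?;:)\]}])", r"\1", text)
    # Remove the space after opening brackets: "( like" -> "(like"
    text = re.sub(r"([(\[{])\s+", r"\1", text)
    # Re-attach clitics that Treebank-style tokenizers split off: "do n't" -> "don't"
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text

example = ['Here', 'I', 'have', 'a', 'sentence', ',', 'do', "n't", 'I', '?']
print(naive_detokenize(example))
print(naive_detokenize(['This', 'is', 'a', 'sentence', '.']))
```

For anything beyond simple sentences the library detokenizer is likely the better choice, since it covers many more cases than these three substitutions.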