python nlp stanford-nlp tokenize

Only Get Tokenized Sentences as Output from Stanford Core NLP


I need to split text into sentences. I'm using the pycorenlp wrapper for Python 3. I started the server from my jar directory using:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Then I ran the following code:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')    
text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'
output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit', 'outputFormat': 'text'})
print(output)

which gave the following output:

Sentence #1 (8 tokens):
Pusheen and Smitha walked along the beach.
[Text=Pusheen CharacterOffsetBegin=0 CharacterOffsetEnd=7]
[Text=and CharacterOffsetBegin=8 CharacterOffsetEnd=11]
[Text=Smitha CharacterOffsetBegin=12 CharacterOffsetEnd=18]
[Text=walked CharacterOffsetBegin=19 CharacterOffsetEnd=25]
[Text=along CharacterOffsetBegin=26 CharacterOffsetEnd=31]
[Text=the CharacterOffsetBegin=32 CharacterOffsetEnd=35]
[Text=beach CharacterOffsetBegin=36 CharacterOffsetEnd=41]
[Text=. CharacterOffsetBegin=41 CharacterOffsetEnd=42]
Sentence #2 (11 tokens):
Pusheen wanted to surf, but fell off the surfboard.
[Text=Pusheen CharacterOffsetBegin=43 CharacterOffsetEnd=50]
[Text=wanted CharacterOffsetBegin=51 CharacterOffsetEnd=57]
[Text=to CharacterOffsetBegin=58 CharacterOffsetEnd=60]
[Text=surf CharacterOffsetBegin=61 CharacterOffsetEnd=65]
[Text=, CharacterOffsetBegin=65 CharacterOffsetEnd=66]
[Text=but CharacterOffsetBegin=67 CharacterOffsetEnd=70]
[Text=fell CharacterOffsetBegin=71 CharacterOffsetEnd=75]
[Text=off CharacterOffsetBegin=76 CharacterOffsetEnd=79]
[Text=the CharacterOffsetBegin=80 CharacterOffsetEnd=83]
[Text=surfboard CharacterOffsetBegin=84 CharacterOffsetEnd=93]
[Text=. CharacterOffsetBegin=93 CharacterOffsetEnd=94]

I need the output in the following format:

Pusheen and Smitha walked along the beach.
Pusheen wanted to surf, but fell off the surfboard.

Solution

  • Try the new "shiny" Stanford CoreNLP API in NLTK =)

    First:

    pip install -U nltk[corenlp]
    

    On command-line:

    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    

    Then in Python, the standard usage is:

    >>> from nltk.parse.corenlp import CoreNLPParser
    >>> stanford = CoreNLPParser('http://localhost:9000')
    >>> text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'
    
    # Gets you the tokens.
    >>> ' '.join(next(stanford.raw_parse(text)).leaves())
    u'Pusheen and Smitha walked along the beach . Pusheen wanted to surf , but fell off the surfboard .'
    
    # Gets you the Tree object.
    >>> next(stanford.raw_parse(text))
    Tree('ROOT', [Tree('S', [Tree('S', [Tree('NP', [Tree('NNP', ['Pusheen']), Tree('CC', ['and']), Tree('NNP', ['Smitha'])]), Tree('VP', [Tree('VBD', ['walked']), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['beach'])])])]), Tree('.', ['.'])]), Tree('NP', [Tree('NNP', ['Pusheen'])]), Tree('VP', [Tree('VP', [Tree('VBD', ['wanted']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NN', ['surf'])])])]), Tree(',', [',']), Tree('CC', ['but']), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['off'])]), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['surfboard'])])])]), Tree('.', ['.'])])])
    
    # Draws the tree in a pop-up window (needs Tkinter).
    >>> next(stanford.raw_parse(text)).draw()
    


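    To get each sentence back as a plain string, you'll need a little more work: ask the server for the JSON output and slice the original text using the token offsets.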

    >>> from nltk.parse.corenlp import CoreNLPParser
    >>> stanford = CoreNLPParser('http://localhost:9000')
    
    # Using the CoreNLPParser.api_call() function, ...
    >>> stanford.api_call
    <bound method CoreNLPParser.api_call of <nltk.parse.corenlp.CoreNLPParser object at 0x107131b90>>
    
    # ... , you can get the JSON output from the CoreNLP tool.
    >>> stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})
    {u'sentences': [{u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 7, u'characterOffsetBegin': 0, u'originalText': u'Pusheen', u'before': u''}, {u'index': 2, u'word': u'and', u'after': u' ', u'characterOffsetEnd': 11, u'characterOffsetBegin': 8, u'originalText': u'and', u'before': u' '}, {u'index': 3, u'word': u'Smitha', u'after': u' ', u'characterOffsetEnd': 18, u'characterOffsetBegin': 12, u'originalText': u'Smitha', u'before': u' '}, {u'index': 4, u'word': u'walked', u'after': u' ', u'characterOffsetEnd': 25, u'characterOffsetBegin': 19, u'originalText': u'walked', u'before': u' '}, {u'index': 5, u'word': u'along', u'after': u' ', u'characterOffsetEnd': 31, u'characterOffsetBegin': 26, u'originalText': u'along', u'before': u' '}, {u'index': 6, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 35, u'characterOffsetBegin': 32, u'originalText': u'the', u'before': u' '}, {u'index': 7, u'word': u'beach', u'after': u'', u'characterOffsetEnd': 41, u'characterOffsetBegin': 36, u'originalText': u'beach', u'before': u' '}, {u'index': 8, u'word': u'.', u'after': u' ', u'characterOffsetEnd': 42, u'characterOffsetBegin': 41, u'originalText': u'.', u'before': u''}], u'index': 0}, {u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 50, u'characterOffsetBegin': 43, u'originalText': u'Pusheen', u'before': u' '}, {u'index': 2, u'word': u'wanted', u'after': u' ', u'characterOffsetEnd': 57, u'characterOffsetBegin': 51, u'originalText': u'wanted', u'before': u' '}, {u'index': 3, u'word': u'to', u'after': u' ', u'characterOffsetEnd': 60, u'characterOffsetBegin': 58, u'originalText': u'to', u'before': u' '}, {u'index': 4, u'word': u'surf', u'after': u'', u'characterOffsetEnd': 65, u'characterOffsetBegin': 61, u'originalText': u'surf', u'before': u' '}, {u'index': 5, u'word': u',', u'after': u' ', u'characterOffsetEnd': 66, u'characterOffsetBegin': 65, u'originalText': u',', u'before': u''}, {u'index': 6, 
u'word': u'but', u'after': u' ', u'characterOffsetEnd': 70, u'characterOffsetBegin': 67, u'originalText': u'but', u'before': u' '}, {u'index': 7, u'word': u'fell', u'after': u' ', u'characterOffsetEnd': 75, u'characterOffsetBegin': 71, u'originalText': u'fell', u'before': u' '}, {u'index': 8, u'word': u'off', u'after': u' ', u'characterOffsetEnd': 79, u'characterOffsetBegin': 76, u'originalText': u'off', u'before': u' '}, {u'index': 9, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 83, u'characterOffsetBegin': 80, u'originalText': u'the', u'before': u' '}, {u'index': 10, u'word': u'surfboard', u'after': u'', u'characterOffsetEnd': 93, u'characterOffsetBegin': 84, u'originalText': u'surfboard', u'before': u' '}, {u'index': 11, u'word': u'.', u'after': u'', u'characterOffsetEnd': 94, u'characterOffsetBegin': 93, u'originalText': u'.', u'before': u''}], u'index': 1}]} 
    
    >>> output_json = stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})
    >>> len(output_json['sentences'])
    2
    >>> for sent in output_json['sentences']:
    ...     start_offset = sent['tokens'][0]['characterOffsetBegin'] # Begin offset of first token.
    ...     end_offset = sent['tokens'][-1]['characterOffsetEnd'] # End offset of last token.
    ...     sent_str = text[start_offset:end_offset]
    ...     print(sent_str)
    ... 
    Pusheen and Smitha walked along the beach.
    Pusheen wanted to surf, but fell off the surfboard.
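
The loop above can be wrapped in a small helper so it is easy to reuse. This is only a sketch: `split_sentences` is a hypothetical name, not part of NLTK or pycorenlp, and it assumes the response has the same shape as the JSON shown above (a `sentences` list whose items carry `tokens` with `characterOffsetBegin` and `characterOffsetEnd`). It also assumes the offsets index directly into the Python string, which may be worth checking for text containing characters outside the Basic Multilingual Plane. The sample response is hand-built and trimmed, so the snippet runs without a server.

```python
def split_sentences(text, annotation):
    """Return the sentences of `text` as plain strings.

    `annotation` is assumed to look like the CoreNLP JSON shown above:
    {'sentences': [{'tokens': [{'characterOffsetBegin': int,
                                'characterOffsetEnd': int}, ...]}, ...]}
    """
    sentences = []
    for sent in annotation.get('sentences', []):
        tokens = sent.get('tokens', [])
        if not tokens:
            continue  # Skip an empty sentence instead of raising IndexError.
        start = tokens[0]['characterOffsetBegin']   # Begin offset of first token.
        end = tokens[-1]['characterOffsetEnd']      # End offset of last token.
        sentences.append(text[start:end])
    return sentences


text = ('Pusheen and Smitha walked along the beach. '
        'Pusheen wanted to surf, but fell off the surfboard.')

# Hand-built stand-in for the server response, keeping only the first and
# last token of each sentence and only the two fields the helper reads.
sample_annotation = {
    'sentences': [
        {'index': 0, 'tokens': [
            {'characterOffsetBegin': 0, 'characterOffsetEnd': 7},
            {'characterOffsetBegin': 41, 'characterOffsetEnd': 42}]},
        {'index': 1, 'tokens': [
            {'characterOffsetBegin': 43, 'characterOffsetEnd': 50},
            {'characterOffsetBegin': 93, 'characterOffsetEnd': 94}]},
    ]
}

sentences = split_sentences(text, sample_annotation)
print('\n'.join(sentences))
```

With the server running, the real response can be passed in place of the sample, e.g. `split_sentences(text, stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'}))`. If you would rather stay with pycorenlp, calling `nlp.annotate` with `'outputFormat': 'json'` instead of `'text'` should give a similarly shaped dict, though that path is not checked here.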