I am currently using the Python interface for the Stanford Parser.
from nltk.parse.stanford import StanfordParser
import os
os.environ['STANFORD_PARSER'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
os.environ['STANFORD_MODELS'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
new = list(parser.raw_parse("The young man who boarded his usual train that Sunday afternoon was twenty-four years old and fat. "))
print(new)
The output I get looks something like this:
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['young']), Tree('NN', ['man'])]), Tree('SBAR', [Tree('WHNP', [Tree('WP', ['who'])]), Tree('S', [Tree('VP', [Tree('VBD', ['boarded']), Tree('NP', [Tree('PRP$', ['his']), Tree('JJ', ['usual']), Tree('NN', ['train'])]), Tree('NP', [Tree('DT', ['that']), Tree('NNP', ['Sunday'])])])])])]), Tree('NP', [Tree('NN', ['afternoon'])]), Tree('VP', [Tree('VBD', ['was']), Tree('NP', [Tree('NP', [Tree('JJ', ['twenty-four']), Tree('NNS', ['years'])]), Tree('ADJP', [Tree('JJ', ['old']), Tree('CC', ['and']), Tree('JJ', ['fat'])])])]), Tree('.', ['.'])])])]
However, I only need the part-of-speech labels, so I'd like the output in a word/tag format.
In Java it is possible to specify -outputFormat 'wordsAndTags', which gives exactly what I want. Any hint on how to do this from Python?
Help would be GREATLY appreciated. Thanks!
PS: I tried the Stanford POSTagger, but it is far less accurate on some of the words I'm interested in.
If you look at the NLTK classes for the Stanford parser, you can see that the raw_parse_sents() method doesn't send the -outputFormat wordsAndTags option that you want, and instead sends -outputFormat Penn.
If you derive your own class from StanfordParser, you could override this method and specify the wordsAndTags format.
from nltk.parse.stanford import StanfordParser

class MyParser(StanfordParser):

    def raw_parse_sents(self, sentences, verbose=False):
        """
        Use StanfordParser to tag multiple sentences. Takes multiple sentences as a
        list of strings.
        Each sentence will be automatically tokenized and tagged by the Stanford Parser.
        The output format is `wordsAndTags`.

        :param sentences: Input sentences to parse
        :type sentences: list(str)
        :rtype: str
        """
        cmd = [
            self._MAIN_CLASS,
            '-model', self.model_path,
            '-sentences', 'newline',
            '-outputFormat', 'wordsAndTags',
        ]
        # wordsAndTags output is plain word/tag text, not bracketed trees, so
        # return it as-is instead of passing it to self._parse_trees_output(),
        # which expects Penn-style trees.
        return self._execute(cmd, '\n'.join(sentences), verbose)
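Text in wordsAndTags format still has to be split into pairs before you can work with it. Here is a minimal sketch of that step, assuming the parser prints each sentence on one line as whitespace-separated word/tag tokens. The helper name split_words_and_tags and the sample string are made up for illustration (the sample is a shortened version of the sentence in the question); neither is part of NLTK.

```python
def split_words_and_tags(output):
    """Turn word/tag text into one list of (word, tag) pairs per sentence.

    Assumes one sentence per line, tokens separated by whitespace.
    """
    sentences = []
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank separator lines
        # rsplit on the last '/' so a slash inside the word stays with the word
        sentences.append([tuple(tok.rsplit('/', 1)) for tok in tokens])
    return sentences


# Hypothetical sample of wordsAndTags-style text
sample = "The/DT young/JJ man/NN was/VBD fat/JJ ./.\n\n"
tagged = split_words_and_tags(sample)
print(tagged[0][:3])
# [('The', 'DT'), ('young', 'JJ'), ('man', 'NN')]
```

Splitting on the last slash rather than the first is a deliberate choice: the tag never contains a slash, but the word might.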