Stanford Parser for Python: Output Format


I am currently using the Python interface for the Stanford Parser.

    from nltk.parse.stanford import StanfordParser
    import os

    os.environ['STANFORD_PARSER'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
    os.environ['STANFORD_MODELS'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
    parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

    new = list(parser.raw_parse("The young man who boarded his usual train that Sunday afternoon was twenty-four years old and fat. "))
    print(new)

The output I get looks something like this:

    [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['young']), Tree('NN', ['man'])]), Tree('SBAR', [Tree('WHNP', [Tree('WP', ['who'])]), Tree('S', [Tree('VP', [Tree('VBD', ['boarded']), Tree('NP', [Tree('PRP$', ['his']), Tree('JJ', ['usual']), Tree('NN', ['train'])]), Tree('NP', [Tree('DT', ['that']), Tree('NNP', ['Sunday'])])])])])]), Tree('NP', [Tree('NN', ['afternoon'])]), Tree('VP', [Tree('VBD', ['was']), Tree('NP', [Tree('NP', [Tree('JJ', ['twenty-four']), Tree('NNS', ['years'])]), Tree('ADJP', [Tree('JJ', ['old']), Tree('CC', ['and']), Tree('JJ', ['fat'])])])]), Tree('.', ['.'])])])]

However, I only need the part-of-speech labels, so I'd like the output in a word/tag format.

In Java it is possible to specify -outputFormat 'wordsAndTags', which gives exactly what I want. Any hint on how to do this in Python?

Help would be GREATLY appreciated. Thanks!

PS: I tried the Stanford POSTagger, but it is far less accurate on some of the words I'm interested in.


Solution

  • If you look at the NLTK classes for the Stanford parser, you can see that the raw_parse_sents() method doesn't send the -outputFormat wordsAndTags option that you want; it sends -outputFormat penn instead. If you derive your own class from StanfordParser, you can override this method and specify the wordsAndTags format.

    from nltk.parse.stanford import StanfordParser
    
    class MyParser(StanfordParser):
    
        def raw_parse_sents(self, sentences, verbose=False):
            """
            Use StanfordParser to parse multiple sentences. Takes multiple sentences as a
            list of strings.
            Each sentence will be automatically tokenized and tagged by the Stanford Parser.
            The output format is `wordsAndTags`.
    
            :param sentences: Input sentences to parse
            :type sentences: list(str)
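            :rtype: iter(str)
            """
            cmd = [
                self._MAIN_CLASS,
                '-model', self.model_path,
                '-sentences', 'newline',
                '-outputFormat', 'wordsAndTags',
            ]
            # wordsAndTags output is plain word/tag text rather than bracketed
            # trees, so return the non-empty output lines as strings instead of
            # passing them to _parse_trees_output(), which expects Penn trees.
            output = self._execute(cmd, '\n'.join(sentences), verbose)
            return iter(line for line in output.splitlines() if line.strip())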
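
If subclassing feels heavy, the word/tag pairs can also be read off the Penn-style output the parser already returns. The following is only a minimal sketch, not part of NLTK: `penn_to_pairs` and `words_and_tags` are hypothetical helper names, and the sample string is a hand-copied fragment of the parse from the question. It assumes every leaf appears as a `(TAG word)` node.

```python
import re

# Matches a preterminal node such as "(DT The)": a tag followed by one word,
# with no nested brackets in between.
_PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def penn_to_pairs(penn):
    """Return (word, tag) pairs from a Penn-bracketed parse string."""
    return [(word, tag) for tag, word in _PRETERMINAL.findall(penn)]


def words_and_tags(pairs):
    """Join (word, tag) pairs into a 'word/tag word/tag ...' string."""
    return " ".join("{}/{}".format(word, tag) for word, tag in pairs)


# Hypothetical sample input: a bracketed fragment of the parse from the question.
sample = "(NP (NP (DT The) (JJ young) (NN man)) (SBAR (WHNP (WP who))))"
tagged = words_and_tags(penn_to_pairs(sample))
print(tagged)  # The/DT young/JJ man/NN who/WP
```

With the `Tree` objects from the question, the regex step should not be needed: NLTK's `Tree.pos()` returns the same (word, tag) tuples, so `words_and_tags(new[0].pos())` would give the word/tag string for the first parse.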