I am trying to use the Stanford POS tagger. Is it possible to parse (actually, POS tagging alone would be enough) an English text and output the results in CoNLL format? Is there such an option?
I am using the full 3.2.0 version of the Stanford POS tagger.
Thanks a lot
When it comes to the CoNLL format, I presume you mean the CoNLL-2000 chunking task format, like this:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
There are three columns in the CoNLL chunking task format:
token (i.e. word)
POS tag
BIO (begin, inside, outside) chunk/phrase tag
Sadly, if you use the Stanford MaxEnt tagger, it only gives you the token and POS information, with no BIO chunk information.
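To make the three-column layout concrete, here is a minimal Python sketch that reads lines in that format and groups the tokens into chunks using the BIO tags. The function names read_conll and chunks are made up for illustration and are not part of any Stanford tool, and it assumes the columns are separated by whitespace.

```python
def read_conll(lines):
    """Split 'token POS chunk' lines into 3-tuples, skipping other lines."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def chunks(rows):
    """Group tokens into (chunk_type, [tokens]) pairs based on BIO tags."""
    result = []
    current = None  # the chunk currently being extended, if any
    for token, _pos, tag in rows:
        kind = tag[2:]
        starts_new = tag.startswith("B-") or (
            # an I- tag with no matching open chunk is treated as a start
            tag.startswith("I-") and (current is None or current[0] != kind)
        )
        if starts_new:
            current = (kind, [token])
            result.append(current)
        elif tag.startswith("I-"):
            current[1].append(token)
        else:
            current = None  # "O": outside any chunk
    return result


sample = [
    "He PRP B-NP",
    "reckons VBZ B-VP",
    "the DT B-NP",
    "current JJ I-NP",
    "account NN I-NP",
    "deficit NN I-NP",
    ". . O",
]
for kind, tokens in chunks(read_conll(sample)):
    print(kind, " ".join(tokens))
```

Each B- tag opens a new chunk, each I- tag extends the open one, and O closes it, which is why the third column cannot be recovered from POS tags alone.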
java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat tsv 2> /dev/null
Using the above command, the Stanford POS tagger already gives you tab-separated output, just without the third column (see http://nlp.stanford.edu/software/pos-tagger-faq.shtml):
He PRP
reckons VBZ
the DT
...
To get the BIO column, you would need either a chunker or a parser; see http://www-nlp.stanford.edu/links/statnlp.html for a list of chunkers/parsers. If you want to stick with Stanford tools, I suggest the Stanford parser, but it gives you the bracketed parse format, which needs some post-processing to get into CoNLL-2000 format; see http://nlp.stanford.edu/software/lex-parser.shtml
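As a rough idea of what that post-processing could look like, here is a Python sketch that reads one bracketed parse and labels each token with the phrase directly above it. This is only a simple heuristic, not the official CoNLL-2000 conversion, so the chunks will not always match the shared-task data (for example, nested VPs each start a new chunk here). parse_brackets, to_bio and the OUTSIDE label set are hypothetical names chosen for this sketch, and it assumes the parse has a labelled root such as (ROOT ...).

```python
import re

# Clause-level labels: tokens directly under these get "O".
# This set is an assumption of the sketch, not a CoNLL-2000 rule.
OUTSIDE = {"ROOT", "S", "SBAR", "SINV", "SQ", "SBARQ", "FRAG"}


def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def to_bio(tree):
    """Return (token, POS, BIO tag) rows using each token's parent phrase."""
    leaves = []  # (token, POS, parent label, parent identity)

    def walk(node, parent):
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            # pre-terminal node such as (NN deficit)
            leaves.append((children[0], label, parent[0], id(parent)))
        else:
            for child in children:
                walk(child, node)

    walk(tree, ("ROOT", []))

    rows, prev_parent = [], None
    for token, pos_tag, phrase, parent_id in leaves:
        if phrase in OUTSIDE:
            tag, parent_id = "O", None
        elif parent_id == prev_parent:
            tag = "I-" + phrase  # same phrase node as the previous token
        else:
            tag = "B-" + phrase
        rows.append((token, pos_tag, tag))
        prev_parent = parent_id
    return rows


parse = "(ROOT (S (NP (PRP He)) (VP (VBZ reckons) (NP (DT the) (NN deficit))) (. .)))"
for row in to_bio(parse_brackets(parse)):
    print("\t".join(row))
```

For real data you would probably want a proper chunker or a tested conversion script instead, since flattening a full parse into chunks involves more rules than this.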