I am using TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) to extract nouns from a text. My problem is that the output is formatted as such:
word pos lemma
The DT the
TreeTagger NP TreeTagger
is VBZ be
easy JJ easy
to TO to
use VB use
There is apparently no option to output nouns only (tags "NP" and "NN"). Using bash, how can I get the values in the first column for rows that have "NP" or "NN" in the second column?
You can use awk for this:
awk '$2 ~ /^N[PN]$/{print $1}' file
For the sample above, this prints:
TreeTagger
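The regex /^N[PN]$/ matches either NP or NN exactly: the bracket expression [PN] accepts a single P or N after the leading N, and the anchors ^ and $ prevent matches against longer tags such as NNS.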
As @Cyrus rightly commented below, you can also use alternation in the regex:
awk '$2 ~ /^(NP|NN)$/ {print $1}' file
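If the tagger output is tab-separated (an assumption here; the question does not say which delimiter is used), it may be safer to split on tabs explicitly so that a token containing a space stays in a single field. The following is only a sketch: the sample file is a temporary stand-in built from the rows in the question, and the variable names `sample` and `nouns` are made up for illustration. It also uses plain string comparison instead of a regex, which gives the same result for these two tags.

```shell
# Build a temporary sample file that mimics the rows from the question,
# assuming tab-separated columns: word, POS tag, lemma.
sample=$(mktemp)
printf 'The\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\neasy\tJJ\teasy\nto\tTO\tto\nuse\tVB\tuse\n' > "$sample"

# -F'\t' makes awk split fields on tabs only.
# Print the first column when the second column is exactly NP or NN.
nouns=$(awk -F'\t' '$2 == "NP" || $2 == "NN" {print $1}' "$sample")
printf '%s\n' "$nouns"

# Clean up the temporary file.
rm -f "$sample"
```

To run it on real data, replace "$sample" with the path to your own tagged file and drop the lines that create and delete the temporary file.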