bash treetagger

Bash: Extract cells from output formatted as table


I am using TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) to extract nouns from a text. My problem is that the output is formatted like this:

word    pos     lemma

The     DT      the 
TreeTagger      NP      TreeTagger 
is      VBZ     be 
easy    JJ      easy 
to      TO      to 
use     VB      use 

There is apparently no option to output nouns only (tags "NP" and "NN"). With bash, how can I get the cells in the first column whose second column is "NP" or "NN"?


Solution

  • You can use awk for this:

    awk '$2 ~ /^N[PN]$/{print $1}' file
    
    TreeTagger
    

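    The regex /^N[PN]$/ matches a second field that is exactly NP or NN; the ^ and $ anchors keep it from matching longer tags that merely contain those letters.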

    As @Cyrus rightly commented below, you can also write the regex with alternation:

    awk '$2 ~ /^(NP|NN)$/ {print $1}' file
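
As a minimal sketch, the one-liner can be wrapped in a shell function so it works both on a saved file and in a pipeline. The function name extract_nouns is hypothetical and not part of TreeTagger. The sample rows are copied from the question, on the assumption that the columns are separated by tabs or spaces.

```shell
#!/usr/bin/env bash

# extract_nouns is a hypothetical helper name, not part of TreeTagger.
# It prints column 1 of every line whose column 2 is exactly NP or NN.
# It reads the files given as arguments, or standard input if there are none.
extract_nouns() {
    awk '$2 ~ /^(NP|NN)$/ { print $1 }' "$@"
}

# Sample rows copied from the question; columns are assumed to be
# separated by tabs or spaces (awk's default field splitting handles both).
# printf reuses the format string for each group of three arguments.
printf '%s\t%s\t%s\n' \
    word pos lemma \
    The DT the \
    TreeTagger NP TreeTagger \
    is VBZ be \
    easy JJ easy \
    to TO to \
    use VB use |
    extract_nouns
```

With the sample rows this prints TreeTagger, the same result as above. The header line is skipped without special handling because its second field, "pos", does not match the regex.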