I am using Ruby for this. FreeLing (an NLP tool) has a shallow parser that returns a string like the one below when I run it on the text "I just read the book, the grasshopper lies heavy".
a = <<EOT
S_[
sn-chunk_[
+(I i PRP -)
]
adv_[
+(just just RB -)
]
vb-chunk_[
+(read read VB -)
]
sn-chunk_[
(the the DT -)
+n-chunk_[
(book book NN -)
+n-chunk_[
+(The_Grasshopper_Lies_Heavy the_grasshopper_lies_heavy NP -)
]
]
]
st-brk_[
+(. . Fp -)
]
]
EOT
I want to get the following array from this:
["I", "just", "read", "the book The Grasshopper Lies Heavy","."]
(I want to merge all the words under the same subtree into a single array element.)
So far, I have written this much:
b = a.gsub(/.*\[/,'[').gsub(/.*\+?\((\w+|.) .*/,'\1').gsub(/\n| /,"").gsub("_"," ")
which returns
[[I][just][read][the[book[The Grasshopper Lies Heavy]]][.]]
So, how can I get the desired array from this?
Building on your solution so far, split the string on "][" and replace the leftover brackets with spaces:
result = a.gsub(/.*\[/,'[').gsub(/.*\+?\((\w+|.) .*/,'\1').gsub(/\n| /,"").gsub("_"," ")
result.split('][').map { |s| s.gsub(/\[|\]/, ' ').strip } # ["I", "just", "read", "the book The Grasshopper Lies Heavy", "."]
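If you would rather not depend on the exact bracket layout left over by the gsub chain, another option is to walk the parser output line by line and track the nesting depth, starting a new group at each direct child of the root node. This is only a rough sketch: chunk_words is a hypothetical helper name (not part of FreeLing), and it assumes every node opens with a line ending in "[", closes with a line containing only "]", and that multiword tokens are joined with underscores, as in your sample.

```ruby
# Hypothetical helper (not a FreeLing API): groups words by top-level chunk.
# Assumes one node or token per line, as in the sample output above.
def chunk_words(parse)
  chunks = []
  depth = 0
  parse.each_line do |line|
    line = line.strip
    if line.end_with?('[')
      depth += 1
      # Depth 2 is a direct child of the root S_[ node: start a new group.
      chunks << [] if depth == 2
    elsif line == ']'
      depth -= 1
    elsif depth >= 2 && (m = line.match(/\A\+?\((\S+) /))
      # The first field inside the parentheses is the surface form.
      chunks.last << m[1].tr('_', ' ')
    end
  end
  chunks.map { |words| words.join(' ') }
end

sample = <<~EOT
  S_[
    sn-chunk_[
      (the the DT -)
      +n-chunk_[
        (book book NN -)
      ]
    ]
    st-brk_[
      +(. . Fp -)
    ]
  ]
EOT

p chunk_words(sample) # => ["the book", "."]
```

Calling chunk_words(a) on the string from the question should give the array you asked for. The depth counter is what merges nested chunks: anything deeper than level 2 is appended to the group that was opened at level 2.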