I have a document which contains sections such as Assessments, HPI, ROS, Vitals etc. I want to extract notes in each section. I am using GATE for this purpose. I have made a JAPE file which will extract notes in the Assessment section. Following is the grammar,
Input: Token
Options: control=appelt debug=true
Rule: Assess
({Token.string =~"(?i)diagnose[d]?"}{Token.string=="with"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)from"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)with"})
(
({Token})*
):assessments
({Token.string =~"(?i)HPI"} | {Token.string =~"(?i)ROS"} | {Token.string =~"(?i)EXAM"} | {Token.string =~"(?i)VITAL[S]"} | {Token.string =~"(?i)TREATMENT[s]"} |{Token.string=~"(?i)use[d]?"}{Token.string=~"(?i)orderset[s]?"} | {Token.string=~"$"})
-->
:assessments.Assessments = {}
Now, when the assessment section is in the end of the document I can retrieve the notes properly. But if it is somewhere between two sections then this will return entire document from assessment section till the end of file.
I have tried using {Token.string=~"$"} in different ways but could not extract ONLY THE ASSESSMENT SECTION IRRESPECTIVE OF ITS PLACE IN THE DOC.
Please explain how can I achieve this using JAPE grammar.
That is correct since Appelt mode always prefers the longest possible overall match. Since any Token can match string =~ "$"
the assessments
label will grab all but the final token in the document.
I would adopt a two pass approach, using an initial gazetteer or JAPE phase to annotate the "section headings" and then another phase with only these heading annotations in its input line
Imports: { import static gate.Utils.*; }
Phase: AnnotateBetweenHeadings
Input: Heading
Options: control = appelt
Rule: TwoHeadings
({Heading.type ="assessments"}):h1
(({Heading})?):h2
-->
{
Long endOffset = end(doc);
AnnotationSet h2Annots = bindings.get("h2");
if(h2Annots != null && !h2Annots.isEmpty()) {
endOffset = start(h2Annots);
}
outputAS.add(end(bindings.get("h1")), endOffset, "Assessments", featureMap());
}
This will annotate everything between the end of the assessments heading and the start of the following heading, or the end of the document if there is no following heading.