uima ruta

How can I declare # (excluding line breaks) for later use?


I use the wildcard # to skip text between rule elements.
However, I always mark per line, so I can use #{-CONTAINS(BREAK)}.
For example, in RuleElementA #{-CONTAINS(BREAK)} RuleElementB, both elements must be on a single line.
How can I declare/save #{-CONTAINS(BREAK)} so that I could later use just a shortcut, like
RuleElementA sc RuleElementB?


Solution

  • You should first annotate your building blocks (i.e. lines) and then create your target annotations based on them (the so-called bottom-up matching strategy in UIMA Ruta).

    To do so, you can annotate all the lines in the input document with a naive approach:

    DECLARE Line;
    ADDRETAINTYPE(BREAK);
    BREAK #{-> MARKONCE(Line)} @BREAK;
    REMOVERETAINTYPE(BREAK);
    

    This would allow you to remain on the line level while creating the target annotations. You could then iterate over all the Lines in the document in order to ensure the correctness of your span:

    BLOCK (forEach) Line{CONTAINS(W)}{
        RuleElementA # RuleElementB;
    }
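
    As a minimal sketch of how this could look with concrete rule elements: the block below is only an illustration, and KeyValue is a hypothetical type that is not part of the question. CW and NUM are Ruta's basic token types for capitalized words and numbers. Because rules inside the block are applied only within the current Line, a plain # should behave like #{-CONTAINS(BREAK)} here:

    // Assumes Line has already been declared and annotated as shown above.
    // KeyValue is a hypothetical target type, used only for illustration.
    DECLARE KeyValue;
    BLOCK(forEach) Line{CONTAINS(W)} {
        // Matching is restricted to the current Line, so the wildcard
        // cannot run past the end of the line.
        CW # NUM{-> MARK(KeyValue, 1, 3)};
    }

    Here MARK(KeyValue, 1, 3) creates one annotation spanning all three rule elements, from the capitalized word to the number.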
    

    Alternatively, you could use the PlainTextAnnotator, which is part of the default UIMA Ruta installation package. This approach usually gives more reliable line detection:

    ENGINE utils.PlainTextAnnotator;
    TYPESYSTEM utils.PlainTextTypeSystem;
    
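    // Run the annotator and keep its Line and EmptyLine annotations.
    EXEC(PlainTextAnnotator, {Line, EmptyLine});
    DECLARE FreeLine, LineFree;
    // Make whitespace (including line breaks) visible to the rules below.
    ADDRETAINTYPE(WS);
    // A Line directly after an EmptyLine is marked as FreeLine.
    EmptyLine Line{-> FreeLine};
    // A Line followed by one or two line breaks and an EmptyLine is marked as LineFree.
    Line{-> LineFree} BREAK[1,2] @EmptyLine;
    // Strip leading and trailing whitespace from the annotations.
    Line{-> TRIM(WS)};
    FreeLine{-> TRIM(WS)};
    LineFree{-> TRIM(WS)};
    REMOVERETAINTYPE(WS);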