nlp ruta pega

How to match an optional number along with alphanumerics in a Ruta script


I am working on entity extraction in Pega. I have a requirement to match a policy number that has three parts:

1) An optional leading character 1

2) Two alphanumeric characters, optionally followed by a hyphen or a space

3) Three alphanumeric characters

So some examples of formats are:

AB-CDE, AB CDE, ABCDE, 1AB-CDE

23-456, 23 456, 23456, 123456

AB-2B4, AB-B2C, A1-2B4, 2A-34B, 12A-34B, 123-45C etc.

I am facing a problem whenever the policy number starts with 2 or 3 digits, or does not contain a space or hyphen.

For example: 12A-34B, 123-45C, 23456, 123456.

I have written the script below:

PACKAGE uima.ruta.example;
Document{-> RETAINTYPE(SPACE)};


("1")+? ((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,4)};

((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,3)};

This code works fine for patterns that contain a space or hyphen, such as AB-CDE, AB CDE, 1AB-CDE. It does not work if there is no space or hyphen, or if the pattern starts with 2 or 3 digits.

Please help me write the correct pattern. Thanks in advance.


Solution

  • The UIMA Ruta seed annotation NUM covers the whole number. Therefore, examples like 23456 and 123456 are single annotations that Ruta rules cannot split into sub-annotations. The same applies to the leading 12 in 12A-34B, so the literal "1" never matches there.

    A solution would be to use a pure regexp rule to annotate all the mentioned examples:

    "\\w{2,3}[\\-\\s]?\\w{2,3}" -> EntityType;
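To sanity-check the idea outside Pega, here is a minimal Python sketch. It is not the exact rule above but a stricter variant, offered as an assumption: an optional leading 1, exactly two alphanumerics, an optional hyphen or whitespace character, then exactly three alphanumerics. The reason for tightening it is that \w{2,3} also accepts underscores and strings such as ABC-DEF. The names POLICY, EXAMPLES and is_policy_number are made up for illustration.

```python
import re

# Assumption: a stricter variant of the answer's pattern, not the original rule.
# Optional "1", two alphanumerics, optional hyphen/whitespace, three alphanumerics.
POLICY = re.compile(r"1?[A-Za-z0-9]{2}[-\s]?[A-Za-z0-9]{3}")

# The sample policy numbers listed in the question.
EXAMPLES = [
    "AB-CDE", "AB CDE", "ABCDE", "1AB-CDE",
    "23-456", "23 456", "23456", "123456",
    "AB-2B4", "AB-B2C", "A1-2B4", "2A-34B", "12A-34B", "123-45C",
]


def is_policy_number(text: str) -> bool:
    """Return True if the whole string has the policy-number shape."""
    return POLICY.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in EXAMPLES:
        print(sample, is_policy_number(sample))
```

If the stricter pattern suits your data, it could presumably be carried over to the Ruta regexp rule with the backslashes doubled, as in the rule above, but that has not been tested in Pega here.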