I am working on entity extraction in Pega. I have a requirement to match a policy number that has 3 parts:
1) An optional leading 1 as the first character of the policy number
2) Two alphanumeric characters, optionally followed by a hyphen or a space
3) Three alphanumeric characters
Some examples of the formats:
AB-CDE, AB CDE, ABCDE, 1AB-CDE
23-456, 23 456, 23456, 123456
AB-2B4, AB-B2C, A1-2B4, 2A-34B, 12A-34B, 123-45C etc.
I run into problems whenever the policy number starts with 2 or 3 digits, or when it has no space or hyphen.
For example: 12A-34B, 123-45C, 23456, 123456.
I have written the script below:
PACKAGE uima.ruta.example;
Document{-> RETAINTYPE(SPACE)};
("1")+? ((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,4)};
((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,3)};
This code works fine for patterns that contain a space or hyphen, such as AB-CDE, AB CDE, 1AB-CDE. It does not work when there is no space or hyphen, or when the pattern starts with 2 or 3 digits.
Please help me write the correct pattern. Thanks in advance.
The UIMA Ruta seed annotation NUM covers the whole number. Therefore, examples like 23456 and 123456 cannot be split into sub-annotations by Ruta, so a rule element such as "1" or a two-character NUM never matches inside them.
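The effect of the seed tokenization can be imitated outside Ruta. The following Python sketch is only an approximation: it assumes that each run of digits becomes one NUM token and each run of letters becomes one W token. The helper name seed_tokens is hypothetical and is not part of Ruta.

```python
import re

# Rough imitation of Ruta's seed tokenization (an assumption, not the real lexer):
# a run of digits -> one NUM, a run of letters -> one W, anything else -> its own token.
TOKEN = re.compile(r"\d+|[A-Za-z]+|\s+|.")

def seed_tokens(text):
    """Hypothetical helper: split text roughly the way seed annotations cover it."""
    return TOKEN.findall(text)

for example in ["1AB-CDE", "12A-34B", "123456"]:
    print(example, seed_tokens(example))
```

Under that assumption, 123456 is a single token and 12A starts with the token 12, so no rule element can match only the leading 1 or only the first two characters.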
A solution would be to use a pure regexp rule to annotate all the mentioned examples:
"\\w{2,3}[\\-\\s]?\\w{2,3}" -> EntityType;