oracle-databaseplsqlregexp-substr

Filter parts with prefixes of different length using REGEXP_SUBSTR


Basically, I have an input String in PL/SQL that looks like the following:

input varchar2(100) := R985AD5768N2

This string is composed of several information preceded by a certain prefix each and concatenating each block. The prefix list is variable, in this example it is R, D5, G and N. Information may contain numbers and characters as long as there is no character sequence that is a prefix. Also, some prefixes may not necessarily appear inside the input String. My goal is to use REGEXP_SUBSTR to filter the information of each block by its prefix, so having the following results from the example above:

R  -> 985A
D5 -> 768
G  -> NULL
N  -> 2

My general idea is to use a pattern of the form 'R([^R|D5|G|N]+)(R|D5|G|N)?' for R respectively, meaning I require my sequence to start with the character R, then using all following characters until the next prefix (or the end of the string) is reached. With the subexpr argument of REGEXP_SUBSTR I can then directly choose the second part, being the information I am looking for.

I tried several approaches:


REGEXP_SUBSTR(input, 'R([^R|D5|G|N]+)(R|D5|G|N)?', 1, 1, 'c', 1) -- and other prefixes accordingly

is fine except for R which results in only 98, I guess because 5 is appearing in the D5 prefix.

REGEXP_SUBSTR(input, 'R(\w+)(R|D5|G|N)', 1, 1, 'c', 1)

Using the positive approach, it will choose the longest approach, meaning the whole remainder of the input after its respective prefix.

I tried several variations of these, like replacing some brackets but still it seems like one key flaw is that the D5 prefix is not recognized as a literal in its respective bracket. I saw some approaches that use some lookahead parameters but sadly these are not supported, as far as I know.

Any ideas on how to clarify this?


Solution

  • If you have specific terms and you want to find the first occurrence of each of them then you can hard-code those prefixes and use a non-greedy match looking for either a trailing prefix or an end-of-string:

    SELECT value,
           REGEXP_SUBSTR(value, '^R(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1) AS r,
           REGEXP_SUBSTR(value, 'G(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1) AS g,
           REGEXP_SUBSTR(value, 'N(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1) AS n,
           REGEXP_SUBSTR(value, 'D5(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1) AS d5
    FROM   sample_data
    

    Which, for the sample data:

    CREATE TABLE sample_data (value) AS
      SELECT 'R985AD5768N2' FROM DUAL;
    

    Outputs:

    VALUE R G N D5
    R985AD5768N2 985A null 2 768

    Or, equivalently in PL/SQL:

    DECLARE
      value VARCHAR2(20) := 'R985AD5768N2';
      r  VARCHAR2(20) := REGEXP_SUBSTR(value, '^R(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1);
      g  VARCHAR2(20) := REGEXP_SUBSTR(value, 'G(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1);
      n  VARCHAR2(20) := REGEXP_SUBSTR(value, 'N(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1);
      d5 VARCHAR2(20) := REGEXP_SUBSTR(value, 'D5(.*?)(R|G|N|D5|$)', 1, 1, NULL, 1);
    BEGIN
      DBMS_OUTPUT.PUT_LINE( 'R = ' || r);
      DBMS_OUTPUT.PUT_LINE( 'G = ' || g);
      DBMS_OUTPUT.PUT_LINE( 'N = ' || n);
      DBMS_OUTPUT.PUT_LINE( 'D5 = ' || d5);
    END;
    /
    

    Outputs:

    R = 985A
    G = 
    N = 2
    D5 = 768
    

    If you want all the prefixes and suffixes in the order they occur then you can use a recursive query and look for only the prefixes and find the start and end positions of those and then the suffixes will be the sub-strings between them.

    WITH bounds (value, sspos, sepos, espos, eepos) AS (
      SELECT value,
             1,
             2,
             REGEXP_INSTR(value, 'R|G|N|D5', 2, 1, 0),
             REGEXP_INSTR(value, 'R|G|N|D5', 2, 1, 1)
      FROM   sample_data
      WHERE  value LIKE 'R%'
    UNION ALL
      SELECT value,
             espos,
             eepos,
             REGEXP_INSTR(value, 'R|G|N|D5', eepos, 1, 0),
             REGEXP_INSTR(value, 'R|G|N|D5', eepos, 1, 1)
      FROM   bounds
      WHERE  eepos > 0
    )
    SELECT SUBSTR(value, sspos, sepos - sspos) AS prefix,
           CASE WHEN espos = 0
           THEN SUBSTR(value, sepos)
           ELSE SUBSTR(value, sepos, espos - sepos)
           END AS suffix
    FROM   bounds
    

    Which outputs:

    PREFIX SUFFIX
    R 985A
    D5 768
    N 2

    fiddle