sqlsql-serversql-server-2016

Strip numbers from string after certain characters into a recursive table


I have unstructured, free-form data, which requires cleansing to produce a clean list of all PO numbers in a column. Unfortunately, the column can accommodate multiple PO numbers in various formats (nightmare!!).

See test example:-

DECLARE @temp TABLE
(
    string VARCHAR(50)
)

INSERT INTO @temp (string)
VALUES 
    ('Po998 blah blah po1001'),
    ('PO 999'),
    ('PO 1000 12345 blah PO1002')

Ideally, I'm looking for a single table which holds the following:-

PO - All PO numbers will begin with the letters' PO' and vary in length.

  1. 998
  2. 999
  3. 1000
  4. 1001
  5. 1002

Note 12345 is not a PO and should not appear in the table.

Here is my current attempt to isolate the numerical values after the PO, but it does not give the correct results.

with
clean_stg as (
    select  *,
            ltrim(SUBSTRING(string,CHARINDEX('po',string)+2,len(string))) Val
    from @temp
),

clean_po as (
            select b.*,
            LEFT(b.Val,PATINDEX('%[^0-9]%', b.Val+'a')-1) PO
from clean_stg b
) 

select * from clean_po;

I would love to do this with REGEX in Python, but I need to keep this in the SQL Server environment. I think a recursive Common Table Expression (CTE) might be my answer, but I'm unsure how to structure it best.


Solution

  • I don't think you need recursion here, and the CTE is only useful in that it prevents having to repeat expressions.

    ;WITH x AS 
    (
      SELECT x = REPLACE(UPPER(string), 'PO ', 'PO') 
        FROM @temp
    ) 
    SELECT PO = TRY_CONVERT(int, SUBSTRING(value,3,255))
      FROM x 
     CROSS APPLY STRING_SPLIT(x, ' ')
     WHERE value LIKE 'PO[0-9]%'
     ORDER BY PO;
    

    And in fact you can do it without the CTE this way:

    SELECT PO = TRY_CONVERT(int, SUBSTRING(value,3,255))
      FROM @temp
     CROSS APPLY STRING_SPLIT(REPLACE(UPPER(string), 'PO ', 'PO') , ' ')
     WHERE value LIKE 'PO[0-9]%'
     ORDER BY PO;
    

    It wasn't mentioned but you may want to filter duplicates.

    You can do additional safety checks to filter out a value like PO3x5 instead of returning NULL, but I already think you should have better/cleaner starting data in the first place. Parsing messy, inconsistent strings for nuggets of info is a nightmare.