I have unstructured, free-form data which needs cleansing to produce a clean list of all PO numbers from a single column. Unfortunately, one column value can contain multiple PO numbers in various formats (nightmare!!).
See test example:-
DECLARE @temp TABLE
(
    string VARCHAR(50)
);

INSERT INTO @temp (string)
VALUES
    ('Po998 blah blah po1001'),
    ('PO 999'),
    ('PO 1000 12345 blah PO1002');
Ideally, I'm looking for a single-column table which holds the following:-
PO - all PO numbers begin with the letters 'PO' and vary in length, so the expected output from the sample above is 998, 999, 1000, 1001 and 1002.
Note that 12345 is not a PO and should not appear in the table.
Here is my current attempt to isolate the numerical values after the PO, but it does not give the correct results.
WITH clean_stg AS
(
    -- take everything after the first 'PO' in the string
    -- (CHARINDEX only finds the first occurrence, so any later POs in the same row are lost)
    SELECT *,
           LTRIM(SUBSTRING(string, CHARINDEX('po', string) + 2, LEN(string))) AS Val
    FROM @temp
),
clean_po AS
(
    -- keep only the leading digits of what remains
    SELECT b.*,
           LEFT(b.Val, PATINDEX('%[^0-9]%', b.Val + 'a') - 1) AS PO
    FROM clean_stg b
)
SELECT * FROM clean_po;
I would love to do this with REGEX in Python, but I need to keep this in the SQL Server environment. I think a recursive Common Table Expression (CTE) might be my answer, but I'm unsure how to structure it best.
I don't think you need recursion here, and the CTE is only useful in that it prevents having to repeat expressions.
;WITH x AS
(
    -- uppercase everything and collapse "PO " to "PO" so each PO number becomes a single token
    SELECT x = REPLACE(UPPER(string), 'PO ', 'PO')
    FROM @temp
)
SELECT PO = TRY_CONVERT(int, SUBSTRING(value, 3, 255))  -- strip the "PO" prefix and convert the rest
FROM x
CROSS APPLY STRING_SPLIT(x, ' ')                        -- split each row into space-delimited tokens
WHERE value LIKE 'PO[0-9]%'                             -- keep only tokens that start with PO + a digit
ORDER BY PO;
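With the three sample rows above, this should return 998, 999, 1000, 1001 and 1002 as a single column, with 12345 excluded because its token doesn't start with PO.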
And in fact you can do it without the CTE this way:
SELECT PO = TRY_CONVERT(int, SUBSTRING(value, 3, 255))
FROM @temp
CROSS APPLY STRING_SPLIT(REPLACE(UPPER(string), 'PO ', 'PO'), ' ')
WHERE value LIKE 'PO[0-9]%'
ORDER BY PO;
It wasn't mentioned, but you may also want to filter out duplicates.
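A minimal sketch of that, assuming duplicate PO numbers should simply be collapsed, is to add DISTINCT to the CTE-free version:

SELECT DISTINCT PO = TRY_CONVERT(int, SUBSTRING(value, 3, 255))
FROM @temp
CROSS APPLY STRING_SPLIT(REPLACE(UPPER(string), 'PO ', 'PO'), ' ')
WHERE value LIKE 'PO[0-9]%'
ORDER BY PO;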
You can also add extra safety checks to filter out a value like PO3x5, which the queries above would return as NULL rather than skip; one way is sketched below. That said, I think you should push for better/cleaner starting data in the first place: parsing messy, inconsistent strings for nuggets of info is a nightmare.
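One way to do that check (a sketch, assuming a valid PO is strictly 'PO' followed by digits) is to reject any token that contains a non-digit after the prefix:

SELECT PO = TRY_CONVERT(int, SUBSTRING(value, 3, 255))
FROM @temp
CROSS APPLY STRING_SPLIT(REPLACE(UPPER(string), 'PO ', 'PO'), ' ')
WHERE value LIKE 'PO[0-9]%'
  AND value NOT LIKE 'PO%[^0-9]%'  -- drop tokens such as PO3x5 that have a non-digit after the prefix
ORDER BY PO;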