sqlsql-servert-sqlreplacesubstring

Split a string based on multiple variable substring


I am trying to clean some data in SQL Server (v15.0.18384.0) but cannot figure out how to split a string with a variable substring.

I have a column with multiple rows of single long strings. Each row is a different string but they follow something like:

"1.    aaaaaaaaaaaaaaaaaaaaaaaaa.2.    bbbbbbbbbbbbbbbbbbbbbbbbbbbb.3.    cccccccccccccccccccccccccccccccc.4.    dddddddddddddddddddddddddddddddd.5.    adfafasdfasf.[1][1]."

The aaaa, bbbb, ... values are sentences that may or may not have numbers in them as well. As you can guess, I'd like to break this long string into separate strings, broken apart by the "digit period" combo (the extra spaces in front are an annoyance too but I can trim that after I break the sentences apart).

Does anyone have any input on what I can do? I'd love nothing more than to have 5 new records where the other records in other columns repeat.

EX)

Column A Column B
Hello aaaaaaaaaaaaaaaaaaaaaaaa.
Hello bbbbbbbbbbbbbbbbbbbbbbbb.
Hello cccccccccccccccccccccccc.
Hello dddddddddddddddddddddddd.
Hello adfafasdfasf.[1][1].

My first instinct was to use string_split() but I do not believe I can use a wildcard or multiple values as the "replace this" sub-string...same with Replace(). I then searched online and found something similar to the following so I could get a single value (@) and then use that in the string_split():

Function [dbo].[RemoveNumbers](@Temp VarChar(8000))
Returns VarChar(8000)
AS
Begin

    Declare @FindNum as varchar(4) = '%[0-9].%' --Pretty sure this is just a literal
    While PatIndex(@FindNum, @Temp) != 0
        Set @Temp = Stuff(@Temp, PatIndex(@FindNum, @Temp), 2, '@')

    Return @Temp
End

but when I look at the results after I input the function into a query, nothing changed, i.e. the digit period combo was still there...no @'s in sight. I saw RegEx suggested as an option but not sure if that works in the SQL I'm using as I've gotten the "IDK what you're talking about" error. I'll do the nested replace if I have to but still unsure how to incorporate it so it iterates through the entire string and column.


Solution

  • You need to declare @FindNum with sufficient length to hold the pattern '%[0-9].%', which has 8 characters.

    Function [dbo].[RemoveNumbers](@Temp VarChar(8000))
    Returns VarChar(8000)
    AS
    Begin
    
        Declare @FindNum as varchar(8) = '%[0-9].%' --Pretty sure this is just a literal
        While PatIndex(@FindNum, @Temp) != 0
            Set @Temp = Stuff(@Temp, PatIndex(@FindNum, @Temp), 2, '@')
    
        Return @Temp
    End
    

    If you need to handle item numbers with two or more digits, you may need to add something like the following to consume the extra digits.

        Declare @ExtraDigits as varchar(8) = '%[0-9]@%'
        ...
        While PatIndex(@ExtraDigits, @Temp) != 0
            Set @Temp = Stuff(@Temp, PatIndex(@ExtraDigits, @Temp), 2, '@')
    

    You may still need to trim spaces and eliminate the leading empty element. If there is a chance that one of the strings to be extracted ends with a digit instead of a period (or any other character) you may have to address potential ambiguities. It would be much better to fix the data at the source.

    See this db<>fiddle for a demo.