sqlsql-servert-sql

Find the first occurrence of a string that's NOT within a set of delimiters in SQL Server 2016+


I have a column in a SQL Server table that has strings of varying lengths. I need to find the position of the first occurrence of the string , -- that's not enclosed in single quotes or square brackets.

For example, in the following two strings, I've bolded the portion I would like to get the position of. Notice in the first string, the first time , -- appears on its own (without being between single quote or square bracket delimiters) is at position 13 and in the second string, it's at position 16.

'a, --'[, --]**, --**[, --]

[a, --b]aaaaaaa_ **, --**', --'

Also I should mention that , -- itself could appear multiple times in the string.

Here's a simple query that shows the strings and my desired output.

SELECT 
    t.string, t.desired_pos
FROM
    (VALUES (N'''a, --''[, --], --[, --]', 14),
            (N'[a, —-b]aaaaaaa_ , --'', --''', 18)) t(string, desired_pos)

Is there any way to accomplish this using a SELECT query (or multiple) without using a function?

Thank you in advance!

I've tried variations of SUBSTRING, CHARINDEX, and even some CROSS APPLYs but I can't seem to get the result I'm looking for.


Solution

  • Before I write down my solution, I must warn you: DON'T USE IT. Use a function, or do this in some other language. This code is probably buggy. It doesn't handle stuff like escaped quotes etc etc.

    The idea is to first remove the stuff inside brackets [] and quotes '' and then just do a "simple" charindex. To remove the brackets, I'm using a recursive CTE that loops ever part of matching quotes and replaces their content with placeholder strings.

    One important point is that quotes might be embedded in each other, so you have to try both variants and chose the one that is earliest.

    WITH CTE AS (
        SELECT  *
        FROM
        (VALUES (N'''a, --''[, --], --[, --]', 14),
                (N'[a, —-b]aaaaaaa_ , --'', --''', 18)) t(string, desired_pos)
               )
    , cte2 AS (
        select  x.start
        ,   x.finish
        ,   case when x.start > 0 THEN STUFF(string, x.start, x.finish - x.start + 1, REPLICATE('a', x.finish - x.start + 1)) ELSE string END AS newString
        ,   1 as level
        ,   string as orig
        ,   desired_pos
        from    cte
        CROSS APPLY (
            SELECT  *
            ,   ROW_NUMBER() OVER(ORDER BY case when start > 0 THEN 0 ELSE 1 END, start) AS sortorder
            FROM    (
                SELECT  charindex('[', string) AS start
                ,   charindex(']', string) AS finish
                UNION ALL
                SELECT  charindex('''', string) AS startQ
                ,   charindex('''', string, charindex('''', string) + 1) AS finishQ
            ) x
        ) x
        WHERE   x.sortorder = 1
        
        UNION ALL
        select  x.start
        ,   x.finish
        ,   STUFF(newString, x.start, x.finish - x.start + 1, REPLICATE('a', x.finish - x.start + 1))
        ,   1 as level
        ,   orig
        ,   desired_pos
        from    cte2
        CROSS APPLY (
            SELECT  *
            ,   ROW_NUMBER() OVER(ORDER BY case when start > 0 THEN 0 ELSE 1 END, start) AS sortorder
            FROM    (
                SELECT  charindex('[', newString) AS start
                ,   charindex(']', newString) AS finish
                UNION ALL
                SELECT  charindex('''', newString) AS startQ
                ,   charindex('''', newString, charindex('''', newString) + 1) AS finishQ
            ) x
        ) x
        WHERE   x.sortorder = 1
        AND x.start > 0
        AND cte2.start > 0 -- Must have been a match
    )
    
    SELECT  PATINDEX('%, --%', newString), *
    from (
        select *, row_number() over(partition by orig order by level desc) AS sort
        from cte2
        ) x
    where x.sort = 1