sqlsql-serversubstringcharindex

SQL select substring which is before the specified words


I'm attempting to SELECT a substring from a larger string that should end before the words "STORE" or "PROTECT".

The problem is that there are a few scenarios in which the words "STORE" or "PROTECT" can appear before or after each other in a string, and we need to extract the substring that appears before either of those two words.

Example Strings:

  1. "UNLABELLEDa ABCD EFGH IJKLM FOO BAR, STORE AT 15-30°C." - Here there is no "PROTECT" but "STORE" word exists, so we should get a substring before "STORE".

  2. "STORE UP TO 30°C (86°F). PROTECT FROM MOISTURE" - In this string there is no string before first occurrence of a word "STORE" so substring will be BLANK

  3. "MEPO BKHGT FOO BARFOO BAGK VIAL. PROTECT FROM LIGHT, STORE BETWEEN X°C TO Y°C." - Query should return the substring before the word "PROTECT".

  4. "CARTON CONTAINING 1 X VIAL SB245063 100MG LYOPHILISED POWDER FOR RECONSTITUION. STORE AT 2°C - 8°C. PROTECT FROM LIGHT." - In this string "STORE" word is exist before "PROTECT", so query should return a substring before the word "STORE".

  5. NULL - when the string is null query should return a empty value instead of error.

SQL Query:

DECLARE @TestVariable AS VARCHAR(MAX)='UNLAB ABCFG DRETFG FOO BARFOO 3990MG TABLETS, STORE AT 15-30°C.'

SELECT case when CHARINDEX('STORE', upper(@TestVariable)) > CHARINDEX('PROTECT', upper(@TestVariable))
then SUBSTRING(@TestVariable, 1, CHARINDEX('PROTECT', upper(@TestVariable))-1)
else SUBSTRING(@TestVariable, 1, CHARINDEX('STORE', upper(@TestVariable))-1)
end pack_description

If there are no words in a string from "PROTECT" or "STORE" and the string value is NULL, the query returns an error.

What changes should be made to this query based on the examples above to meet our requirements?


Solution

  • SQL Server is a poor platform to do text parsing because of it's very limited pattern matching capabilities. This would be better done in an application that supports regular expressions (regexes), because regexes can better check for context like word boundaries with expressions like /\bSTORE\b/.

    However, if you need to do this in SQL server, you can add LEN() as a third position selector and LEAST() to chose the leftmost trim position. The NULLIF() function can also be used to map not-found 0 positions to null, so that they will be ignored. This also uses LEFT() instead of SUBSTRING().

    SELECT
        LEFT(D.String, LEAST(
            NULLIF(CHARINDEX('STORE', D.String), 0) - 1,
            NULLIF(CHARINDEX('PROTECT', D.String), 0) - 1,
            LEN(D.String)
            )) AS Result
    FROM Data D
    

    Note that LEAST() is a new function only available in SQL Server 2022 and later. (Keep reading for alternatives that work in earlier versions.)

    Purely for aesthetic reasons, a CROSS APPLY can be used to separate the position calculation from the final select.

    SELECT
         LEFT(D.String, A.TrimPos) AS Result
    FROM DATA D. 
    CROSS APPLY (
        SELECT LEAST(
            NULLIF(CHARINDEX('STORE', D.String), 0) - 1,
            NULLIF(CHARINDEX('PROTECT', D.String), 0) - 1,
            LEN(D.String)
            ) AS TrimPos
    ) A
    

    For earlier SQL versions that don't have the LEAST() function, use:

    ...
    CROSS APPLY (
        SELECT MIN(Pos) AS TrimPos
        FROM (
            VALUES
                (NULLIF(CHARINDEX('STORE', D.String), 0) - 1),
                (NULLIF(CHARINDEX('PROTECT', D.String), 0) - 1),
                (LEN(D.String))
        ) V(Pos)
    ) A
    

    Keywords can also be placed in a table and use a subselect to search for any keyword. The ISNULL() function is used to inject LEN() as the default trim position.

    SELECT
         LEFT(D.String, A.TrimPos) AS Result
    FROM Data D
    CROSS APPLY (
        SELECT ISNULL(MIN(P.Pos), LEN(D.String)) AS TrimPos
        FROM (
            SELECT NULLIF(CHARINDEX(K.Keyword, D.String), 0) - 1 AS Pos
            FROM Keywords K
        ) P
    ) A
    

    Finally, the current match logic can yield false hits if the text contains larger words like "protection" or "restorer" that partially match the keywords. This can be resolved by using PATINDEX() and prefixing the keywords with the not-a-letter pattern [^A-Z].

    SELECT
         LEFT(D.String, A.TrimPos) AS Result
    FROM Data D
    CROSS APPLY (
        SELECT ISNULL(MIN(P2.AdjustedPos), LEN(D.String)) AS TrimPos
        FROM (
            SELECT NULLIF(PATINDEX('%[^A-Z]' + K.Keyword + '[^A-Z]%', ' ' + D.String + ' '), 0) - 2 AS Pos
            FROM Keywords K
        ) P
        CROSS APPLY (
            SELECT CASE WHEN P.Pos < 0 THEN 0 ELSE P.Pos END AS AdjustedPos
        ) P2
    ) A
    

    Sample results

    Result Tail
    UNLABELLEDa ABCD EFGH IJKLM FOO BAR,  STORE AT 15-30°C.
    STORE UP TO 30°C (86°F). PROTECT FROM MOISTURE
    MEPO BKHGT FOO BARFOO BAGK VIAL.  PROTECT FROM LIGHT, STORE BETWEEN X°C TO Y°C.
    CARTON CONTAINING 1 X VIAL SB245063 100MG LYOPHILISED POWDER FOR RECONSTITUION.  STORE AT 2°C - 8°C. PROTECT FROM LIGHT.
    Just a description here
    Finish restorer
    Paint protection
    Bananas.  Discard after 90 days.
    null
    null null

    See this db<>fiddle for a demo of each of the above.