sql-servert-sqlsql-server-2008-r2soundexpatindex

Optimal search string in the where clause


Want to search the string using PATINDEX and SOUNDEX within the WHERE clause or any optimal way.

I have the following table with some sample data to search the given string using PATINDEX and SOUNDEX.

create table tbl_pat_soundex
(
    col_str varchar(max)
);

insert into tbl_pat_soundex values('Smith A Steve');
insert into tbl_pat_soundex values('Steve A Smyth');
insert into tbl_pat_soundex values('A Smeeth Stive');
insert into tbl_pat_soundex values('Steve Smith A');
insert into tbl_pat_soundex values('Smit Steve A');

Note: I have 100 Millions of records in the table to search for.

String to search:- 'Smith A Steve'

SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND PATINDEX('%A%',col_str) >= 1 AND PATINDEX('%Steve%',col_str) >= 1

Getting Output:

col_str
--------------
Smith A Steve
Steve Smith A

Expected Output:

col_str         
----------------
Smith A Steve   
Steve A Smyth   
A Smeeth Stive  
Steve Smith A   
Smit Steve A    

Tried:

1:

SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%Smith%',col_str) >= 1 AND 
      PATINDEX('%A%',col_str) >= 1 AND 
      PATINDEX('%Steve%',col_str) >= 1

2:

SELECT col_str
FROM tbl_pat_soundex
WHERE PATINDEX('%'+SOUNDEX('Smith')+'%',SOUNDEX(col_str)) >= 1 AND 
      PATINDEX('%'+SOUNDEX('A')+'%',SOUNDEX(col_str)) >= 1 AND 
      PATINDEX('%'+SOUNDEX('Steve')+'%',SOUNDEX(col_str)) >= 1

3:

SELECT col_str
FROM tbl_pat_soundex    
WHERE DIFFERENCE('Smith',col_str) = 4 AND 
      DIFFERENCE('A',col_str) =4 AND 
      DIFFERENCE('Steve',col_str) = 4

4:

--Following was taking huge time(was kept running more than 20 minutes) to execute.
SELECT DISTINCT col_str
FROM tbl_pat_soundex [a]
CROSS APPLY SplitString([a].[col_str], ' ') [b]
WHERE DIFFERENCE([b].Item,'Smith') >= 1 AND 
      DIFFERENCE([b].Item,'A') >= 1 AND 
      DIFFERENCE([b].Item,'Steve') >= 1

Solution

  • With such a lot of rows the only hint I can give you is: Change the design. Each name part should live in a separate column...

    The following will work, but I promise it will be slow...

    --set up a test db

    USE master;
    GO
    CREATE DATABASE shnugo;
    GO
    USE shnugo;
    GO
    

    --your table, I added an ID-column

    create table tbl_pat_soundex
    (
        ID INT IDENTITY --needed to distinguish rows
       ,col_str varchar(max)
    );
    GO
    

    --A function, which will return a blank-separated string as a alphabetically sorted list of distinct soundex values separated by /: "Smith A Steve" comes back as /A000/S310/S530/

    CREATE FUNCTION dbo.ComputeSoundex(@str VARCHAR(MAX))
    RETURNS VARCHAR(MAX)
    AS
    BEGIN
        DECLARE @tmpXML XML=CAST('<x>' + REPLACE((SELECT @str AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML);
        RETURN (SELECT DISTINCT '/' + SOUNDEX(x.value('text()[1]','varchar(max)')) AS [se]
                FROM @tmpXML.nodes('/x[text()]') A(x)
                ORDER BY se
                FOR XML PATH(''),TYPE).value('.','nvarchar(max)') + '/';
    END
    GO
    

    --Add a column to store a computed soundex-chain permanently

    ALTER TABLE tbl_pat_soundex ADD SortedSoundExPattern VARCHAR(MAX);
    GO
    

    --We need a trigger to maintain the computed soundex-chain on any insert or update

    CREATE TRIGGER RefreshComputeSoundex ON tbl_pat_soundex
    FOR INSERT,UPDATE
    AS
    BEGIN
        UPDATE s SET SortedSoundExPattern=dbo.ComputeSoundex(i.col_str)
        FROM tbl_pat_soundex s
        INNER JOIN inserted i ON s.ID=i.ID;
    END
    GO
    

    --test data

    insert into tbl_pat_soundex(col_str) values
     ('Smith A Steve')
    ,('Steve A Smyth')
    ,('A Smeeth Stive')
    ,('Steve Smith A')
    ,('Smit Steve A')
    ,('Smit Steve') --no A
    ,('Smit A') --no Steve
    ,('Smit Smith Robert Peter A') --add noise
    ,('Shnugo'); --something else entirely
    

    --check the intermediate result

    SELECT * 
    FROM tbl_pat_soundex
    
    /*
    +----+---------------------------+-----------------------+
    | ID | col_str                   | SortedSoundExPattern  |
    +----+---------------------------+-----------------------+
    | 1  | Smith A Steve             | /A000/S310/S530/      |
    +----+---------------------------+-----------------------+
    | 2  | Steve A Smyth             | /A000/S310/S530/      |
    +----+---------------------------+-----------------------+
    | 3  | A Smeeth Stive            | /A000/S310/S530/      |
    +----+---------------------------+-----------------------+
    | 4  | Steve Smith A             | /A000/S310/S530/      |
    +----+---------------------------+-----------------------+
    | 5  | Smit Steve A              | /A000/S310/S530/      |
    +----+---------------------------+-----------------------+
    | 6  | Smit Steve                | /S310/S530/           |
    +----+---------------------------+-----------------------+
    | 7  | Smit A                    | /A000/S530/           |
    +----+---------------------------+-----------------------+
    | 8  | Smit Smith Robert Peter A | /A000/P360/R163/S530/ |
    +----+---------------------------+-----------------------+
    | 9  | Shnugo                    | /S520/                |
    +----+---------------------------+-----------------------+
    */
    

    --Now we can start to search:

    DECLARE @StringToSearch VARCHAR(MAX)=' A Steve';
    
    WITH SplittedSearchString AS
    (
        SELECT soundexCode.value('text()[1]','nvarchar(max)') AS SoundExCode
        FROM (SELECT CAST('<x>' + REPLACE(dbo.ComputeSoundex(@StringToSearch),'/','</x><x>') + '</x>' AS XML)) A(x)
        CROSS APPLY x.nodes('/x[text()]') B(soundexCode)
    )
    SELECT a.ID,col_str
    FROM tbl_pat_soundex a
    INNER JOIN SplittedSearchString s On SortedSoundExPattern LIKE '%/' +  s.SoundExCode + '/%'
    GROUP BY ID,col_str
    HAVING COUNT(ID)=(SELECT COUNT(*) FROM SplittedSearchString)
    ORDER BY ID 
    GO
    

    --clean-up

    USE master;
    GO
    DROP DATABASE shnugo;
    

    Short explanation

    This is how it works:

    And a final hint: If you want to search for an exact match, but you want to include different writings you can just directly compare the two strings. You might even place an index on the new column SortedSoundExPattern. Due to the way of creation all kinds of "Steven A Smith", "Steeven a Smit" and even in differing order like "Smith Steven A" will produce exactly the same pattern.