sqlsql-serversql-server-2016

Select substring of multiple rows with similar values


I am trying to select same substring of multiple rows without specifying the substring first.
Basically find the common part in a string. The common parts in a string range from 2 to 5 words.

Here is what I am trying to achieve:

The goal is to find the longest/last common denominator of multiple rows, creating a group / a set. The substring at the end of a string that is unique to a single row in a set should be trimmed.

Rows with that have same substring from left are considered part of a set.

"Item A 1" and "Item A 2" are part of "Item A" set.

"Item B 1" and "Item B 2" are part of "Item B" set.

The sets can include multiple parts from string.

"Item A A 1" and "Item A A 2" are part of "Item A A" set.

"Item A B 1" and "Item A B 2" are part of "Item A B" set.

Also value of the row might be just the name of a set.

"Item A" and "Item A 1" are part of "Item A" set.

"Item A B" and "Item A B 1" are part of "Item A B" set.

id item set
1 DEP Item 1 DEP Item
2 DEP Item 10 DEP Item
3 DEP Item 11 DEP Item
4 DEP Item 24 DEP Item
5 DEP G1 Item B 1 DEP G1 Item B
6 DEP G1 Item B 10 DEP G1 Item B
7 DEP G1 Item B 11 DEP G1 Item B
8 DEP G1 Item B 24 DEP G1 Item B
9 DEP A Item B 1 1 A DEP A Item B 1
10 DEP A Item B 1 10 DEP A Item B 1
11 DEP A Item B 1 11 B DEP A Item B 1
12 DEP A Item B 1 24 A B C DEP A Item B 1

Here is sample data and a query I've been trying, which sort of gives correct results, but not in the required way.

CREATE TABLE #temp (
    id INT,
    item NVARCHAR(50)
);


INSERT INTO #temp (id, item) VALUES
(1,'DEP Item 1'),
(2,'DEP Item 10'),
(3,'DEP Item 11'),
(4,'DEP Item 24'),
(5,'DEP G1 Item B 1'),
(6,'DEP G1 Item B 10'),
(7,'DEP G1 Item B 11'),
(8,'DEP G1 Item B 24'),
(9,'DEP A Item B 1 1 A'),
(10,'DEP A Item B 1 10'),
(11,'DEP A Item B 1 11 B'),
(12,'DEP A Item B 1 24 A B C')

select *,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 1 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1))
    end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 2 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1)+1))
    end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 4 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1)+1)+1))
    end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 5 THEN item
    ELSE LEFT(item, CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item,
        CHARINDEX(' ', item)+1)+1)+1)+1))
    end
from #temp

Solution

  • First, we make up all possible combinations of word sets for each id(item) where order of words is remains - this is path's.
    6, DEP G1 Item B 1 ->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,1)
    9, DEP G1 Item B 11->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,11)

    Then select rows with common path's - thru comparing path's for every item with all item's path's - select rows with inner self join on t1.path=t2.path and t1.id<>t2.id.

    And last - take greatest path's (order by count of words in path desc) for items.

    See example

    with r as(
      select id,1 n,item
         ,case when charindex(' ',item)>0 then  
            trim(substring(item,charindex(' ',item)+1,100))
          else ''
          end rest
         ,cast(case when charindex(' ',item)>0 then  
            trim(substring(item,1,charindex(' ',item)-1))
          else trim(item)
          end  as varchar(100)) path
    from #temp
      union all
      select id,n+1 n,item
         ,case when charindex(' ',rest)>0 then  
            trim(substring(rest,charindex(' ',rest)+1,100))
          else ''
          end rest
         ,cast(concat(path,','
           ,case when charindex(' ',rest)>0 then  
            trim(substring(rest,1,charindex(' ',rest)-1))
            else trim(rest)
            end) as varchar(100)) path
    from r where len(rest)>0
    )
    select s.path,n,t.id,t.item
    from #temp t 
    left join (
       select  t1.id,t1.path,t1.n
         ,row_number()over(partition by t1.id order by t1.n desc) rn
       from r t1
       inner join r t2 on t1.path=t2.path and t1.id<>t2.id
      ) s on s.rn=1 and s.id=t.id
    order by n,path,t.id
    

    output is

    path (set) n id item
    DEP 1 1 DEP
    DEP,Item 2 2 DEP Item 1
    DEP,Item 2 3 DEP Item 10
    DEP,Item 2 5 DEP Item 24
    DEP,G1,Item,B 4 6 DEP G1 Item B 1
    DEP,G1,Item,B 4 8 DEP G1 Item B 11
    DEP,G1,Item,B 4 9 DEP G1 Item B 24
    DEP,A,Item,B,1 5 11 DEP A Item B 1 10
    DEP,A,Item,B,1 5 13 DEP A Item B 1 24 A B C
    DEP,A,Item,B,1,1,A 7 10 DEP A Item B 1 1 A
    DEP,A,Item,B,1,1,A 7 14 DEP A Item B 1 1 A B C
    null null 15 Hello world

    There item id=15 not have common words with other rows.

    on test data

    CREATE TABLE #temp (id INT, item NVARCHAR(50));
    INSERT INTO #temp (id, item) VALUES
    (1,'DEP'),
    (2,'DEP     Item 1'),
    (3,'DEP Item 10'),
    --(4,'DEP Item 11'),
    (5,'DEP Item 24'),
    (6,'DEP G1 Item B 1'),
    --(7,'DEP G1 Item B 10'),
    (8,'DEP G1 Item B 11'),
    (9,'DEP G1 Item B 24'),
    (10,'DEP A Item B 1 1 A'),
    (11,'DEP A Item B 1 10'),
    --(12,'DEP A Item B 1 11 B'),
    (13,'DEP A Item B 1 24 A B C'),
    (14,'DEP A Item B 1 1 A B C'),
    (15,'Hello world')
    
    ;
    

    Fiddle