sql sql-server sequence frequency mining

How can I count data groups sequential order in the same column in SQL Server 2014?

I need to find frequency of sequences of data rows.I have like 17000 rows data which is including almost 120 different kind of data and I need to find which sequences of data how many times repeated?

For instance:

a
b
c
a
b
d
a
b
c

I am trying to find repetition order some people call it frequency sequence. So how many times aa and abc and ab and bc and abca and so on are there in this column? I mean I need to find how many times this data has same group of rows.

For this example it has 4 different data so there are a lot of combination of it. For calculation: C(4,1)*4!+C(4,2)*2!+C(4,3)*3!+C(4,4) different order and I need to calculate it for each order how many times are there?

My short part of real column data example:(each contiguous data is equal to a row)

Now if you could check this whole main column for this group of data:

3E010000 
2010000
2010007

and this

3E010000 
2010000

and this

and so on. You can see they were repeated more than once.

These rows repeat in the main first column and I am trying to find 1,2,3,4 and max 5 groups of orders out of 120 kind of data combination.

I am using Microsoft SQL Server 2014. But if it is not possible in Microsoft SQL Server, then you could give me any advice or other tools. Could you help me please? Thank you so much!

Output:

0B010009 ,0B010009,0B010009,2010005,2010005,2010005     2   9

0B010014 ,0B010014,0B010014,16010002,16010002,16010002     2    3    

2010002,2010002,0201FFE0,0201FFE0       2   13    

0B0114B5 ,0B0114B5,0B0114B5,2010002,2010002,2010002,2010004,2010004,2010004    3    3    

070105B3 ,070105B3,070105B3,2010005,2010005,2010005,0201FFE1 ,0201FFE1,0201FFE1 
   3    2    

3E010000 ,3E010000,3E010000,0B010010,0B010010,0B010010 ,0B01F61D ,0B01F61D,0B01F61D     3   6    

3E010002 ,3E010002,3E010002,0B010013,0B010013,0B010013 ,0B01F80D ,0B01F80D,0B01F80D    3    3    

0B010003 ,0B010003,2010006,2010006,0B01000A ,0B01000A,2010005,2010005   4   2    

0B01FFE1 ,0B01FFE1,0B01FFE1,0B010013,0B010013,0B010013 ,0B01EAD0 ,0B01EAD0,0B01EAD0,0B010004,0B010004,0B010004     4    4     

0B01000C ,0B01000C,0B01000C,0B01FCBD,0B01FCBD,0B01FCBD ,0701FFE0 ,0701FFE0,0701FFE0,0B01000A,0B01000A,0B01000A     4    5

Solution

Below query finds duplicate patterns for 2, 3, 4 and 5 repeating rows.

It uses 'LEAD' and 'HASHBYTES' functions.

Query works by computing a hash sequence for values in current row + following rows and then grouping on these hash values to find "duplicate" patterns. This process is done for each row.

Note: an ever increasing sequence column (to denote row position) i.e. ID is assumed.

CREATE TABLE #Data( ID INT IDENTITY PRIMARY KEY, Val VARCHAR( 20 ))
INSERT INTO #Data
VALUES
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '2010008' ), ( '2010000' ),
( '2010003' ), ( '2010009' ), ( '0201000A' ), ( '0B01000C' ), ( '2010002' ),
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '0B010014' ), ( '2010009' ),
( '0201000A' ), ( '0B01000C' ), ( '2010002' )


SELECT Pat3Rows, COUNT(*) AS Cnt
FROM(
    SELECT *,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
        HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
    FROM #Data AS D1
) AS HashedGroups
GROUP BY Pat3Rows
HAVING COUNT(*) > 1

Note: there is a possibility, albeit extremely remote, of encountering hash collisions, so the above logic is not guaranteed to handle all theoretically possible cases. In summary I would not recommend using it if someone's life depends on the procedure to always be 100% accurate.

You did not specify how the output should look like so I will leave this up to you.

I have also tested this on my laptop with 18,000 rows and it produces a result in less than 1 second.

Sample use case:

;WITH DataHashed AS(
SELECT *,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' +  LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
    HASHBYTES( 'MD5', Val + ',' +  LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' +  LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + ',' +  LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
FROM #Data ),
RepeatingPatterns AS(
    SELECT MIN( ID ) AS FirstRow, Pat2Rows AS PatternHash, 2 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat2Rows HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN( ID ) AS FirstRow, Pat3Rows, 3 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat3Rows HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN( ID ) AS FirstRow, Pat4Rows, 4 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat4Rows HAVING COUNT(*) > 1
    UNION ALL
    SELECT MIN( ID ) AS FirstRow, Pat5Rows, 5 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat5Rows HAVING COUNT(*) > 1
)
--SELECT * FROM RepeatingPatterns
SELECT 
     CONVERT( VARCHAR( 50 ), SUBSTRING(
        ( SELECT ',' + D.Val  AS [text()]
        FROM #Data AS D
        WHERE RP.FirstRow <= D.ID AND D.ID < ( RP.FirstRow + RP.PatternSize )
        ORDER BY D.ID
        FOR XML PATH ('')
        ), 2, 1000 )) AS Pattern, CONVERT( VARCHAR( 35 ), PatternHash, 1 ) AS PatternHash, RP.PatternSize, Cnt
FROM RepeatingPatterns AS RP

Sample output:

Pattern                                            PatternHash                         PatternSize Cnt
-------------------------------------------------- ----------------------------------- ----------- -----------
0201000A,0B01000C                                  0x499D8B1750A9BF57795B4D60D58DCF81  2           2
2010000,2010007                                    0x7EDE1E675D934F3035DACAC53F74DD14  2           2
3E010000,2010000                                   0x85FBFD817CFBB9BD08E983671EB594B7  2           2
2010009,0201000A                                   0x8E18E36B989BD859AF039238711A7F8C  2           2
0B01000C,2010002                                   0xF1EABB115FB3AEF2D162FB3EC7B6AFDA  2           2
0201000A,0B01000C,2010002                          0x6DE203B38A13501881610133C1EDBF85  3           2
2010009,0201000A,0B01000C                          0x9EB3ACFE8580A39FC530C7CA54830602  3           2
3E010000,2010000,2010007                           0xE414661F54C985B7ED9FA82FF05C1219  3           2
2010009,0201000A,0B01000C,2010002                  0x7FCDB748E37A6F6299AE8B269A4B0E49  4           2