I need to find frequency of sequences of data rows.I have like 17000 rows data which is including almost 120 different kind of data and I need to find which sequences of data how many times repeated?
For instance:
a
b
c
a
b
d
a
b
c
I am trying to find repetition order some people call it frequency sequence. So how many times aa and abc and ab and bc and abca and so on are there in this column? I mean I need to find how many times this data has same group of rows.
For this example it has 4 different data so there are a lot of combination of it. For calculation: C(4,1)*4!+C(4,2)*2!+C(4,3)*3!+C(4,4) different order and I need to calculate it for each order how many times are there?
My short part of real column data example:(each contiguous data is equal to a row)
3E010000
2010000
2010007
2010008
2010000
2010003
2010009
0201000A
0B01000C
2010002
3E010000
2010000
2010007
0B010014
2010009
0201000A
0B01000C
2010002
Now if you could check this whole main column for this group of data:
3E010000
2010000
2010007
and this
3E010000
2010000
and this
2010009
0201000A
0B01000C
2010002
and so on. You can see they were repeated more than once.
These rows repeat in the main first column and I am trying to find 1,2,3,4 and max 5 groups of orders out of 120 kind of data combination.
I am using Microsoft SQL Server 2014. But if it is not possible in Microsoft SQL Server, then you could give me any advice or other tools. Could you help me please? Thank you so much!
Output:
0B010009 ,0B010009,0B010009,2010005,2010005,2010005 2 9
0B010014 ,0B010014,0B010014,16010002,16010002,16010002 2 3
2010002,2010002,0201FFE0,0201FFE0 2 13
0B0114B5 ,0B0114B5,0B0114B5,2010002,2010002,2010002,2010004,2010004,2010004 3 3
070105B3 ,070105B3,070105B3,2010005,2010005,2010005,0201FFE1 ,0201FFE1,0201FFE1
3 2
3E010000 ,3E010000,3E010000,0B010010,0B010010,0B010010 ,0B01F61D ,0B01F61D,0B01F61D 3 6
3E010002 ,3E010002,3E010002,0B010013,0B010013,0B010013 ,0B01F80D ,0B01F80D,0B01F80D 3 3
0B010003 ,0B010003,2010006,2010006,0B01000A ,0B01000A,2010005,2010005 4 2
0B01FFE1 ,0B01FFE1,0B01FFE1,0B010013,0B010013,0B010013 ,0B01EAD0 ,0B01EAD0,0B01EAD0,0B010004,0B010004,0B010004 4 4
0B01000C ,0B01000C,0B01000C,0B01FCBD,0B01FCBD,0B01FCBD ,0701FFE0 ,0701FFE0,0701FFE0,0B01000A,0B01000A,0B01000A 4 5
Below query finds duplicate patterns for 2, 3, 4 and 5 repeating rows.
It uses 'LEAD' and 'HASHBYTES' functions.
Query works by computing a hash sequence for values in current row + following rows and then grouping on these hash values to find "duplicate" patterns. This process is done for each row.
Note: an ever increasing sequence column (to denote row position) i.e. ID is assumed.
CREATE TABLE #Data( ID INT IDENTITY PRIMARY KEY, Val VARCHAR( 20 ))
INSERT INTO #Data
VALUES
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '2010008' ), ( '2010000' ),
( '2010003' ), ( '2010009' ), ( '0201000A' ), ( '0B01000C' ), ( '2010002' ),
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '0B010014' ), ( '2010009' ),
( '0201000A' ), ( '0B01000C' ), ( '2010002' )
SELECT Pat3Rows, COUNT(*) AS Cnt
FROM(
SELECT *,
HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
HASHBYTES( 'MD5', Val + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
FROM #Data AS D1
) AS HashedGroups
GROUP BY Pat3Rows
HAVING COUNT(*) > 1
Note: there is a possibility, albeit extremely remote, of encountering hash collisions, so the above logic is not guaranteed to handle all theoretically possible cases. In summary I would not recommend using it if someone's life depends on the procedure to always be 100% accurate.
You did not specify how the output should look like so I will leave this up to you.
I have also tested this on my laptop with 18,000 rows and it produces a result in less than 1 second.
Sample use case:
;WITH DataHashed AS(
SELECT *,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
FROM #Data ),
RepeatingPatterns AS(
SELECT MIN( ID ) AS FirstRow, Pat2Rows AS PatternHash, 2 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat2Rows HAVING COUNT(*) > 1
UNION ALL
SELECT MIN( ID ) AS FirstRow, Pat3Rows, 3 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat3Rows HAVING COUNT(*) > 1
UNION ALL
SELECT MIN( ID ) AS FirstRow, Pat4Rows, 4 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat4Rows HAVING COUNT(*) > 1
UNION ALL
SELECT MIN( ID ) AS FirstRow, Pat5Rows, 5 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat5Rows HAVING COUNT(*) > 1
)
--SELECT * FROM RepeatingPatterns
SELECT
CONVERT( VARCHAR( 50 ), SUBSTRING(
( SELECT ',' + D.Val AS [text()]
FROM #Data AS D
WHERE RP.FirstRow <= D.ID AND D.ID < ( RP.FirstRow + RP.PatternSize )
ORDER BY D.ID
FOR XML PATH ('')
), 2, 1000 )) AS Pattern, CONVERT( VARCHAR( 35 ), PatternHash, 1 ) AS PatternHash, RP.PatternSize, Cnt
FROM RepeatingPatterns AS RP
Sample output:
Pattern PatternHash PatternSize Cnt
-------------------------------------------------- ----------------------------------- ----------- -----------
0201000A,0B01000C 0x499D8B1750A9BF57795B4D60D58DCF81 2 2
2010000,2010007 0x7EDE1E675D934F3035DACAC53F74DD14 2 2
3E010000,2010000 0x85FBFD817CFBB9BD08E983671EB594B7 2 2
2010009,0201000A 0x8E18E36B989BD859AF039238711A7F8C 2 2
0B01000C,2010002 0xF1EABB115FB3AEF2D162FB3EC7B6AFDA 2 2
0201000A,0B01000C,2010002 0x6DE203B38A13501881610133C1EDBF85 3 2
2010009,0201000A,0B01000C 0x9EB3ACFE8580A39FC530C7CA54830602 3 2
3E010000,2010000,2010007 0xE414661F54C985B7ED9FA82FF05C1219 3 2
2010009,0201000A,0B01000C,2010002 0x7FCDB748E37A6F6299AE8B269A4B0E49 4 2