Sorry if this has already been asked. I see a lot of similar questions but none exactly like this one.
I am trying to de-dup a large set (about 500 M) records:
Sample data:
CUST_ID PROD_TYPE VALUE DATE
------------------------------------
1 1 Y 5/1/2015 *
1 2 N 5/1/2015 *
1 1 N 5/2/2015 *
1 2 N 5/2/2015
1 1 Y 5/3/2015 *
1 2 Y 5/3/2015 *
1 1 Y 5/6/2015
1 2 N 5/6/2015 *
By CUST_ID
and PROD_TYPE
, I need to retain the initial records as well as any records having a changed VALUE
(the records with the asterisks). There can sometimes be gaps between the dates. There are around 5M unique CUST_ID
's.
Any help would be greatly appreciated.
Not sure why LAG isn't working for you, this returns your results:
with t as (
select 1 as CUST_ID, 1 as PROD_TYPE, 'Y' as VALUE, '5/1/2015' as [Date]
union
select 1, 2, 'N', '5/1/2015'
union
select 1, 1, 'N', '5/2/2015'
union
select 1, 2, 'N', '5/2/2015'
union
select 1,1, 'Y', '5/3/2015'
union
select 1, 2, 'Y','5/3/2015'
union
select 1,1, 'Y', '5/6/2015'
union
select 1, 2,'N','5/6/2015')
select
*,
case when
value <>
isnull(lag(value) over (partition by cust_id, prod_type order by [date]),'')
then 1 else 0
end as keep
from
t
order by
[date],
cust_id,
prod_type