I have been analyzing this data over the past few months and have observed that identifying duplicates is quite time-intensive. The duplicates are determined based on five columns, including separate First Name and Last Name columns. For your reference, I have highlighted these column headers in yellow.
The criteria I apply when removing duplicate entries are as follows:
If the "Company Name" is a duplicate, or if the First and Last Names are not duplicates, even if Address and Phone are duplicates—the row is not considered a duplicate overall. This applies to all five columns.
If any cell is blank but the corresponding row shows duplicates in other columns, the blank one will be removed.
If all five columns indicate a duplicate, then one instance is retained, and the others are removed.
I have updated all examples in Google sheets where Green highlighted rows indicate duplicate and I have added notes in the last column which specifies which rows should be removed and which one should keep.
Output should be numeric in a helper column:
2 → This row is a duplicate group’s “winner” (keep this).
1 → This row is a duplicate (delete this).
0 → This row is unique (not a duplicate).
This table shows sample data in columns A:J
, with the desired result in the last column:
Index | Company | First | Last | Addr1 | Addr2 | City | State | Zip | Phone | desired result |
---|---|---|---|---|---|---|---|---|---|---|
I1 | Com1 | AA | BB | Add1 | AAAA | BBBB | 9 | 2 | ||
I2 | Com1 | AA | BB | AAAA | BBBB | 9 | 1 | |||
I3 | Com2 | UU | KK | Add2 | UUUU | KKKK | 10 | 0 | ||
I4 | Com2 | UU | SS | Add2 | UUUU | SSSS | 10 | 0 | ||
I7 | Com3 | LL | MM | Add6 | LLLL | MMMM | 14 | 0 | ||
I8 | Com4 | GG | HH | Add7 | GGGG | HHHH | 15 | 0 | ||
I9 | Com5 | PP | LL | Add8 | PPPP | LLLL | 16 | 0 | ||
I10 | Com5 | Add7 | 16 | 0 | ||||||
I13 | Com5 | PP | LL | Add12 | PPPP | LLLL | 16 | 0 | ||
I11 | Com7 | Add10 | 18 | 2 | ||||||
I12 | Com7 | 18 | 1 | |||||||
I14 | Add13 | 21 | 0 | |||||||
I15 | AA | BB | Add14 | AAAA | BBBB | 22 | 2 | |||
I16 | AA | BB | Add14 | AAAA | BBBB | 1 | ||||
I17 | UU | KK | Add16 | UUUU | KKKK | 24 | 0 | |||
I18 | Com100 | Add17 | 26 | 0 | ||||||
I19 | UU | PP | Add17 | UUUU | PPPP | 26 | 0 | |||
I20 | NN | DD | Add19 | NNNN | DDDD | 27 | 0 | |||
I21 | LL | MM | Add20 | LLLL | MMMM | 28 | 0 | |||
I22 | GG | HH | Add21 | GGGG | HHHH | 29 | 0 | |||
I23 | NN | DD | Add5 | NNNN | DDDD | 13 | 1 | |||
I6 | Com50 | NN | DD | Add5 | NNNN | DDDD | 13 | 2 | ||
I24 | GG | HH | Add23 | GGGG | HHHH | 31 | 0 | |||
I25 | AA | BB | Add24 | AAAA | BBBB | 32 | 2 | |||
I26 | AA | BB | Add24 | AAAA | BBBB | 32 | 1 | |||
I27 | UU | KK | Add16 | UUUU | KKKK | 34 | 0 | |||
I28 | UU | SS | Add17 | UUUU | SSSS | 35 | 0 | |||
I29 | UU | PP | Add18 | UUUU | PPPP | 36 | 0 | |||
I30 | Com69 | UU | PP | Add4 | UUUU | PPPP | 12 | 2 | ||
I5 | Com69 | UU | PP | Add4 | UUUU | PPPP | 12 | 1 | ||
I31 | LL | MM | Add20 | LLLL | MMMM | 38 | 0 |
My try:
=LET(
colC,$B$2:$B$10000, colF,$C$2:$C$10000, colL,$D$2:$D$10000, colA,$E$2:$E$10000, colP,$J$2:$J$10000,
c,B2, f,C2, l,D2, a,E2, p,J2,
matches,(IF(c="",TRUE,colC=c))*(IF(f="",TRUE,colF=f))*(IF(l="",TRUE,colL=l))*(IF(a="",TRUE,colA=a))*(IF(p="",TRUE,colP=p)),
scoreAll,(colC<>"")+(colF<>"")+(colL<>"")+(colA<>"")+(colP<>""),
myScore,(c<>"")+(f<>"")+(l<>"")+(a<>"")+(p<>""),
maxScore,MAX(IF(matches,scoreAll)),
hasPhoneAtMax,MAX(IF(matches*(scoreAll=maxScore),--(colP<>""))),
candidates, matches*(scoreAll=maxScore)*(IF(hasPhoneAtMax=1,colP<>"",TRUE)),
earliest, MIN(IF(candidates,ROW($B$2:$B$1000))),
IF(myScore=0,"Delete it", IF(AND(myScore=maxScore, IF(hasPhoneAtMax=1,p<>"",TRUE), ROW()=earliest),"Keep","Delete it"))
)
This formula will exactly match the desired results shown in the question:
=let(
data, filter(hstack(B2:E, J2:J), len(A2:A)),
rowKeys, byrow(data, lambda(row,
join("µ", bycol(row, lambda(v,
if(len(v), to_text(v), "*")
)))
)),
map(rowKeys, sequence(rows(data)), lambda(key, i,
ifs(
countif(rowKeys, key) = 1,
if(rows(filter(rowKeys, match(rowKeys, key, 0))) > 1, 2, 0),
len(key) = max(filter(len(rowKeys), search(key, rowKeys))) * (i = match(key, rowKeys, 0)),
2,
true,
1
)
))
)
See let(), map(), byrow(), bycol(), hstack(), filter(), match() and sequence().