google-sheetsgoogle-sheets-formula

identify duplicates across multiple columns with completeness rules


I have been analyzing this data over the past few months and have observed that identifying duplicates is quite time-intensive. The duplicates are determined based on five columns, including separate First Name and Last Name columns. For your reference, I have highlighted these column headers in yellow.

The criteria I apply when removing duplicate entries are as follows:

  1. If the "Company Name" is a duplicate, or if the First and Last Names are not duplicates, even if Address and Phone are duplicates—the row is not considered a duplicate overall. This applies to all five columns.

  2. If any cell is blank but the corresponding row shows duplicates in other columns, the blank one will be removed.

  3. If all five columns indicate a duplicate, then one instance is retained, and the others are removed.

I have updated all examples in Google sheets where Green highlighted rows indicate duplicate and I have added notes in the last column which specifies which rows should be removed and which one should keep.

Output should be numeric in a helper column:

2 → This row is a duplicate group’s “winner” (keep this).

1 → This row is a duplicate (delete this).

0 → This row is unique (not a duplicate).

This table shows sample data in columns A:J, with the desired result in the last column:

Index Company First Last Addr1 Addr2 City State Zip Phone desired result
I1 Com1 AA BB Add1 AAAA BBBB 9 2
I2 Com1 AA BB AAAA BBBB 9 1
I3 Com2 UU KK Add2 UUUU KKKK 10 0
I4 Com2 UU SS Add2 UUUU SSSS 10 0
I7 Com3 LL MM Add6 LLLL MMMM 14 0
I8 Com4 GG HH Add7 GGGG HHHH 15 0
I9 Com5 PP LL Add8 PPPP LLLL 16 0
I10 Com5 Add7 16 0
I13 Com5 PP LL Add12 PPPP LLLL 16 0
I11 Com7 Add10 18 2
I12 Com7 18 1
I14 Add13 21 0
I15 AA BB Add14 AAAA BBBB 22 2
I16 AA BB Add14 AAAA BBBB 1
I17 UU KK Add16 UUUU KKKK 24 0
I18 Com100 Add17 26 0
I19 UU PP Add17 UUUU PPPP 26 0
I20 NN DD Add19 NNNN DDDD 27 0
I21 LL MM Add20 LLLL MMMM 28 0
I22 GG HH Add21 GGGG HHHH 29 0
I23 NN DD Add5 NNNN DDDD 13 1
I6 Com50 NN DD Add5 NNNN DDDD 13 2
I24 GG HH Add23 GGGG HHHH 31 0
I25 AA BB Add24 AAAA BBBB 32 2
I26 AA BB Add24 AAAA BBBB 32 1
I27 UU KK Add16 UUUU KKKK 34 0
I28 UU SS Add17 UUUU SSSS 35 0
I29 UU PP Add18 UUUU PPPP 36 0
I30 Com69 UU PP Add4 UUUU PPPP 12 2
I5 Com69 UU PP Add4 UUUU PPPP 12 1
I31 LL MM Add20 LLLL MMMM 38 0

Sheet

My try:

=LET(
  colC,$B$2:$B$10000, colF,$C$2:$C$10000, colL,$D$2:$D$10000, colA,$E$2:$E$10000, colP,$J$2:$J$10000,
  c,B2, f,C2, l,D2, a,E2, p,J2,
  matches,(IF(c="",TRUE,colC=c))*(IF(f="",TRUE,colF=f))*(IF(l="",TRUE,colL=l))*(IF(a="",TRUE,colA=a))*(IF(p="",TRUE,colP=p)),
  scoreAll,(colC<>"")+(colF<>"")+(colL<>"")+(colA<>"")+(colP<>""),
  myScore,(c<>"")+(f<>"")+(l<>"")+(a<>"")+(p<>""),
  maxScore,MAX(IF(matches,scoreAll)),
  hasPhoneAtMax,MAX(IF(matches*(scoreAll=maxScore),--(colP<>""))),
  candidates, matches*(scoreAll=maxScore)*(IF(hasPhoneAtMax=1,colP<>"",TRUE)),
  earliest, MIN(IF(candidates,ROW($B$2:$B$1000))),
  IF(myScore=0,"Delete it", IF(AND(myScore=maxScore, IF(hasPhoneAtMax=1,p<>"",TRUE), ROW()=earliest),"Keep","Delete it"))
)

Solution

  • This formula will exactly match the desired results shown in the question:

    =let(
      data, filter(hstack(B2:E, J2:J), len(A2:A)),
      rowKeys, byrow(data, lambda(row,
        join("µ", bycol(row, lambda(v,
          if(len(v), to_text(v), "*")
        )))
      )),
      map(rowKeys, sequence(rows(data)), lambda(key, i,
        ifs(
          countif(rowKeys, key) = 1,
            if(rows(filter(rowKeys, match(rowKeys, key, 0))) > 1, 2, 0),
          len(key) = max(filter(len(rowKeys), search(key, rowKeys))) * (i = match(key, rowKeys, 0)),
            2,
          true,
            1
        )
      ))
    )
    

    See let(), map(), byrow(), bycol(), hstack(), filter(), match() and sequence().