[SOLVED] identify duplicates across multiple columns with completeness rules

identify duplicates across multiple columns with completeness rules

I have been analyzing this data over the past few months and have observed that identifying duplicates is quite time-intensive. The duplicates are determined based on five columns, including separate First Name and Last Name columns. For your reference, I have highlighted these column headers in yellow.

The criteria I apply when removing duplicate entries are as follows:

If the "Company Name" is a duplicate, or if the First and Last Names are not duplicates, even if Address and Phone are duplicates—the row is not considered a duplicate overall. This applies to all five columns.
If any cell is blank but the corresponding row shows duplicates in other columns, the blank one will be removed.
If all five columns indicate a duplicate, then one instance is retained, and the others are removed.

I have updated all examples in Google sheets where Green highlighted rows indicate duplicate and I have added notes in the last column which specifies which rows should be removed and which one should keep.

Output should be numeric in a helper column:

2 → This row is a duplicate group’s “winner” (keep this).

1 → This row is a duplicate (delete this).

0 → This row is unique (not a duplicate).

This table shows sample data in columns A:J, with the desired result in the last column:

Index	Company	First	Last	Addr1	Addr2	City	Phone	desired result
I1	Com1	AA	BB	Add1	AAAA	BBBB	9	2
I2	Com1	AA	BB		AAAA	BBBB	9	1
I3	Com2	UU	KK	Add2	UUUU	KKKK	10	0
I4	Com2	UU	SS	Add2	UUUU	SSSS	10	0
I7	Com3	LL	MM	Add6	LLLL	MMMM	14	0
I8	Com4	GG	HH	Add7	GGGG	HHHH	15	0
I9	Com5	PP	LL	Add8	PPPP	LLLL	16	0
I10	Com5			Add7			16	0
I13	Com5	PP	LL	Add12	PPPP	LLLL	16	0
I11	Com7			Add10			18	2
I12	Com7						18	1
I14				Add13			21	0
I15		AA	BB	Add14	AAAA	BBBB	22	2
I16		AA	BB	Add14	AAAA	BBBB		1
I17		UU	KK	Add16	UUUU	KKKK	24	0
I18	Com100			Add17			26	0
I19		UU	PP	Add17	UUUU	PPPP	26	0
I20		NN	DD	Add19	NNNN	DDDD	27	0
I21		LL	MM	Add20	LLLL	MMMM	28	0
I22		GG	HH	Add21	GGGG	HHHH	29	0
I23		NN	DD	Add5	NNNN	DDDD	13	1
I6	Com50	NN	DD	Add5	NNNN	DDDD	13	2
I24		GG	HH	Add23	GGGG	HHHH	31	0
I25		AA	BB	Add24	AAAA	BBBB	32	2
I26		AA	BB	Add24	AAAA	BBBB	32	1
I27		UU	KK	Add16	UUUU	KKKK	34	0
I28		UU	SS	Add17	UUUU	SSSS	35	0
I29		UU	PP	Add18	UUUU	PPPP	36	0
I30	Com69	UU	PP	Add4	UUUU	PPPP	12	2
I5	Com69	UU	PP	Add4	UUUU	PPPP	12	1
I31		LL	MM	Add20	LLLL	MMMM	38	0

Sheet

My try:

=LET(
  colC,$B$2:$B$10000, colF,$C$2:$C$10000, colL,$D$2:$D$10000, colA,$E$2:$E$10000, colP,$J$2:$J$10000,
  c,B2, f,C2, l,D2, a,E2, p,J2,
  matches,(IF(c="",TRUE,colC=c))*(IF(f="",TRUE,colF=f))*(IF(l="",TRUE,colL=l))*(IF(a="",TRUE,colA=a))*(IF(p="",TRUE,colP=p)),
  scoreAll,(colC<>"")+(colF<>"")+(colL<>"")+(colA<>"")+(colP<>""),
  myScore,(c<>"")+(f<>"")+(l<>"")+(a<>"")+(p<>""),
  maxScore,MAX(IF(matches,scoreAll)),
  hasPhoneAtMax,MAX(IF(matches*(scoreAll=maxScore),--(colP<>""))),
  candidates, matches*(scoreAll=maxScore)*(IF(hasPhoneAtMax=1,colP<>"",TRUE)),
  earliest, MIN(IF(candidates,ROW($B$2:$B$1000))),
  IF(myScore=0,"Delete it", IF(AND(myScore=maxScore, IF(hasPhoneAtMax=1,p<>"",TRUE), ROW()=earliest),"Keep","Delete it"))
)

Solution

This formula will exactly match the desired results shown in the question:

=let(
  data, filter(hstack(B2:E, J2:J), len(A2:A)),
  rowKeys, byrow(data, lambda(row,
    join("µ", bycol(row, lambda(v,
      if(len(v), to_text(v), "*")
    )))
  )),
  map(rowKeys, sequence(rows(data)), lambda(key, i,
    ifs(
      countif(rowKeys, key) = 1,
        if(rows(filter(rowKeys, match(rowKeys, key, 0))) > 1, 2, 0),
      len(key) = max(filter(len(rowKeys), search(key, rowKeys))) * (i = match(key, rowKeys, 0)),
        2,
      true,
        1
    )
  ))
)

See let(), map(), byrow(), bycol(), hstack(), filter(), match() and sequence().