excelexcel-formulafuzzy-searchfuzzy-comparison

Grouping similar text?


I have a list of landowners and whatever is highlighted shows a similar text string. these highlighted groupings are the same landowner but a slightly different text string. I was thinking maybe column B could have an index like I show below?

enter image description here

Is there a way I can use a formula or something to group these? I can't really do a fuzzy lookup because I don't have a list of names to match them to. I already tried using something like Tableau Prep's pronunciation grouping, but that does more harm than good.

Are there any options here?


Solution

  • From what I can determine, version 16.8 appears to correspond to Excel 2019 on the Mac, which means you're in helper-column territory: screenshot illustrating array-formula based suggestion

    the array formula in C2 is

    =MATCH(1,LEN(SUBSTITUTE(MID(A2,ROW(A$1:INDEX(A:A,LEN(A2))),5)," ",""))-5+(CODE(SUBSTITUTE(MID(A2,ROW(A$1:INDEX(A:A,LEN(A2))),5)," ",""))>64),0)
    

    and the regular formula in B2 is

    =MID(A2,C2,FIND(" ",A2,C2)-C2)
    

    The formula in column C returns the position of the first 5-character substring of the name in column A to begin with an actual letter. To track a different length substring you would need to replace the 3 instances of 5 with another number (e.g. updating the 3 x 5 to 3 x 4 would return almost the same results, the difference being that B10 would then show EAST).

    The formula in column B is just returning the substring from the name, starting with the position indicated in column C, up until the first instance of a space character after the position in column C so it would return an error if there were no spaces after the 5-character substring (unlikely, with all the LLC entries).

    As stated in my comment, this might not work for all of your data but if you sort any errors to the end you could then update the formula in column C to track a substring of a different length, and then sort any remaining errors to the end etc.