arrayssortinggoogle-apps-scriptgoogle-sheetsfuzzy-logic

Is there a Google Apps Script for Fuzzy Lookups?


I have a list of companies on a spreadsheet that is rarely updated. I'll call it List A.

I also have a constantly updating weekly list of companies (List B) that should have entries that match some entries on List A.

The reality is that the data extracted from List B's company names are often inconsistent due to various business abbreviations (e.g. The Company, Company Ltd., Company Accountants Limited). Sometimes, these companies are under different trading names or have various mispellings.

My initial very not intelligent reaction was to construct a table of employer alias names, with the first column being the true employer name and the following columns holding alises, something like this: [https://i.sstatic.net/2cmYv.png]

On the left is a sample table, and the far right is a column where I am using the following array formula template:

=ArrayFormula(INDEX(A30:A33,MATCH(1,MMULT(--(B30:E33=H30),TRANSPOSE(COLUMN(B30:E33)^0)),0)))

I realized soon after that I needed to create a new entry for every single exact match variation (Ltd., Ltd, and Limited), so I looked into fuzzy lookups. I was really impressed by Alan's Fuzzy Matching UDFs, but my needs heavily lean towards using Google Spreadsheets rather than VBA.

Sorry for the long post, but I would be grateful if anyone has any good suggestions for fuzzy lookups or can suggest an alternative solution.


Solution

  • The comments weren't exactly what I was looking for, but they did provide some inspiration for me to come up with a bandaid solution.

    My original array formula needed exact matches, but the problem was that there were simply too many company suffixes and alternate names, so I looked into fuzzy lookups.

    My current answer is to abandon the fuzzy lookup proposal and instead focus on editing the original data string (i.e. company names) into more simplified substring. Grabbing with a few codes floating around, I came up with a combined custom formula that implements two lines for GApps Script:

     var companysuffixremoval = str.toString().replace(/limited|ltd|group|holdings|plc|llp|sons|the/ig, "");
    
     var alphanumericalmin = str.replace(/[^A-Za-z0-9]/g,"")
    

    The first line is simply my idea of removing popular company suffixes and the "the" from the string.

    The second line is removing all non-alphanumerical characters, as well as any spaces and periods.

    This ensure "The First Company Limited." and "First Company Ltd" become "FirstCompany", which should work with returning the same values from the original array formula in the OP. Of course, I also implemented a trimming and cleaning line for any trailing/leading/extra spaces for the initial string, but that's probably unncessary with the second line.

    If someone can come up with a better idea, please do tell. As it stands, I'm just tinkering with a script with minimal experience.