Tags: python, loops, fuzzywuzzy

Speeding up a Python loop for text comparisons


I have a loop that compares street addresses. It uses fuzzy matching (I have tried both fuzzywuzzy and rapidfuzz) to tokenise the addresses, compare them, and return how close the match is. The aim is to take all of my street addresses (30k or so) and match each variation of an address to a structured street address in my dataset. The end result would be a reference table with two columns: the structured reference address and the variation that matched it.
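
Something like this (made-up rows, just to illustrate the shape I am after):

reference_address              matched_address
1 Main Street, Springfield     1 Main St Springfield
1 Main Street, Springfield     Main Street 1, Springfield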

I am not a huge Python user, but I do know that for loops are a last resort for most problems (third answer). With that in mind I have used for loops anyway, and my loop will take approximately 235 hours to run, which is sub-optimal to say the least. I have created a reproducible example below. Can anyone see where I can make any tweaks? I have added a progress bar to give you an idea of the speed. You can increase the number of addresses by changing the line for _ in range(20):

import pandas as pd
from tqdm import tqdm
from faker import Faker
from rapidfuzz import process, fuzz

# GENERATE FAKE ADDRESSES FOR THE REPRODUCIBLE EXAMPLE -----------------------------------------------
fake = Faker()
fake_addresses = pd.DataFrame()

for _ in range(20):
    # Generate fake address
    d = {'add':fake.address()}
    df = pd.DataFrame(data = [d])

    # Append it to the addresses dataframe
    fake_addresses = pd.concat([fake_addresses, df])

fake_addresses = fake_addresses.reset_index(drop=True)

# COMPARE ADDRESSES ---------------------------------------------------------------------------------   
# Here we build a "dictionary" of the addresses, where the left side is a reference address
# and the right side holds all the different variations of that address. For this example the
# addresses only have to be 0% similar, so everything is kept; normally this would be 95%
reference = fake_addresses['add'].drop_duplicates()
ref_addresses = pd.DataFrame()

# This takes a long time. I have added tqdm to show how long it takes when the number of addresses is increased dramatically
for address in tqdm(reference):
    for raw_address in reference:
        result = fuzz.token_sort_ratio(address, raw_address)
        
        d = {'reference_address': address, 
             'matched_address': raw_address,
             'matched_result': result}
        
        df = pd.DataFrame(data = [d])
        
        # Keep rows at or above the similarity threshold (0 here, so every row is kept)
        filt = df['matched_result'] >= 0
        df = df.loc[filt]
        ref_addresses = pd.concat([ref_addresses, df], ignore_index=True)

Solution

  • I would start by pre-calculating the sorted tokens once for each address so that you don't end up doing it n*(n-1) times. This allows you to bypass the processor and avoid calling the fuzzer's sort on every comparison. After that, I would take pandas out of the picture, at least while doing these tests.
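
    To see what that pre-processing produces, here is a minimal sketch for a single address (one of the faked addresses from the output below), using rapidfuzz's default_process the same way the full example does:

    import rapidfuzz

    # default_process lower-cases and strips punctuation; sorting the tokens
    # once here is what the fuzzer would otherwise redo on every comparison
    address = "USNS Brown\nFPO AE 20210"
    tokens = rapidfuzz.utils.default_process(address).split(" ")
    print(" ".join(sorted(x.strip() for x in tokens if x.strip())))
    # -> "20210 ae brown fpo usns"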

    This test runs through 1k faked addresses in about 2 seconds and 10k in about 4 1/2 minutes. At the moment, if addr1 and addr2 are similar we record that both ways:

    import faker
    import rapidfuzz
    import itertools
    import tqdm
    
    ## -----------------------
    ## Generate some "fake" addresses
    ## apply the default process once to each address
    ## sort the tokens once for each address
    ## -----------------------
    faked_address_count = 1_000
    make_fake = faker.Faker()
    fake_addresses = [make_fake.address() for _ in range(faked_address_count)]
    fake_addresses = {
        address: " ".join(sorted(
            x.strip()
            for x
            in rapidfuzz.utils.default_process(address).split(" ")
            if x.strip()
        ))
        for address
        in fake_addresses
    }
    ## -----------------------
    
    ## -----------------------
    ## Find similar addresses
    ## -----------------------
    threshold = 82
    results = {}
    pairs = itertools.combinations(fake_addresses.items(), 2)
    for (addr1, processed_addr1), (addr2, processed_addr2) in tqdm.tqdm(pairs):
        similarity = rapidfuzz.fuzz.token_ratio(
            processed_addr1,
            processed_addr2,
            processor=None
        )
    
        if similarity > threshold:
            results.setdefault(addr1, []).append([addr2, similarity])
            results.setdefault(addr2, []).append([addr1, similarity])  # also record 2 similar to 1?
    ## -----------------------
    
    ## -----------------------
    ## Print the final results
    ## -----------------------
    import json
    print(json.dumps(results, indent=4))
    ## -----------------------
    

    generating a result like:

    {
        "PSC 0586, Box 8976\nAPO AE 52923": [
            ["PSC 0586, Box 6535\nAPO AE 76148", 83.33333333333334]
        ],
        "PSC 0586, Box 6535\nAPO AE 76148": [
            ["PSC 0586, Box 8976\nAPO AE 52923", 83.33333333333334]
        ],
        "USNS Brown\nFPO AE 20210": [
            ["USNV Brown\nFPO AE 70242", 82.6086956521739]
        ],
        "USNV Brown\nFPO AE 70242": [
            ["USNS Brown\nFPO AE 20210", 82.6086956521739]
        ]
    }
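
    Note that itertools.combinations generates each unordered pair only once, so this does roughly half the comparisons of the original nested loop; the two setdefault calls are what record the match in both directions.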
    

    To create a version that more directly aligns with what I think your inputs and outputs might be, take a look at the following. This will compare 1k standard addresses to 10k test addresses in about 40 seconds (the example below uses 2k test addresses so it finishes quickly):

    import faker
    import rapidfuzz
    import itertools
    import tqdm
    
    ## -----------------------
    ## Generate sets of standard addresses and test addresses
    ## -----------------------
    make_fake = faker.Faker()
    standard_addresses = [make_fake.address() for _ in range(1_000)]
    test_addresses = [make_fake.address() for _ in range(2_000)]
    ## -----------------------
    
    ## -----------------------
    ## pre-process the addresses
    ## -----------------------
    def address_normalizer(address):
        return " ".join(sorted(
            x.strip()
            for x
            in rapidfuzz.utils.default_process(address).split(" ")
            if x.strip()
        ))
    standard_addresses = {address: address_normalizer(address) for address in standard_addresses}
    test_addresses = {address: address_normalizer(address) for address in test_addresses}
    ## -----------------------
    
    ## -----------------------
    ## Create a list to hold our results
    ## -----------------------
    results = []
    ## -----------------------
    
    ## -----------------------
    ## Find similar addresses
    ## -----------------------
    threshold = 82
    pairs = itertools.product(standard_addresses.items(), test_addresses.items())
    for standard_address_kvp, test_address_kvp in tqdm.tqdm(pairs):
        similarity = rapidfuzz.fuzz.token_ratio(
            standard_address_kvp[1],
            test_address_kvp[1],
            processor=None
        )
    
        if similarity > threshold:
            results.append([standard_address_kvp[0], test_address_kvp[0]])
    ## -----------------------
    
    for pair in results:
        print(pair)
    

    That will generate a result like:

    ['USS Collins\nFPO AE 04706', 'USNS Collins\nFPO AE 40687']
    ['USS Miller\nFPO AP 15314', 'USS Miller\nFPO AP 91203']
    ['Unit 4807 Box 9762\nDPO AP 67543', 'Unit 6542 Box 9721\nDPO AP 48806']
    ['Unit 4807 Box 9762\nDPO AP 67543', 'Unit 9692 Box 6420\nDPO AP 46850']
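
    If that is still too slow at full scale, rapidfuzz can also score whole arrays in one call and spread the work across CPU cores. This is a minimal sketch rather than a drop-in replacement: it assumes the standard_addresses and test_addresses dicts from the example above, and uses rapidfuzz's process.cdist (available in recent rapidfuzz versions):

    import numpy as np
    from rapidfuzz import process, fuzz

    # Score every pre-processed standard address against every pre-processed
    # test address in a single call. workers=-1 spreads the work across all
    # CPU cores; score_cutoff zeroes out scores below the threshold.
    scores = process.cdist(
        list(standard_addresses.values()),
        list(test_addresses.values()),
        scorer=fuzz.token_ratio,
        processor=None,
        score_cutoff=82,
        workers=-1,
    )

    # Recover the original (un-normalized) addresses for each surviving pair
    standard_keys = list(standard_addresses)
    test_keys = list(test_addresses)
    results = [
        [standard_keys[i], test_keys[j]]
        for i, j in zip(*np.nonzero(scores))
    ]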