I have a loop that compares street addresses. It uses fuzzy matching to tokenise the addresses, compare them, and then return how close the match is. I have tried this with both fuzzywuzzy and rapidfuzz.
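For example, this is the kind of score I mean; a quick sketch using rapidfuzz (the address strings here are invented):

from rapidfuzz import fuzz

# token_sort_ratio tokenises both strings, sorts the tokens and then compares them,
# so word order doesn't matter; 100 means the strings match exactly after tokenising.
print(fuzz.token_sort_ratio("12 High Street Springfield", "High Street 12 Springfield"))  # high score
print(fuzz.token_sort_ratio("12 High Street Springfield", "99 Low Road Shelbyville"))     # low score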
The aim is to take all of my street addresses (30k or so) and match each variation of a street address to a structured street address in my dataset. The end result would be a reference table with two columns: the reference address and the matched variation of that address.
I am not a huge Python user, but I do know that for loops are a last resort for most problems (third answer). With that in mind, I have used for loops. However, my loops will take approx. 235 hours, which is sub-optimal to say the least. I have created a reproducible example below. Can anyone see where I can make any tweaks? I have added a progress bar to give you an idea of the speed. You can increase the number of addresses by changing the line for _ in range(20):
import pandas as pd
from tqdm import tqdm
from faker import Faker
from rapidfuzz import process, fuzz
# GENERATE FAKE ADDRESSES FOR THE REPRODUCIBLE EXAMPLE -----------------------------------------------
fake = Faker()
fake_addresses = pd.DataFrame()
for _ in range(20):
    # Generate fake address
    d = {'add': fake.address()}
    df = pd.DataFrame(data=[d])

    # Append it to the addresses dataframe
    fake_addresses = pd.concat([fake_addresses, df])

fake_addresses = fake_addresses.reset_index(drop=True)
# COMPARE ADDRESSES ---------------------------------------------------------------------------------
# Here we are making a "dictionary" of the addresses where the left side is the reference address
# and the right side is all the different variations of that address. In this example the addresses
# only have to be 0% similar; normally this would be a 95% similarity threshold
reference = fake_addresses['add'].drop_duplicates()
ref_addresses = pd.DataFrame()
# This takes a long time. I have added tqdm to show how long it takes when the number of addresses is increased dramatically
for address in tqdm(reference):
    for raw_address in reference:
        result = fuzz.token_sort_ratio(address, raw_address)
        d = {'reference_address': address,
             'matched_address': raw_address,
             'matched_result': result}
        df = pd.DataFrame(data=[d])

        if len(df.index) > 0:
            filt = df['matched_result'] >= 0
            df = df.loc[filt]
            ref_addresses = pd.concat([ref_addresses, df], ignore_index=True)
        else:
            ref_addresses = ref_addresses
I would start by pre-calculating the sorted tokens once for each address so that you don't end up doing it n*(n-1) times. This allows you to bypass the scorer's processor and avoid it re-sorting the tokens on every comparison. After that, I would take pandas out of the picture, at least while doing these tests.
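To illustrate what that pre-processing step produces, here is a small sketch (the address string is invented; the comments describe rapidfuzz's default_process behaviour as I understand it):

import rapidfuzz

address = "742 Evergreen Terrace\nSpringfield, OR 97403"  # made-up example address

# default_process lowercases the string, replaces non-alphanumeric characters with
# whitespace and trims it, which is the work a scorer would otherwise repeat on every call.
processed = rapidfuzz.utils.default_process(address)

# Sorting the tokens once up front means the scorer no longer has to sort them
# for every one of the n*(n-1)/2 comparisons.
sorted_tokens = " ".join(sorted(t for t in processed.split(" ") if t.strip()))
print(sorted_tokens)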
This test runs through 1k faked addresses in about 2 seconds and 10k in about 4 1/2 minutes. At the moment, if addr1 and addr2 are similar we record that both ways:
import faker
import rapidfuzz
import itertools
import tqdm
## -----------------------
## Generate some "fake" addresses
## apply the default process once to each address
## sort the tokens once for each address
## -----------------------
faked_address_count = 1_000
make_fake = faker.Faker()
fake_addresses = [make_fake.address() for _ in range(faked_address_count)]
fake_addresses = {
    address: " ".join(sorted(
        x.strip()
        for x
        in rapidfuzz.utils.default_process(address).split(" ")
        if x.strip()
    ))
    for address
    in fake_addresses
}
## -----------------------
## -----------------------
## Find similar addresses
## -----------------------
threshhold = 82
results = {}
pairs = itertools.combinations(fake_addresses.items(), 2)
for (addr1, processed_addr1), (addr2, processed_addr2) in tqdm.tqdm(pairs):
    similarity = rapidfuzz.fuzz.token_ratio(
        processed_addr1,
        processed_addr2,
        processor=None
    )

    if similarity > threshhold:
        results.setdefault(addr1, []).append([addr2, similarity])
        results.setdefault(addr2, []).append([addr1, similarity])  # also record 2 similar to 1?
## -----------------------
## -----------------------
## Print the final results
## -----------------------
import json
print(json.dumps(results, indent=4))
## -----------------------
generating a result like:
{
    "PSC 0586, Box 8976\nAPO AE 52923": [
        ["PSC 0586, Box 6535\nAPO AE 76148", 83.33333333333334]
    ],
    "PSC 0586, Box 6535\nAPO AE 76148": [
        ["PSC 0586, Box 8976\nAPO AE 52923", 83.33333333333334]
    ],
    "USNS Brown\nFPO AE 20210": [
        ["USNV Brown\nFPO AE 70242", 82.6086956521739]
    ],
    "USNV Brown\nFPO AE 70242": [
        ["USNS Brown\nFPO AE 20210", 82.6086956521739]
    ]
}
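As an aside, and not something the code above relies on, rapidfuzz can also score many pairs at once in native code via process.cdist. A rough sketch of the same all-pairs idea, reusing the kind of pre-processed strings built above (the sample strings are invented):

import numpy as np
from rapidfuzz import process, fuzz

# Pre-processed (lowercased, token-sorted) address strings, as built above;
# these particular values are made up for illustration.
processed = ["123 anytown ave main st", "124 anytown ave main st", "box 99 fpo ae"]

# cdist scores every string against every other one in native code and can use
# all CPU cores (workers=-1); it returns a square matrix of scores.
scores = process.cdist(processed, processed, scorer=fuzz.token_ratio,
                       processor=None, workers=-1)

# Keep only the upper triangle so each pair (and self-matches) is counted once.
for i, j in zip(*np.where(np.triu(scores, k=1) > 82)):
    print(processed[i], "<->", processed[j], scores[i, j])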
To create a version that more directly aligns with what I think your inputs and outputs might be, you can take a look at the following. This will compare 1k standard addresses to 10k test addresses in about 40 seconds:
import faker
import rapidfuzz
import itertools
import tqdm
## -----------------------
## Generate sets of standard addresses and test addresses
## -----------------------
make_fake = faker.Faker()
standard_addresses = [make_fake.address() for _ in range(1_000)]
test_addresses = [make_fake.address() for _ in range(2_000)]
## -----------------------
## -----------------------
## pre-process the addresses
## -----------------------
def address_normalizer(address):
    return " ".join(sorted(
        x.strip()
        for x
        in rapidfuzz.utils.default_process(address).split(" ")
        if x.strip()
    ))
standard_addresses = {address: address_normalizer(address) for address in standard_addresses}
test_addresses = {address: address_normalizer(address) for address in test_addresses}
## -----------------------
## -----------------------
## Create a list to hold our results
## -----------------------
results = []
## -----------------------
## -----------------------
## Find similar addresses
## -----------------------
threshhold = 82
pairs = itertools.product(standard_addresses.items(), test_addresses.items())
for standard_address_kvp, test_address_kvp in tqdm.tqdm(pairs):
    similarity = rapidfuzz.fuzz.token_ratio(
        standard_address_kvp[1],
        test_address_kvp[1],
        processor=None
    )

    if similarity > threshhold:
        results.append([standard_address_kvp[0], test_address_kvp[0]])
## -----------------------
for pair in results:
    print(pair)
That will generate a result like:
['USS Collins\nFPO AE 04706', 'USNS Collins\nFPO AE 40687']
['USS Miller\nFPO AP 15314', 'USS Miller\nFPO AP 91203']
['Unit 4807 Box 9762\nDPO AP 67543', 'Unit 6542 Box 9721\nDPO AP 48806']
['Unit 4807 Box 9762\nDPO AP 67543', 'Unit 9692 Box 6420\nDPO AP 46850']
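If you then want the two-column reference table described in the question, the results list from that last script already contains the pairs; a minimal sketch that loads it into pandas (the column names are just a suggestion):

import pandas as pd

# 'results' is the list of [standard_address, test_address] pairs built above
ref_addresses = pd.DataFrame(results, columns=['reference_address', 'matched_address'])
print(ref_addresses.head())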