I have been developing a matching system that matches rows from a client's database against our central database based on similarity. I use a hybrid approach that maps the company, model, variant and other features of a product. For string fields, I apply either fuzzy matching or SBERT embedding similarity, depending on which performed better on that particular field.
The model is built and works reasonably well but, as expected, it is a bit slow. To speed it up, I tried RapidFuzz, which is much faster than FuzzyWuzzy. This is how I define my fuzzy score function:
def cached_fuzzy_score(a, b):
    return fuzz.token_set_ratio(a, b) / 100
The only change I made was replacing 'from fuzzywuzzy import fuzz' with 'from rapidfuzz import fuzz'. This usually works well and generally gives the same fuzzy scores as fuzzywuzzy. Still, the final matching returns nothing. I tried several things, such as wrapping the score in float() in case the return type was the issue, and changing the thresholds, but all in vain: the scores stay the same. Interestingly, when I set the threshold to 0, it worked fine, but I am not sure that is the right fix.
def get_candidates_vectorized(bank_make, central_df, threshold=60):
    # Use fuzzy matching on make names
    make_scores = central_df['make_name'].apply(
        lambda x: fuzz.token_set_ratio(bank_make, x)
    )
    return central_df[make_scores > threshold].index.tolist()

# Optimized scoring function
def calculate_scores_batch(bank_rows, central_indices, central_df,
                           model_vectorizer, segment_vectorizer,
                           model_matrix, segment_matrix,
                           segment_embeddings, identity_embeddings,
                           central_variant_embeddings, central_identity_embeddings,
                           weights):
    results = []
    bank_models = [row['bank_model'] for row in bank_rows]
    model_sims = batch_semantic_similarity(bank_models, model_vectorizer, model_matrix)

    # Convert embeddings to torch tensors
    import torch
    bank_segment_embeddings = torch.stack(segment_embeddings)
    bank_identity_embeddings = torch.stack(identity_embeddings)
    sbert_segment_sims = util.pytorch_cos_sim(bank_segment_embeddings, central_variant_embeddings)
    sbert_identity_sims = util.pytorch_cos_sim(bank_identity_embeddings, central_identity_embeddings)

    for i, bank_row in enumerate(bank_rows):
        best_score = 0
        best_match_idx = None
        # Score every candidate for this bank row and keep the best one
        for central_idx in central_indices[i]:
            central_row = central_df.iloc[central_idx]
            make_score = cached_fuzzy_score(bank_row['bank_make'], central_row['make_name'])
            model_score = model_sims[i, central_idx]
            segment_score = sbert_segment_sims[i][central_idx].item()
            fuel_score = cached_fuzzy_score(
                str(bank_row.get('extracted_fuel_type', '')), str(central_row['fuel_type'])
            )
            transmission_score = cached_fuzzy_score(
                str(bank_row.get('transmission_type', '')), str(central_row['transmission'])
            )
            displacement_score = cached_fuzzy_score(
                str(bank_row.get('extracted_displacement', '')), str(central_row['displacement_formatted'])
            )
            identity_score = sbert_identity_sims[i][central_idx].item()

            # BS rating comparison
            bank_bs = bank_row.get('bs_rating')
            central_bs = central_row.get('bs_rating')
            if bank_bs is not None and central_bs is not None:
                try:
                    bs_score = 1.0 - (abs(float(bank_bs) - float(central_bs)) / 3.0)
                    bs_score = max(bs_score, 0.0)
                except (TypeError, ValueError):
                    # Non-numeric ratings fall back to fuzzy comparison
                    bs_score = cached_fuzzy_score(str(bank_bs), str(central_bs))
            else:
                bs_score = 0.0

            total_score = (
                weights['make'] * make_score +
                weights['model'] * model_score +
                weights['segment'] * segment_score +
                weights['fuel'] * fuel_score +
                weights['transmission'] * transmission_score +
                weights['displacement'] * displacement_score +
                weights.get('bs', 0) * bs_score +
                weights.get('identity', 0.1) * identity_score
            )
            if total_score > best_score:
                best_score = total_score
                best_match_idx = central_idx
        results.append((best_match_idx, best_score))
    return results

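So, the root problem lies in the get_candidates_vectorized function. Unlike fuzzywuzzy, which lowercases its inputs by default, RapidFuzz (at least in recent versions) compares the strings exactly as given, so its scores are case-sensitive. If bank_make and make_name differ in casing, the scores fall below the threshold and the entire central table is filtered out before scoring even starts. Change the function so that both bank_make and x are lowercased (add .lower() to each):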
def get_candidates_vectorized(bank_make, central_df, threshold=60):
    # Use fuzzy matching on make names
    make_scores = central_df['make_name'].apply(
        lambda x: fuzz.token_set_ratio(bank_make.lower(), x.lower())
    )
    return central_df[make_scores > threshold].index.tolist()
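
The effect of casing on a threshold filter can be reproduced without either library. The sketch below is only an illustration: it uses the standard library's difflib.SequenceMatcher as a stand-in scorer (it is not RapidFuzz's token_set_ratio), and the names normalize, similarity, get_candidates and the sample makes are hypothetical. It assumes the same 0-100 scale and the same threshold of 60 as the function above.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so casing cannot affect the score."""
    return " ".join(str(text).lower().split())


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (difflib, not RapidFuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def get_candidates(bank_make, central_makes, threshold=60):
    """Return positions of central makes whose normalized score exceeds threshold."""
    query = normalize(bank_make)
    return [
        i for i, make in enumerate(central_makes)
        if similarity(query, normalize(make)) > threshold
    ]


# Hypothetical sample data: the central table stores one make in upper case.
central_makes = ["Maruti Suzuki", "HONDA", "Tata"]

# Same word, different casing: a case-sensitive scorer sees no common characters.
raw_score = similarity("honda", "HONDA")

# After normalizing both sides, the strings are identical.
clean_score = similarity(normalize("honda"), normalize("HONDA"))

candidates = get_candidates("honda", central_makes)
```

Normalizing the query once outside the loop, as get_candidates does here, also avoids repeating the same .lower() call for every row of the central table.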