I'm trying to discern the string similarity between two strings (using Jaro). Each string resides in a separate column in my dataframe.
String 1 = df['name_one']
String 2 = df['name_two']
When I try to run my string similarity logic:
from pyjarowinkler import distance
df['distance'] = df.apply(lambda d: distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)
I get the following error:
**error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)**
Great, so there is a nonetype in the columns, so the first thing I do is check for this:
maskone = df['name_one'] == None
df[maskone]
masktwo = df['name_two'] == None
df[masktwo]
This finds no None values. I'm scratching my head at this point, but I proceed to clean the two columns anyway.
df['name_one'] = df['name_one'].fillna('').astype(str)
df['name_two'] = df['name_two'].fillna('').astype(str)
And yet, I'm still getting:
error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)
Am I removing NoneTypes correctly?
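The issue isn't only NoneTypes. Empty strings also raise this exception, despite what the message says. The "(str, str)" at the end of the error shows that both arguments were already strings, and after fillna('') every missing value becomes an empty string. You can see the check in the implementation of distance.get_jaro_distance: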
if not first or not second:
    raise JaroDistanceException("Cannot calculate distance from NoneType ({0}, {1})".format(
        first.__class__.__name__,
        second.__class__.__name__))
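To make the check above concrete, here is a small sketch in plain Python (no pandas or pyjarowinkler needed) showing which values are falsy once they have been wrapped in str(), as your lambda does. The `candidates` list is made up for illustration.

```python
# Values that might sit in a name column, before the str() wrapping in the lambda.
candidates = [None, float("nan"), "", "None", "nan", "Alice"]

for value in candidates:
    as_text = str(value)
    # `not as_text` is True only when the text is empty.
    print(repr(value), "->", repr(as_text), "| rejected by the check:", not as_text)

# Collect the values that would trip `if not first or not second`.
rejected = [value for value in candidates if not str(value)]
```

Only the empty string ends up in `rejected`: str(None) and str(float("nan")) produce the non-empty texts 'None' and 'nan', which pass the check and get scored as if they were names.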
Try replacing your None values and empty strings with a placeholder such as 'NA', or filter those rows out of your dataset.
Or use a flag value as the distance for rows that would raise this exception. The example below uses 999:
from pyjarowinkler import distance
df['distance'] = df.apply(lambda d: 999 if not str(d['name_one']) or not str(d['name_two']) else distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)
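If the one-liner gets hard to read, the same idea can be pulled into a helper. This is only a sketch: `is_blank`, `safe_score` and `placeholder_score` are hypothetical names, treating whitespace-only text as blank is my own assumption, and `placeholder_score` uses difflib purely as a stand-in so the snippet runs without pyjarowinkler. It is not a Jaro implementation.

```python
import math
from difflib import SequenceMatcher

FLAG = 999  # same sentinel value as in the one-liner above


def is_blank(value):
    """Return True for None, float NaN, and empty or whitespace-only text."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return not str(value).strip()


def safe_score(first, second, score_fn, flag=FLAG):
    """Return `flag` when either input is blank, otherwise score the two strings."""
    if is_blank(first) or is_blank(second):
        return flag
    return score_fn(str(first), str(second))


def placeholder_score(a, b):
    # Stand-in similarity for demonstration only (NOT Jaro-Winkler).
    return SequenceMatcher(None, a, b).ratio()


pairs = [("martha", "marhta"), ("", "bob"), (None, "bob"),
         (float("nan"), "x"), ("same", "same")]
scores = [safe_score(a, b, placeholder_score) for a, b in pairs]
print(scores)
```

With your dataframe you would pass the real scorer instead of the placeholder, for example df.apply(lambda d: safe_score(d['name_one'], d['name_two'], lambda a, b: distance.get_jaro_distance(a, b, winkler=True, scaling=0.1)), axis=1). Unlike the one-liner, this checks for None and NaN before calling str(), so missing values get the flag rather than being compared as the text 'None' or 'nan'. Other missing-value markers such as pd.NA are not covered by this sketch.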