I want to compute similarity between items (0, 1, 2, 3, ...) based on their temporal information. Temporal information may be a time instant (startdate), a time interval (startdate, enddate), or null (NaT); see an example dataframe (df_for) below.
The following Python code obtains the temporal information from the dataframe and implements the conditions above (1-5). The code is verbose; I wonder if there is a smarter way or library to calculate similarity between time periods and time instants using Python.
```python
import numpy as np
import pandas as pd

m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
    for j in range(m):
        if i != j:
            st1 = data[i][0]
            ed1 = data[i][1]
            st2 = data[j][0]
            ed2 = data[j][1]
            # both items are null values
            if pd.isnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
                sim[i][j] = 0.
            # {instant, instant} => equal, not equal
            if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
                if st1 == st2:
                    sim[i][j] = 1.
                else:
                    sim[i][j] = 0.
            # {instant, null} => sim is 0
            if pd.notnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
                sim[i][j] = 0.
            # {instant, interval} => meets, during
            if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
                if st2 <= st1 <= ed2:
                    sim[i][j] = 1.  # a time is between two other times
                else:
                    sim[i][j] = 0.
            # {interval, instant} => meets, contains
            if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
                if st1 <= st2 <= ed1:
                    sim[i][j] = 1.  # a time is between two other times
                else:
                    sim[i][j] = 0.
            # {interval, interval} => equal, overlaps, not overlaps
            if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
                if (st1 <= st2 <= ed1) or (st2 <= st1 <= ed2):
                    intersect = min(ed1, ed2) - max(st1, st2)  # earliest end - latest start
                    union = max(st1, st2, ed1, ed2) - min(ed1, ed2, st1, st2)
                    overlaps = intersect / union
                    # print(intersect / np.timedelta64(1, 'D'), union / np.timedelta64(1, 'D'))
                    if (st1 > st2 and ed1 < ed2) or (st1 < st2 and ed1 > ed2):  # contains, during
                        overlaps = 1.0
                    sim[i][j] = overlaps
                else:
                    sim[i][j] = 0.
        else:
            sim[i][j] = 1.
```
I can see several ways to simplify the code.

First, I'd suggest moving the comparison code to a separate function that takes `st1`, `ed1`, `st2`, and `ed2` as arguments. (As a side note, why `st` (start time) but `ed` (end date)? Consistent names might be nice.) You'd be able to `return` your result to the calling code, which would be responsible for doing the assignment into the array of results.
Speaking of that calling code... It doesn't need to call the comparison function for every pair of time ranges. The results should always be symmetrical (e.g. `compare(data[i][0], data[i][1], data[j][0], data[j][1])` is going to return the same thing as `compare(data[j][0], data[j][1], data[i][0], data[i][1])`). So just call one of those, and assign the result to both `sim[i][j]` and `sim[j][i]`.
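To make that symmetry concrete, here's a toy sketch (the `compare` function is a hypothetical stand-in, not the real interval comparison) that only iterates the upper triangle and mirrors each result:

```python
import numpy as np

def compare(a, b):
    # stand-in symmetric comparison: 1.0 if equal, else 0.0
    return float(a == b)

vals = [1, 2, 1]
m = len(vals)
sim = np.zeros((m, m))
for i in range(m):
    for j in range(i, m):  # only j >= i: roughly half the comparisons
        sim[i, j] = sim[j, i] = compare(vals[i], vals[j])

print(np.array_equal(sim, sim.T))  # True: symmetric by construction
```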
Rather than lots of `if some_test_expression: return 1; else: return 0` blocks, just return the result of the comparison: `return some_test_expression`. Python's `bool` type is a subclass of `int`, so these should convert to numbers automatically when you assign them into a numpy array (you could also cast to `float` explicitly if you want the function to always return the same type).
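A quick illustration of that bool-to-number behavior:

```python
import numpy as np

sim = np.zeros((2, 2))
sim[0, 1] = (3 <= 5)           # True is stored as 1.0 in a float array
print(sim[0, 1])               # 1.0
print(isinstance(True, int))   # True: bool is a subclass of int
print(float(2 == 3))           # 0.0: explicit cast for a consistent return type
```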
Handle all the cases involving nulls at the top. They're an easy case, and if you deal with them first (and `return`, so the rest of the function doesn't run), you don't need to check as many things for null later. You don't need to handle null/null, null/instant and null/interval separately; just return zero if either start value is null: `if isnull(st1) or isnull(st2): return 0.`
Instant/interval comparisons are symmetrical, so just write the logic one way and flip the parameters around if they're initially in the wrong order: `if isnull(ed2): st1, ed1, st2, ed2 = st2, ed2, st1, ed1`
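A small sketch of that flip (the timestamps here are made-up examples): after the swap, a single copy of the comparison logic covers both argument orders.

```python
import pandas as pd

# interval first, instant second (the "wrong" order for instant-first logic)
st1, ed1 = pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-05")
st2, ed2 = pd.Timestamp("2020-01-03"), pd.NaT

if pd.isnull(ed2):  # flip so the instant always ends up in the first pair
    st1, ed1, st2, ed2 = st2, ed2, st1, ed1

# now (st1, ed1) is the instant and (st2, ed2) the interval
print(st2 <= st1 <= ed2)  # True: the instant falls inside the interval
```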
I don't see too much that can be improved in the specific implementations of your comparisons between instants and intervals or between two intervals (other than the general things mentioned above). However, I would like to point out that your formula for overlapping intervals has a big discontinuity in similarity as a small interval slowly overlaps more with a larger one and then becomes fully contained. For instance, a one-second interval that starts just a millisecond before a one-hour interval will produce a similarity of `0.000277` (`0.999 / (3600 + 0.001)`), but one that starts at the same time as the larger interval will have a similarity of `1.0` (a big difference). A more continuously changing measure of similarity would come from dividing the overlap amount by the size of the smaller interval (rather than the size of the union). This formula doesn't need a special case for one interval fully containing another, since the default calculation will give a similarity of `1.0` already (the overlap will be the size of the smaller interval, so you'll be dividing it by itself).
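The discontinuity is easy to check numerically; this sketch reproduces the one-second/one-hour example with plain `datetime` values:

```python
from datetime import datetime, timedelta

t0 = datetime(2020, 1, 1)
# one-second interval starting a millisecond before a one-hour interval
st1, ed1 = t0, t0 + timedelta(seconds=1)
st2, ed2 = t0 + timedelta(milliseconds=1), t0 + timedelta(milliseconds=1, hours=1)

intersect = min(ed1, ed2) - max(st1, st2)        # 0.999 seconds of overlap
union = max(ed1, ed2) - min(st1, st2)            # 3600.001 seconds
smaller = min(ed1 - st1, ed2 - st2)              # the one-second interval

print(intersect / union)    # ~0.000277, despite near-total overlap of the short interval
print(intersect / smaller)  # 0.999, close to the 1.0 it would jump to once contained
```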
So, putting all that together, here's how I'd rewrite things:
```python
def compare_intervals(st1, et1, st2, et2):  # note I've renamed the ed's to et
    # nulls: zero similarity if either start is missing
    if pd.isnull(st1) or pd.isnull(st2):
        return 0.
    # {instant, instant} => equal, not equal
    if pd.isnull(et1) and pd.isnull(et2):
        return float(st1 == st2)
    # {instant, interval} => flip params (to be handled in next block)
    if pd.isnull(et1):
        st1, et1, st2, et2 = st2, et2, st1, et1
    # {interval, instant} => meets, contains
    if pd.isnull(et2):
        return float(st1 <= st2 <= et1)
    # {interval, interval} => equal, overlaps, not overlaps
    if (st1 <= st2 <= et1) or (st2 <= st1 <= et2):
        intersect = min(et1, et2) - max(st1, st2)
        min_interval = min(et1 - st1, et2 - st2)  # using minimum instead of union
        return intersect / min_interval
    return 0.  # intervals didn't overlap
```
```python
m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
    for j in range(i, m):  # we only iterate on j values >= i
        if i == j:
            sim[i, j] = 1.
        else:
            sim[i, j] = sim[j, i] = compare_intervals(data[i][0], data[i][1],
                                                      data[j][0], data[j][1])
```
A few notes that I couldn't fit into comments in the code:

- No test for the values being intervals is needed at the last step of the `compare_intervals` function, since all other cases (involving instants and nulls) have already been handled.
- I've changed the assignments to the `sim` array to use tuple indexes, as that's more natural than nested indexes for multidimensional numpy arrays (slices don't work if you use nested indexes). You might be able to do the same thing with the `data` lookups, but I've not changed them since I don't know pandas nearly as well as I know numpy.
- Speaking of my lack of pandas knowledge, there is probably a way to apply the comparison function directly to a "join" of the dataframe with itself. I don't know how to do that though. If it exists, it will probably be faster than the nested loops I left mostly unchanged in the calling code (even if it doesn't optimize out the symmetric cases).
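For what it's worth, one way to express that self-join is a cross merge plus `apply` (assuming pandas >= 1.2 for `how="cross"`; the toy `df_for` below is made up). Note `apply` still loops at the Python level, so this is a readability sketch rather than a guaranteed speedup. The `compare_intervals` body is repeated so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

def compare_intervals(st1, et1, st2, et2):
    # same logic as the function above, repeated so this sketch is standalone
    if pd.isnull(st1) or pd.isnull(st2):
        return 0.
    if pd.isnull(et1) and pd.isnull(et2):
        return float(st1 == st2)
    if pd.isnull(et1):
        st1, et1, st2, et2 = st2, et2, st1, et1
    if pd.isnull(et2):
        return float(st1 <= st2 <= et1)
    if (st1 <= st2 <= et1) or (st2 <= st1 <= et2):
        intersect = min(et1, et2) - max(st1, st2)
        return intersect / min(et1 - st1, et2 - st2)
    return 0.

# toy data (an assumption; the real df_for has startdate/enddate columns)
df_for = pd.DataFrame({
    "startdate": [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-03"), pd.NaT],
    "enddate":   [pd.Timestamp("2020-01-05"), pd.NaT, pd.NaT],
})

# every pair of rows: interval 0, instant 1, null 2 against each other
pairs = df_for.merge(df_for, how="cross", suffixes=("_a", "_b"))
sims = pairs.apply(lambda r: compare_intervals(r["startdate_a"], r["enddate_a"],
                                               r["startdate_b"], r["enddate_b"]),
                   axis=1)
m = len(df_for)
sim = sims.to_numpy().reshape(m, m)   # cross merge is left-major, so this is sim[i][j]
np.fill_diagonal(sim, 1.)             # self-similarity, as in the loop version
print(sim)                            # instant 1 lies inside interval 0; item 2 is null
```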