pythonpython-2.7dictionaryintervalsoverlapping-matches

Matching overlapping Intervals (Dictionary) in Python


So, I was trying to solve some random problem via python and got stuck on the logic. Here's the problem : I have multiple videos along with their run time or running length. Now, I want to maintain 2 lists. 1 list if "synced" and the other one is "non synced". We decide "synced", if the difference of run time of the streams is less than or equal to 2 seconds. Otherwise, they're not synced. If we have multiple streams that match, then we take the streams with the highest match count/number.

I was able to come up with a very simple/slow method to part and pair these files. However, my logic failed when I got a different data set.

This is what I have written :

from datetime import datetime


# same_streams_old = {
  "Stream_1": "0:24:08.925167",
  "Stream_2": "0:24:08.990644",
  "Stream_3": "0:24:08.990644",
  "Stream_4": "0:24:12.118778",
  "Stream_5": "0:24:12.118778",
  "stream_6": "0:24:10.075066"
}
same_streams = {
  "Stream_1": "0:24:08.925167",
  "Stream_2": "0:24:12.118778",
  "Stream_3": "0:23:11.057711",
  "Stream_4": "0:24:12.118778",
  "Stream_5": "0:24:10.075066",
  "Stream_6": "0:24:08.990644"
}

keys = []
values = []
final_synced_video_files = []
final_non_synced_video_files = []

def get_time_diff(episode_run_time, episode_time):
    prev_episode_time = datetime.strptime(episode_run_time, '%H:%M:%S.%f')
    current_episode_time = datetime.strptime(episode_time, '%H:%M:%S.%f')

    time_diff = prev_episode_time - current_episode_time
    if current_episode_time > prev_episode_time:
        time_diff = current_episode_time - prev_episode_time

    return float(time_diff.seconds)

for key, value in same_streams.items():
    keys.append(key)
    values.append(value)
for key in keys:
    for _key in keys:
        if key != _key:
            diff = get_time_diff(same_streams[key], same_streams[_key])
            if diff <= 1.5:
                final_synced_video_files.append(key)
            else:
                pass

final_synced_video_files = list(set(final_synced_video_files))
final_non_synced_video_files = list(set(keys) - set(final_synced_video_files))

print("Synced Files : {0}".format(final_synced_video_files))
print("Non Synced Files : {0}".format(final_non_synced_video_files))

REPL Link

As you can see that the streams with most matches are stream_1, stream_2, stream_3 and stream_6.

What I've written doesn't compare the maximum counts yet. However, as I work on this, I feel like this is not really effective and a good way to solve this. Any inputs anyone?

I tried some approach on overlapping intervals and then got this : REPL LINK

But, if you run see both the same_streams dictionary, you'll see results are not what I'm trying to achieve. Any help with this would be great.

EDIT: I need to get the streams that have difference of 2 seconds with each other. For example : In same_streams_old, desired result would be streams 1,2,3 & 6. However, in the dictionary same_streams, desired result is streams 2,4 & 5.

Basically, I need to see which streams can be "muxed" together and which ones can't be muxed.


Solution

  • Ok, so the print routines below might be a little messy, but bear with me, they are only for debugging and you might not need them when done. I know this is a long answer, but please read it carefully ..

    SHORT ANSWER:

    Your problem might come from the fact that you lose the decimal precision when you convert from string to datetime, which treats seconds as integers. However, the timedelta object has a method named total_seconds() which does provide sub-second resolution. See this or the general docs for details. Simply change the return statement of get_time_diff() to,

     return float(time_diff.total_seconds())
    

    MOTIVATION FOR LONG ANSWER:

    I'm not sure what you're trying to achieve with the (non- and) synced list(s): You may have a situation that stream a is synced with stream b, while stream c is synced with d, but c and d are not synced with a and b. Should they all be in your synced_list? Depending on what you want to do with the lists, I would consider using the synced matrix described below, instead of your list(s) as they lose much information.

    LONG ANSWER:

    Let me introduce the concept of a Synced Matrix. It will give a complete description of which of your streams are synced with each other:

    THE SYNC MATRIX: A symmetric matrix; Cell (i,j) in the matrix is TRUE if, and only if, stream 'i' and 'j' are in sync. Else, the cell value is FALSE. Hence, the diagonal (.) is entirely TRUE because a stream is always in sync with itself.

         1 2 3 4
        ________
     1 | . T T F
     2 |   . T F
     3 |     . F
     4 |       .
    

    "T" is true, and "F" is false: obviously from the example drawing above, stream 1 is in sync with stream 2, but not in sync with stream 4.

    The creation of such a Synced Matrix is quite simple for your example:

    def is_synced(key_1, key_2):    
        max_allowed_desync = 1.5
        return max_allowed_desync > get_time_diff(same_streams[key_1], same_streams[key_2])
    
    keys = same_streams.keys()
    keys.sort() # VERY IMPORTANT, for the synced matrix to be constructed correctly; also make 's' uppercase for "stream_6" in OP.
    
    # The complete matrix ..
    full_sync_matrix = [[is_synced(k1,k2) for k2 in keys] for k1 in keys]
    
    # We can optimize (memory usage) to only get the half matrix, since it's symmetric anyway; also excluding the diagonal.
    half_sync_matrix = [[is_synced(k1,k2) for k2 in keys[curr+1:]] for curr,k1 in enumerate(keys)]
    

    Now, let's implement two functions for printing/displaying the synced matrix:

    # Print a HALFED sync matrix
    def print_sync_half_matrix(sm):
        string = ""
        for i,row in enumerate(sm):
            string += "\n" + " "*i*2
            for col in row:
                string += " " + ("T" if col else "F")
        print(string)
    
    # Print a COMPLETE sync_matrix
    def print_sync_full_matrix(sm):
        string = ""
        for row in sm:
            string += "\n"
            for col in row:
                string += " " + ("T" if col else "F")
        print(string)
    

    Then, for the data set you provided I get:

    same_streams = {
      "Stream_1": "0:24:08.925167",
      "Stream_2": "0:24:08.990644",
      "Stream_3": "0:24:08.990644",
      "Stream_4": "0:24:12.118778",
      "Stream_5": "0:24:12.118778",
      "Stream_6": "0:24:10.075066"
    } # note that "Stream_6" previously had a lower case 's'!
    
    print_sync_half_matrix(half_sync_matrix)
    #   1 2 3 4 5 6
    # 1   T T F F T
    # 2     T F F T
    # 3       F F T
    # 4         T F
    # 5           F
    

    Remember that the diagonal is not included in the matrix/print! The result here is correct and as expected from the input. Let us print out some time differences to get more intelligence,

    for stream_key in same_stream:
        print("Stream_1 ~ "+stream_key+": "+str(get_time_diff(same_streams["Stream_1"], same_streams[stream_key])))
    

    .. which quickly reveals that your time stamps has lost its decimal precisions:

    Stream_1 ~ Stream_5: 3.0
    Stream_1 ~ Stream_4: 3.0
    # ...
    

    If we look into the documentation of datetime we see that it treats time as an integer of seconds. Hence, the microseconds precision is lost when you request seconds from the datetime object in the get_time_diff function. Simply solved by requesting seconds from the deltatime method .total_seconds() instead..