python comparison visualization upsetplot

Obtaining a HashMap or dictionary and a diagram in Python to visualize the overlaps between multiple lists


Context: I have a dictionary of roughly 130 entries, each mapping a key to a list of indexes.

{'key1':[0,1,2], 'key2': [2, 3, 4], 'key3':[5, 6],…, 'key130':[0, 450, 1103, 500,…]}

Lists are all different sizes.

This is a two-part problem where:

  1. I want some form of data structure to store the number of overlaps between lists

  2. If possible, I want a diagram that shows the overlap

PART 1:

The answers to the most similar StackOverflow questions suggest finding the common elements of lists with set.intersection:

List1 = [10,10,11,12,15,16,18,19]

List2 = [10,11,13,15,16,19,20]

List3 = [10,11,11,12,15,19,21,23]

print(set(List1).intersection(List2)) # compare list 1 and list 2

Which gives you:

set([10, 11, 15, 16, 19])

I could then use a for loop to compare each list with the next list in the dictionary and take the length of the resulting intersection. This would give me a dictionary such as:

{'key1_key2':1, 'key2_key3':0, 'key3_key4'…, 'key130_key1': 29}
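
The pairwise bookkeeping described above could be sketched roughly as follows. This is only an illustration: the helper name `pairwise_overlaps` and the `keyA_keyB` naming are assumptions, and it compares every unordered pair of keys rather than only neighbouring ones.

```python
from itertools import combinations


def pairwise_overlaps(d):
    """Return {'keyA_keyB': n} for every unordered pair of keys in d.

    n is the number of distinct values the two lists share.
    (Hypothetical helper, not part of any library.)
    """
    # Build each set once so duplicates inside a list are ignored
    sets = {k: set(v) for k, v in d.items()}
    return {
        f"{a}_{b}": len(sets[a] & sets[b])
        for a, b in combinations(sets, 2)
    }


d = {
    'key1': [10, 10, 11, 12, 15, 16, 18, 19],
    'key2': [10, 11, 13, 15, 16, 19, 20],
    'key3': [10, 11, 11, 12, 15, 19, 21, 23],
    'key4': [],
    'key5': [0],
    'key6': [10, 55, 66, 77],
}
overlaps = pairwise_overlaps(d)
print(overlaps['key1_key2'])
```

With n keys this produces n*(n-1)/2 entries, so about 8,400 for 130 lists, which is still small enough to keep in a plain dictionary.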

PART 2:

I think a comparison table would be the best way to visualize the similarities:


          Key1    Key2    Key3    …       Key130
Key1      X       X       X               X
Key2      0       X       X               X
Key3      4       6       X               X
…                                 X       …
Key130                                    X

However, I couldn’t find many results on how this can be achieved.

Another option is UpSetPlot, which gives a nice comparison, though perhaps a more detailed one than this case needs: https://upsetplot.readthedocs.io/en/stable/

I assume the two diagrams would need the overlap results stored differently. I’m not sure what the comparison table needs, but UpSetPlot would need the dictionary (?) converted to a pandas Series. I would like to try both diagrams to see how they look.
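
As a rough sketch of what such a Series might look like: the snippet below builds a boolean membership table (one row per distinct value, one column per key) and counts how many values fall into each exact combination of keys. It assumes that a Series indexed by boolean membership indicators is the shape UpSetPlot expects; the package also appears to offer a `from_contents` helper for building this from a dict, so check the linked docs before relying on either.

```python
import pandas as pd

d = {
    'key1': [10, 10, 11, 12, 15, 16, 18, 19],
    'key2': [10, 11, 13, 15, 16, 19, 20],
    'key3': [10, 11, 11, 12, 15, 19, 21, 23],
    'key4': [],
    'key5': [0],
    'key6': [10, 55, 66, 77],
}

# Build each set once; duplicates inside a list are ignored
sets = {k: set(v) for k, v in d.items()}

# Every distinct value that occurs in at least one list
values = sorted(set().union(*sets.values()))

# One row per value, one boolean column per key
membership = pd.DataFrame(
    {k: [v in s for v in values] for k, s in sets.items()},
    index=values,
)

# Number of values for each observed combination of memberships.
# Assumption: this boolean-MultiIndex Series is the input UpSetPlot wants.
counts = membership.groupby(list(membership.columns)).size()
print(counts)
```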

Reproducible Example:

{'key1': [10,10,11,12,15,16,18,19], 'key2': [10,11,13,15,16,19,20], 'key3':[10,11,11,12,15,19,21,23], 'key4':[], 'key5':[0], 'key6':[10,55,66,77]}

Some of the more useful resources I looked at:
  • How to compare more than 2 Lists in Python?
  • Python -Intersection of multiple lists?
  • Python comparing multiple lists into Comparison Table

If there are some other sites that I missed that would be applicable to this Q, please let me know. Thank you in advance!


Solution

  • import numpy as np
    import pandas as pd
    
    d = {'key1':[0,1,2], 'key2': [2, 3, 4], 'key3':[5, 6]}
    keys = list(d)
    # Convert each list to a set once so duplicates are ignored
    sets = [set(v) for v in d.values()]
    
    # Size of the intersection for every ordered pair of lists (n * n values)
    sizes = [len(x & y) for x in sets for y in sets]
    
    # Reshape the flat list of sizes into an n x n matrix
    matrix = np.array(sizes).reshape(len(keys), len(keys))
    
    df = pd.DataFrame(matrix, index=keys, columns=keys)
    
    print(df)
    

    Output:

          key1  key2  key3
    key1     3     1     0
    key2     1     3     0
    key3     0     0     2
    

    This is definitely not the best-performing solution, since it intersects every ordered pair of lists, but it is easy to implement.
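
To get closer to the triangular table sketched in the question, one possible follow-up is to mask the diagonal and everything above it. This is a minimal sketch under the same toy data; the variable names `below` and `table` are illustrative only.

```python
import numpy as np
import pandas as pd

d = {'key1': [0, 1, 2], 'key2': [2, 3, 4], 'key3': [5, 6]}
keys = list(d)
sets = [set(v) for v in d.values()]

# n x n matrix of intersection sizes
counts = np.array([[len(x & y) for y in sets] for x in sets])
df = pd.DataFrame(counts, index=keys, columns=keys)

# True only for cells strictly below the diagonal
below = np.tril(np.ones(df.shape, dtype=bool), k=-1)

# Keep the counts below the diagonal and mark every other cell with 'X'
table = df.astype(object).where(below, 'X')
print(table)
```

Because the matrix is symmetric, the lower triangle carries all the pairwise information, and the diagonal only repeats the size of each list's own set.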