python pandas minhash

How to get the Intersection and Union of two Series in Pandas with non-unique values?


If I have two Series objects, like [0,0,1] and [1,0,0], how would I get the intersection and union of the two? They only contain booleans, which means the values are non-unique.

I have a large Boolean matrix. I've minhashed it, and now I'm trying to find the false positives and false negatives, which I think means I have to compute the Jaccard similarity for each original pair.


Solution

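  • Since you say they are booleans, use NumPy's logical_and and logical_or, or the & and | operators on the Series. The Jaccard similarity is then the count of positions set in the intersection divided by the count set in the union, i.e.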

    import numpy as np
    import pandas as pd

    y1 = pd.Series([1,0,1,0])
    y2 = pd.Series([1,0,0,1])
    
    # NumPy approach: element-wise AND / OR on the underlying arrays
    intersection = np.logical_and(y1.values, y2.values)
    union = np.logical_or(y1.values, y2.values)
    intersection.sum() / union.sum()
    # 0.3333333333333333
    
    # Pandas approach: & and | work element-wise on 0/1 or boolean Series
    (y1 & y2).sum() / (y1 | y2).sum()
    # 0.3333333333333333
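
  • For the large Boolean matrix mentioned in the question, looping over every pair of Series may be slow. A rough sketch of one vectorized alternative follows, assuming each row of the matrix is one of the sets being compared and that the full n-by-n result fits in memory. The function name pairwise_jaccard and the choice to return NaN for two all-zero rows are assumptions made for illustration, not something from the question.

```python
import numpy as np

def pairwise_jaccard(matrix):
    """Jaccard similarity between every pair of rows of a 0/1 matrix."""
    # Normalize to 0/1 integers so the matrix product counts shared 1s.
    X = np.asarray(matrix).astype(bool).astype(np.int64)
    intersection = X @ X.T                     # |A and B| for each row pair
    row_sums = X.sum(axis=1)                   # |A| for each row
    # |A or B| = |A| + |B| - |A and B|
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Two all-zero rows have an empty union; NaN is returned there (an assumption).
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, intersection / union, np.nan)

sims = pairwise_jaccard([[1, 0, 1, 0],
                         [1, 0, 0, 1],
                         [1, 0, 1, 0]])
print(sims)
```

    Here sims[i, j] holds the similarity of rows i and j, so the first two rows give the same 1/3 as the Series example above. If the columns are the sets instead, pass the transposed matrix.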