pythonarrayssetset-intersection

Demonstration of sets being faster than arrays?


The goal is to find the intersection of 2 arrays. I know you're supposed to use sets to get a faster runtime. However, in my notebook I was surprised to find that my runtime is not faster with the sets method. (89 ms (array) vs. 220 ms (sets)). Is there something wrong with my code or something I'm misunderstanding?

A = range(50,1000000)
B = range(-1000000,60)

%%time
# O(m*n)?
def intersection(A, B):
  array = []
  for i in A:
    if i in B:
      array.append(i)
  return array

intersection(A,B)

%%time
# O(1)?
def intersection(A, B):
  set_a = set(A)
  set_b = set(B)
  return [i for i in set_a if i in set_b]
  # return set_a.intersection(set_b)

intersection(A,B)

Solution

  • You are not comparing arrays to sets.

    You are comparing ranges to sets.

    Checking the membership in a range is almost instantaneous (two comparisons: one with the lower, one with the upper bound).

    Checking the membership in an array would take millions of iterations in this case, which would be substantially longer than membership checks for sets.

    If you convert the ranges to lists first, you see a huge difference:

    import time
    
    A = range(50,100000)
    B = range(-100000,60)
    
    def intersectionArrays(A, B):
      array = []
      arr_a = list(A)
      arr_b = list(B)
      for i in arr_a:
        if i in arr_b:
          array.append(i)
      return array
    
    
    def intersectionSets(A, B):
      set_a = set(A)
      set_b = set(B)
      return [i for i in set_a if i in set_b]
      # return set_a.intersection(set_b)
    
    
    
    startArr = time.process_time()
    intersectionArrays(A,B)
    endArr = time.process_time()
    
    startSet = time.process_time()
    intersectionSets(A,B)
    endSet = time.process_time()
    
    print(endArr - startArr)
    print(endSet - startSet)
    

    It prints something like

    Arrays: 39.871636528
    Sets:   0.008161553000000765
    

    There is no point trying it with 1000000 - it would take hours.

    Furthermore, in order to measure something vaguely realistic, you should shuffle the numbers, otherwise branch prediction might distort the results.