rubyset-difference

How to do sane "set-difference" in Ruby?


Demo (I expect result [3]):

[1,2] - [1,2,3] => []    # Hmm
[1,2,3] - [1,2] => [3]   # I see

a = [1,2].to_set   => #<Set: {1, 2}>
b = [1,2,3].to_set => #<Set: {1, 2, 3}>
a - b              => #<Set: {}>  WTF!

And:

[1,2,9] - [1,2,3] => [9]  # Hmm. Would like [[9],[3]]

How is one to perform a real set difference regardless of order of the inputs?

Ps. As an aside, I need to do this for two 2000-element arrays. Usually, array #1 will have fewer elements than array #2, but this is not guaranteed.


Solution

  • The - operator applied to two arrays a and b gives the relative complement of b in a (items that are in a but not in b).

    What you are looking for is the symmetric difference of two sets (the union of both relative complements between the two). This will do the trick:

    a = [1, 2, 9]
    b = [1, 2, 3]
    a - b | b - a          # => [3, 9]
    

    If you are operating on Set objects, you may use the overloaded ^ operator:

    c = Set[1, 2, 9]
    d = Set[1, 2, 3]
    c ^ d                  # => #<Set: {3, 9}>
    

    For extra fun, you could also find the relative complement of the intersection in the union of the two sets:

    ( a | b ) - ( a & b )  # => #<Set: {3, 9}>
    

    Benchmarks

    Setup

    WIDTH_OF_LABELS = 10.freeze
    REPLICATION     = 10000.freeze
    SIZE_OF_ARRAY   = 2000.freeze
    
    a = (1..SIZE_OF_ARRAY).to_a
    b = (SIZE_OF_ARRAY/2..SIZE_OF_ARRAY*1.5).to_a # Ensures half overlapping.
    
    a_set = Set.new( a )
    b_set = Set.new( b )
    
    Benchmark.bm( WIDTH_OF_LABELS ) do |benchmark|
      benchmark.report( "a - b | b - a" ) do
        REPLICATION.times do
          a - b | b - a   
        end
      end
    
      benchmark.report( "Converting to Sets" ) do
        REPLICATION.times do
          c = Set.new( a )
          d = Set.new( b )
          c ^ d  
        end
      end
    
      benchmark.report( "Already Sets" ) do
        REPLICATION.times do
          a_set ^ b_set
        end
      end
    
      benchmark.report( "( a | b ) - ( a & b )" ) do
        REPLICATION.times do
          ( a | b ) - ( a & b )
        end
      end
    end
    

    Results

                               user     system      total        real
    a - b | b - a          2.289062   0.009412   2.298474 (  2.299243)
    Converting to Sets     7.185954   0.052709   7.238663 (  7.247614)
    Already Sets           3.372465   0.060009   3.432474 (  3.433157)
    ( a | b ) - ( a & b )  2.728917   0.035183   2.764100 (  2.764778)
    

    Fastest approach is a - b | b - a.

    You can see I ran one where the Sets were already created versus one where we needed to convert the Arrays into Sets first.

    When you already are working with Sets, it is more than 2x faster than converting to Sets but it is still slower than just using a - b | b - a.