listgroovytime-complexityset-difference

Why is the Groovy method minus() so slow with lists of numbers?


Let's say I have two lists left and right of size n, each with unique integers, and I want to find the intersection and differences. This should be able to be done in O(n) or O(nlog(n)). This is how I instantiate the lists:

Integer n = 10_000
Integer overlap = (Integer) (0.1 * n)
List left = (1..n).toList().shuffled()
List right = (n-overlap+1..2*n-overlap).toList().shuffled()

One way to get the intersection and differences would be for example:

Set setLeft = left

Set onlyInRight = right.findAll { !(it in setLeft) }
Set inBoth = right.findAll { !(it in onlyInRight) }
Set onlyInLeft = left.findAll { !(it in inBoth) }

This works quite well, but I find the following more readable:

List inBoth = left.intersect(right)
List onlyInLeft = left - inBoth
List onlyInRight = right - inBoth

Interestingly, when I played around a bit the latter apporach was pretty slow and felt more like O(n2). So I took a peek at the code base of the minus method:

public static <T> Collection<T> minus(Iterable<T> self, Iterable<?> removeMe, Comparator<? super T> comparator) {
        Collection<T> self1 = asCollection(self);
        Collection<?> removeMe1 = asCollection(removeMe);
        Collection<T> ansCollection = createSimilarCollection(self1);
        if (self1.isEmpty())
            return ansCollection;
        T head = self1.iterator().next();

        boolean nlgnSort = sameType(new Collection[]{self1, removeMe1});

        // We can't use the same tactic as for intersection
        // since AbstractCollection only does a remove on the first
        // element it encounters.

        if (nlgnSort && (head instanceof Comparable)) {
            //n*LOG(n) version
            Set<T> answer;
            if (head instanceof Number) {
                answer = new TreeSet<>(comparator);
                answer.addAll(self1);
// ------------ Beginning of relevant part --
                for (T t : self1) {                     // A) -- loop over self1
                    if (t instanceof Number) {
                        for (Object t2 : removeMe1) {   // B) -- inner loop over removeMe1
                            if (t2 instanceof Number) {
                                if (comparator.compare(t, (T) t2) == 0)
                                    answer.remove(t);
                            }
                        }
                    } else {
                        if (removeMe1.contains(t))
                            answer.remove(t);
                    }
                }
// ------------ End of relevant part --
            } else {
                answer = new TreeSet<>(comparator);
                answer.addAll(self1);
                answer.removeAll(removeMe1);
            }

            for (T o : self1) {
                if (answer.contains(o))
                    ansCollection.add(o);
            }
        } else {
            //n*n version
            // ...

The relevant part is between the comments (from me)

Is it just me or is this O(n2) for instance of Number? Maybe there is a reason for this, but instead, I would have expected something like

// ------------ Beginning of relevant part --
                for (Object t : removeMe1) {
                    answer.remove(t);
                }
// ------------ End of relevant part --

So if I am not mistaken minus() works faster if I wrap the numbers in a custom value class implementing Comparable. Has anybody an idea what I am missing or why it was implemented like this?


Solution

  • Paul King answered instantly in this ticket. Short version in my own words:

    If you have different types of numbers in a list, Groovy's minus() is clever and tries to remove all objects of the same mathmatical value regardless of type. Equality checks are needed, hence O(n2 log(n)).

    Example:

    assert [1, 2, 2.0f, 2.0G] - [2.0d, 1.0G] == []
    

    If you do not want this behaviour you can use the Java streams solution from @dagget or the example using Sets in the question.