Let's say I have two lists left
and right
of size n, each with unique integers, and I want to find the intersection and differences. This should be able to be done in O(n) or O(nlog(n)). This is how I instantiate the lists:
Integer n = 10_000
Integer overlap = (Integer) (0.1 * n)
List left = (1..n).toList().shuffled()
List right = (n-overlap+1..2*n-overlap).toList().shuffled()
One way to get the intersection and differences would be for example:
Set setLeft = left
Set onlyInRight = right.findAll { !(it in setLeft) }
Set inBoth = right.findAll { !(it in onlyInRight) }
Set onlyInLeft = left.findAll { !(it in inBoth) }
This works quite well, but I find the following more readable:
List inBoth = left.intersect(right)
List onlyInLeft = left - inBoth
List onlyInRight = right - inBoth
Interestingly, when I played around a bit the latter apporach was pretty slow and felt more like O(n2). So I took a peek at the code base of the minus method:
public static <T> Collection<T> minus(Iterable<T> self, Iterable<?> removeMe, Comparator<? super T> comparator) {
Collection<T> self1 = asCollection(self);
Collection<?> removeMe1 = asCollection(removeMe);
Collection<T> ansCollection = createSimilarCollection(self1);
if (self1.isEmpty())
return ansCollection;
T head = self1.iterator().next();
boolean nlgnSort = sameType(new Collection[]{self1, removeMe1});
// We can't use the same tactic as for intersection
// since AbstractCollection only does a remove on the first
// element it encounters.
if (nlgnSort && (head instanceof Comparable)) {
//n*LOG(n) version
Set<T> answer;
if (head instanceof Number) {
answer = new TreeSet<>(comparator);
answer.addAll(self1);
// ------------ Beginning of relevant part --
for (T t : self1) { // A) -- loop over self1
if (t instanceof Number) {
for (Object t2 : removeMe1) { // B) -- inner loop over removeMe1
if (t2 instanceof Number) {
if (comparator.compare(t, (T) t2) == 0)
answer.remove(t);
}
}
} else {
if (removeMe1.contains(t))
answer.remove(t);
}
}
// ------------ End of relevant part --
} else {
answer = new TreeSet<>(comparator);
answer.addAll(self1);
answer.removeAll(removeMe1);
}
for (T o : self1) {
if (answer.contains(o))
ansCollection.add(o);
}
} else {
//n*n version
// ...
The relevant part is between the comments (from me)
// -- Beginning of relevant part --
and// -- End of relevant part --
.Is it just me or is this O(n2) for instance of Number
? Maybe there is a reason for this, but instead, I would have expected something like
// ------------ Beginning of relevant part --
for (Object t : removeMe1) {
answer.remove(t);
}
// ------------ End of relevant part --
So if I am not mistaken minus()
works faster if I wrap the numbers in a custom value class implementing Comparable
. Has anybody an idea what I am missing or why it was implemented like this?
Paul King answered instantly in this ticket. Short version in my own words:
If you have different types of numbers in a list, Groovy's minus()
is clever and tries to remove all objects of the same mathmatical value regardless of type. Equality checks are needed, hence O(n2 log(n)).
Example:
assert [1, 2, 2.0f, 2.0G] - [2.0d, 1.0G] == []
If you do not want this behaviour you can use the Java streams solution from @dagget or the example using Sets in the question.