I have a performance issue trying to compare two big Collections and I'm looking for some help to find a better way to do that.
The classes:
public class TypeOne {
private int id;
}
and
public class TypeTwo {
private int id;
}
The code:
Collection<TypeOne> oneColl = someMethodToPopulateThat();
Collection<TypeTwo> twoColl = anotherMethodToPopulateThat();
// Iterating both collections to match the elements by id
for(TypeOne one : oneColl) {
for(TypeTwo two : twoColl) {
if (one.getId().equals(two.getId()))
System.out.println(one.getId());
}
}
I already tried to use some functions of Stream API without success.
Does anyone have any idea to solve this issue?
ones.stream().forEach(
one -> System.out.println(
twos.stream().filter( two -> two.getId() == one.getId() ).findAny().toString()
)
)
I assume sorting as a NavigableSet
will improve performance of our searching, though I’ve not verified this attempt at optimization works.
NavigableSet < TypeOne > ones = new TreeSet <>( Comparator.comparingInt( TypeOne :: getId ) );
ones.addAll( collectionOfOnes ) ;
NavigableSet < TypeTwo > twos = new TreeSet <>( Comparator.comparingInt( TypeTwo :: getId ) );
twos.addAll( collectionOfTwos ) ;
Loop one navigable set while searching for a match in the other.
for( TypeOne one : ones )
{
Optional<TypeTwo> optionalTwo = twos.stream().filter( two -> two.getId() == one.getId() ).findAny() ;
// handle your Optional which may or may not contain an object.
}
stream
vs parallelStream
Here is a full code example.
record TypeOne( int id ) { }
record TypeTwo( int id ) { }
final int limit = 50_000;
SequencedCollection < TypeOne > sourceOfOnes = IntStream.rangeClosed ( 1 , limit ).mapToObj ( TypeOne :: new ).toList ( );
SequencedCollection < TypeTwo > sourceOfTwos = IntStream.rangeClosed ( 1 , limit ).mapToObj ( TypeTwo :: new ).toList ( );
NavigableSet < TypeOne > ones = new TreeSet <> ( Comparator.comparingInt ( TypeOne :: id ) );
ones.addAll ( sourceOfOnes );
NavigableSet < TypeTwo > twos = new TreeSet <> ( Comparator.comparingInt ( TypeTwo :: id ) );
twos.addAll ( sourceOfTwos );
long start = System.nanoTime ( );
for ( TypeOne one : ones )
{
Optional < TypeTwo > optionalTwo = twos.parallelStream ( ).filter ( two -> two.id ( ) == one.id ( ) ).findAny ( );
if ( optionalTwo.isEmpty ( ) )
{
System.out.println ( "FAILED to find " + one );
}
}
Duration elapsed = Duration.ofNanos ( System.nanoTime ( ) - start );
System.out.println ( "elapsed = " + elapsed );
No fancy testing, I just executed this around five times on a MacBook Pro (18,1) with Apple M1 Pro chip, 10 cores (8 performance and 2 efficiency), 16 GB RAM, macOS Sonoma 14.7.2, Java 23.0.1, IntelliJ IDEA 2024.3.1 (Ultimate Edition), relatively quiet (little other CPU use). I ran two batches of tests, each batch runs the stream in one thread versus in parallel:
twos.stream ( )
twos.parallelStream ( )
Results:
stream |
parallelStream |
---|---|
≈ 7.5 seconds | ≈ 4 seconds |