java performance collections java-stream

Compare two Collections populated with different kinds of objects

I have a performance issue trying to compare two big Collections and I'm looking for some help to find a better way to do that.

The classes:

public class TypeOne {
   private int id;
}

and

public class TypeTwo {
   private int id;
}

The code:

Collection<TypeOne> oneColl = someMethodToPopulateThat();
Collection<TypeTwo> twoColl = anotherMethodToPopulateThat();

// Iterating both collections to match the elements by id
for(TypeOne one : oneColl) {
   for(TypeTwo two : twoColl) {
      if (one.getId().equals(two.getId())) 
         System.out.println(one.getId());
   }
}

I already tried to use some functions of Stream API without success.

Does anyone have any idea to solve this issue?

Solution

tl;dr

ones.stream().forEach(
    one -> System.out.println(
        twos.stream().filter( two -> two.getId() == one.getId() ).findAny().toString()
    )
)

Details

I assume sorting as a NavigableSet will improve performance of our searching, though I’ve not verified this attempt at optimization works.

NavigableSet < TypeOne > ones = new TreeSet <>( Comparator.comparingInt( TypeOne :: getId ) );
ones.addAll( collectionOfOnes ) ;

NavigableSet < TypeTwo > twos = new TreeSet <>( Comparator.comparingInt( TypeTwo :: getId ) );
twos.addAll( collectionOfTwos ) ;

Loop one navigable set while searching for a match in the other.

for( TypeOne one : ones )
{
    Optional<TypeTwo> optionalTwo = twos.stream().filter( two -> two.getId() == one.getId() ).findAny() ;
    // handle your Optional which may or may not contain an object. 
}

`stream` vs `parallelStream`

Here is a full code example.

record TypeOne( int id ) { }
record TypeTwo( int id ) { }

final int limit = 50_000;
SequencedCollection < TypeOne > sourceOfOnes = IntStream.rangeClosed ( 1 , limit ).mapToObj ( TypeOne :: new ).toList ( );
SequencedCollection < TypeTwo > sourceOfTwos = IntStream.rangeClosed ( 1 , limit ).mapToObj ( TypeTwo :: new ).toList ( );

NavigableSet < TypeOne > ones = new TreeSet <> ( Comparator.comparingInt ( TypeOne :: id ) );
ones.addAll ( sourceOfOnes );

NavigableSet < TypeTwo > twos = new TreeSet <> ( Comparator.comparingInt ( TypeTwo :: id ) );
twos.addAll ( sourceOfTwos );

long start = System.nanoTime ( );
for ( TypeOne one : ones )
{
    Optional < TypeTwo > optionalTwo = twos.parallelStream ( ).filter ( two -> two.id ( ) == one.id ( ) ).findAny ( );
    if ( optionalTwo.isEmpty ( ) )
    {
        System.out.println ( "FAILED to find " + one );
    }
}
Duration elapsed = Duration.ofNanos ( System.nanoTime ( ) - start );
System.out.println ( "elapsed = " + elapsed );

No fancy testing, I just executed this around five times on a MacBook Pro (18,1) with Apple M1 Pro chip, 10 cores (8 performance and 2 efficiency), 16 GB RAM, macOS Sonoma 14.7.2, Java 23.0.1, IntelliJ IDEA 2024.3.1 (Ultimate Edition), relatively quiet (little other CPU use). I ran two batches of tests, each batch runs the stream in one thread versus in parallel:

twos.stream ( )
twos.parallelStream ( )

Results:

`stream`	`parallelStream`
≈ 7.5 seconds	≈ 4 seconds

Compare two Collections populated with different kinds of objects

tl;dr

Details

stream vs parallelStream

`stream` vs `parallelStream`