javaperformancecollectionsjava-stream

Compare two Collections populated with different kinds of objects


I have a performance issue trying to compare two big Collections and I'm looking for some help to find a better way to do that.

The classes:

public class TypeOne {
   private int id;
}

and

public class TypeTwo {
   private int id;
}

The code:

Collection<TypeOne> oneColl = someMethodToPopulateThat();
Collection<TypeTwo> twoColl = anotherMethodToPopulateThat();

// Iterating both collections to match the elements by id
for(TypeOne one : oneColl) {
   for(TypeTwo two : twoColl) {
      if (one.getId().equals(two.getId())) 
         System.out.println(one.getId());
   }
}

I already tried to use some functions of Stream API without success.

Does anyone have any idea to solve this issue?


Solution

  • tl;dr

    ones.stream().forEach(
        one -> System.out.println(
            twos.stream().filter( two -> two.getId() == one.getId() ).findAny().toString()
        )
    )
    

    Details

    I assume sorting as a NavigableSet will improve performance of our searching, though I’ve not verified this attempt at optimization works.

    NavigableSet < TypeOne > ones = new TreeSet <>( Comparator.comparingInt( TypeOne :: getId ) );
    ones.addAll( collectionOfOnes ) ;
    
    NavigableSet < TypeTwo > twos = new TreeSet <>( Comparator.comparingInt( TypeTwo :: getId ) );
    twos.addAll( collectionOfTwos ) ;
    

    Loop one navigable set while searching for a match in the other.

    for( TypeOne one : ones )
    {
        Optional<TypeTwo> optionalTwo = twos.stream().filter( two -> two.getId() == one.getId() ).findAny() ;
        // handle your Optional which may or may not contain an object. 
    }
    

    stream vs parallelStream

    Here is a full code example.

    record TypeOne( int id ) { }
    record TypeTwo( int id ) { }
    
    final int limit = 50_000;
    SequencedCollection < TypeOne > sourceOfOnes = IntStream.rangeClosed ( 1 , limit ).mapToObj ( TypeOne :: new ).toList ( );
    SequencedCollection < TypeTwo > sourceOfTwos = IntStream.rangeClosed ( 1 , limit ).mapToObj ( TypeTwo :: new ).toList ( );
    
    NavigableSet < TypeOne > ones = new TreeSet <> ( Comparator.comparingInt ( TypeOne :: id ) );
    ones.addAll ( sourceOfOnes );
    
    NavigableSet < TypeTwo > twos = new TreeSet <> ( Comparator.comparingInt ( TypeTwo :: id ) );
    twos.addAll ( sourceOfTwos );
    
    long start = System.nanoTime ( );
    for ( TypeOne one : ones )
    {
        Optional < TypeTwo > optionalTwo = twos.parallelStream ( ).filter ( two -> two.id ( ) == one.id ( ) ).findAny ( );
        if ( optionalTwo.isEmpty ( ) )
        {
            System.out.println ( "FAILED to find " + one );
        }
    }
    Duration elapsed = Duration.ofNanos ( System.nanoTime ( ) - start );
    System.out.println ( "elapsed = " + elapsed );
    

    No fancy testing, I just executed this around five times on a MacBook Pro (18,1) with Apple M1 Pro chip, 10 cores (8 performance and 2 efficiency), 16 GB RAM, macOS Sonoma 14.7.2, Java 23.0.1, IntelliJ IDEA 2024.3.1 (Ultimate Edition), relatively quiet (little other CPU use). I ran two batches of tests, each batch runs the stream in one thread versus in parallel:

    Results:

    stream parallelStream
    ≈ 7.5 seconds ≈ 4 seconds