javaconcurrencyparallel-processingjava-streamparallelstream

Parallelising Java 8 parallelStream independent filters


I read all the related topics on StackOverflow and other sites to understand the mechanism of the parallelStream of Java8, and to understand whether the applied filters are running concurrently or not. Are they running in a parallel way without waiting for each other or not? Does it happen only when you have separated filters or not?

I read:

Java 8 Streams: multiple filters vs. complex condition

Java 8 Streams - Filter More Than One Condition

JavaScript: multiple filters chain vs. complex condition

Parallelising Java 8 Stream.filter()?

Filter and foreach java 8 stream as parallel

I could not find a simple answer to this simple question.

I did find answers about the performance differences and other things that related to the question whether to choose only one filter with a complex condition or multiple filters with one simple condition.

But this is not what I am trying to find out.

My question is: When I write something like:

Optional<List<Pokemon>> filteredPokemons = getPokemonsList()
            .map(pokemons -> pokemons.parallelStream() //parallel filter for each pokemon
                    .filter(pokemon -> isRed(pokemon.getColor()) && isAmazing(pokemon) && !pokemonAlreadyActivated(pokemon))
                    .collect(Collectors.toList()));

What is really happening , filter wise? Is the first condition in the filter, isRed(..), gets checked first and if it returns false, it doesn't proceed to check the rest of the conditions?

What will happen if I'll split this one filter into 3 different filters:

Optional<List<Pokemon>> filteredPokemons = getPokemonsList()
    .map(pokemons -> pokemons.parallelStream()
            .filter(pokemon -> isRed(pokemon.getColor()))
            .filter(pokemon -> isAmazing(pokemon))
            .filter(pokemon -> !pokemonAlreadyActivated(pokemon))
            .collect(Collectors.toList()));

I know that performance wise, the first option, with only one filter and complex condition is preferred (reference: Java 8 Streams: multiple filters vs. complex condition).

But what about concurrency? Because logically, it is seems like in the second option, with split filters, they can run in a parallel way without waiting for each other, so it should be faster and better, so how does the first option is still considered better?

Doesn't the first option makes the conditions wait for each other to be checked, unlike the separated filters which can run in a concurrent and parallel way without the need to depend on each other.

I'll be glad to understand the difference and to understand if separated filters are running concurrently unlike one filter with several conditions in it.


Solution

  • Consider the following example:

    stringList.parallelStream()
        .filter(string -> string != null)
        .filter(string -> !string.isEmpty())
        …
    

    This example is semantically equivalent to

    stringList.parallelStream()
        .filter(string -> string != null && !string.isEmpty())
        …
    

    and it should be clear that we have to obey the semantic of &&, as we can’t evaluate the second condition before we know whether the first condition is fulfilled, as otherwise, the code would break.

    Since the Stream implementation doesn’t know what the condition evaluation will do and whether they depend on each other like shown in the example, as all it gets is an opaque Predicate implementation, it has to assume that there could be such a dependency. So, the evaluation of the predicates can not be done in parallel for one element.

    In theory, the second filter stage could process an element which passed the filter while the first filter stage processes the next element, but as said in the answer you already linked yourself, this road has not been taken.

    As far as I know, there were experiments in that direction in early development phases, which led to the conclusion that the overhead is too high.

    Instead, the workload is split at the source, like splitting the list into sublists, and the parts are processed sequentially, passing each element through the pipeline in a fused operation. This per-element operation requires no thread synchronization in the best case and is JIT friendly.