java multithreading parallel-stream

Java ParallelStream: several maps or a single map?


Introduction

I'm currently developing a program in which I use java.util.Collection.parallelStream(), and I'm wondering whether it's possible to make it more multi-threaded.

Several small map

I was wondering whether using multiple map calls might allow java.util.Collection.parallelStream() to distribute the tasks better:

List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(gson::toJson)
        .map(Document::parse)
        .map(InsertOneModel::new)
        .toList();

Single big map

That is, a better distribution than with a single big map:

List<InsertOneModel<Document>> bulkWrites = puzzles.parallelStream()
        .map(puzzle -> new InsertOneModel<>(Document.parse(gson.toJson(puzzle))))
        .toList();

Question

Is one of these solutions more suitable for java.util.Collection.parallelStream(), or is there no big difference between the two?


Solution

  • I don't think it will do any better if you chain multiple map calls. Unless your code is very complex, I would prefer to use a single big map.
    To understand this we have to check the code inside the map function (java.util.stream.ReferencePipeline.map). link

    public final <R> Stream<R> map(Function<? super P_OUT, ? extends R> mapper) {
        Objects.requireNonNull(mapper);
        return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT) {
            @Override
            Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
                return new Sink.ChainedReference<P_OUT, R>(sink) {
                    @Override
                    public void accept(P_OUT u) {
                        downstream.accept(mapper.apply(u));
                    }
                };
            }
        };
    }
    

    As you can see, quite a lot happens behind the scenes: a new pipeline stage and a new Sink are created, and several methods are called. All of this is repeated for every chained map call.
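    To make this concrete, here is a minimal, self-contained sketch of my own (the class name MapFusionDemo and the simple String/Integer types are stand-ins for the Gson/MongoDB classes from the question). It shows that two chained map stages can be fused into one by composing the functions yourself; the results are identical, but the fused version builds one pipeline stage instead of two:

    import java.util.List;
    import java.util.function.Function;

    public class MapFusionDemo {
        public static void main(String[] args) {
            List<String> words = List.of("1", "2", "3");

            // Two chained stages: each map() builds its own StatelessOp and Sink.
            List<Integer> chained = words.stream()
                    .map(Integer::parseInt)
                    .map(n -> n * n)
                    .toList();

            // One fused stage: the same work expressed as a single composed function.
            Function<String, Integer> parse = Integer::parseInt;
            List<Integer> fused = words.stream()
                    .map(parse.andThen(n -> n * n))
                    .toList();

            System.out.println(chained.equals(fused)); // prints true
        }
    }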

    Now, coming back to parallel streams: they work on the concept of parallelism.
    Streams Documentation

    A parallel stream is a stream that splits its elements into multiple chunks, processing each chunk with a different thread. Thus, you can automatically partition the workload of a given operation on all the cores of your multicore processor and keep all of them equally busy.
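    A tiny illustrative sketch of my own (the class name ParallelChunksDemo is hypothetical) that makes this chunking visible by printing which worker thread handles each element:

    import java.util.stream.IntStream;

    public class ParallelChunksDemo {
        public static void main(String[] args) {
            // Each chunk of the range is handled by a worker thread of the common
            // ForkJoinPool (the calling thread also participates in the work).
            IntStream.rangeClosed(1, 8)
                     .parallel()
                     .forEach(i -> System.out.println(
                             i + " processed by " + Thread.currentThread().getName()));
        }
    }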

    Parallel streams internally use the default ForkJoinPool, which by default has as many threads as you have processors, as returned by Runtime.getRuntime().availableProcessors(). But you can change the size of this pool using the system property java.util.concurrent.ForkJoinPool.common.parallelism.
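    A small sketch of how you could inspect and override that setting (the class name ParallelismDemo is my own; note that the property must be set before anything touches the common pool, e.g. via -D on the command line or at the very start of main):

    import java.util.concurrent.ForkJoinPool;

    public class ParallelismDemo {
        public static void main(String[] args) {
            // Must run before the common pool is first used, otherwise it has no effect.
            System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");

            System.out.println("Available processors   : " + Runtime.getRuntime().availableProcessors());
            System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
        }
    }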

    A parallel stream calls spliterator() on the collection, which returns a Spliterator implementation that provides the logic for splitting the task. Every source or collection has its own spliterator implementation. Using these spliterators, the parallel stream keeps splitting the task as far as possible; when a sub-task becomes small enough it is executed sequentially, and the partial results of all sub-tasks are merged.
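    You can observe that splitting logic directly by calling trySplit() yourself; a small illustrative sketch (class name SpliteratorDemo is my own):

    import java.util.List;
    import java.util.Spliterator;

    public class SpliteratorDemo {
        public static void main(String[] args) {
            List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6, 7, 8);

            Spliterator<Integer> rest = numbers.spliterator();
            // trySplit() hands off roughly half of the remaining elements to a new
            // Spliterator; this is the mechanism a parallel stream uses to divide work.
            Spliterator<Integer> half = rest.trySplit();

            System.out.println("split-off part estimate: " + half.estimateSize()); // 4
            System.out.println("remaining part estimate: " + rest.estimateSize()); // 4
        }
    }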

    So I would prefer parallelStream when the performance implications justify it.

    Conclusion: A large amount of data and many computations done per element indicate that parallelism could be a good option.
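    If you want to check whether that holds for your own data, a rough timing sketch of my own (admittedly naive; a JMH benchmark would be more rigorous, and the class name RoughTiming and the expensive() helper are hypothetical) could look like this:

    import java.util.List;
    import java.util.stream.IntStream;

    public class RoughTiming {
        public static void main(String[] args) {
            List<Integer> data = IntStream.rangeClosed(1, 1_000_000).boxed().toList();

            long t0 = System.nanoTime();
            data.stream().map(RoughTiming::expensive).toList();
            long sequentialNanos = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            data.parallelStream().map(RoughTiming::expensive).toList();
            long parallelNanos = System.nanoTime() - t1;

            System.out.printf("sequential: %d ms, parallel: %d ms%n",
                    sequentialNanos / 1_000_000, parallelNanos / 1_000_000);
        }

        // Stand-in for "many computations done per element".
        private static double expensive(int n) {
            return Math.sin(n) * Math.cos(n) / (Math.sqrt(n) + 1.0);
        }
    }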