Tags: java, java-stream

Do Java streams allocate significant space while streaming over an array?


Let's say we have a piece of code such as:

Arrays.stream(queries)
  .limit(queries.length - 1).mapToInt(i -> i)
  .sum();

where queries is an array of N integers. To make the question concrete, say it holds 1 million integers, so the array takes up ~4 MB (1M * 4 bytes per int).

Will the stream itself take up any considerable space? Or will we use roughly 4 MB and stream through the array without having to allocate another copy of it to run the code above (not counting the space the JVM itself needs)?


Solution

  • The answer is:

    [A] an implementation detail. The Java spec simply isn't going to tell you, so any exact answer needs the caveat '... on this hardware, this OS, this VM implementation, this version, under these circumstances'. However...

    [B] Whatever the answer is, it is both 'pretty fast / not much space', and it is definitely 'not dependent on the value of N'.

    The 'Stream' in Stream is not chosen for fun: the stream API does, in fact, stream. It does not take the entire array and make a new object containing all values ready for streaming; limit does not then make yet another giant array (one element smaller); and mapToInt does not make yet another. That is not how it works.

    Stream is a pipeline. Nothing happens until you run a terminal command (sum is terminal). You can check this:

    Arrays.stream(queries).mapToInt(i -> {
      System.out.println(i); // never runs: no terminal operation yet
      return i.intValue();
    });
    

    This doesn't print anything at all, because it is only half of a stream pipeline: without a terminal operation, nothing 'flows'.

    If you invoke sum on the above, THEN the prints start happening. Specifically, the terminal operation (sum(), here) starts 'pulling values from the stream'. This travels up: sum asks mapToInt for a value; to provide one, mapToInt asks limit for a value (and will then roll that value through the i -> i lambda and hand the result to sum). limit in turn asks Arrays.stream for a value, and that actually reads a single item from the array. There ARE intermediate tracker objects involved, but their size doesn't depend on N. For example, the object returned by Arrays.stream(queries) holds a reference to the queries array (about 64 bits of data regardless of how large that array is; it's just a pointer) and an int that tracks where we are.[1]
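    Both the laziness and the element count are easy to observe with a side-effect counter. A sketch (the touched counter and the concrete values are just for illustration, not from the question):

    ```java
    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.stream.IntStream;

    public class PullDemo {
        public static void main(String[] args) {
            Integer[] queries = {10, 20, 30, 40};
            AtomicInteger touched = new AtomicInteger(); // counts values flowing through mapToInt

            IntStream pipeline = Arrays.stream(queries)
                    .limit(queries.length - 1)
                    .mapToInt(i -> { touched.incrementAndGet(); return i; });

            System.out.println(touched.get()); // 0: building the pipeline touches nothing

            int sum = pipeline.sum();          // terminal op: values are now pulled one by one
            System.out.println(sum);           // 60 (10 + 20 + 30)
            System.out.println(touched.get()); // 3: only the limited elements ever flowed
        }
    }
    ```

    No array copy is ever made; each value flows through the lambda exactly once, on demand.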

    The object representing the limit step just has a single int that tracks how many values have been provided so far. limit acts as if there are no more values to provide when either the thing it pulls from runs out or limit items have already been provided, whichever occurs sooner.
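    The 'whichever occurs sooner' part is easy to verify: if the source runs dry before the limit is reached, the stream simply ends. A quick sketch:

    ```java
    import java.util.stream.IntStream;

    public class LimitDemo {
        public static void main(String[] args) {
            // limit(5) over a two-element source: the stream ends after 2 values.
            long n = IntStream.of(1, 2).limit(5).count();
            System.out.println(n); // 2, not 5
        }
    }
    ```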

    And so on. Thus, exactly how big those tracker objects are is an implementation detail, but they are 'small' (at least, relative to a million-int array!) and do not depend on the size of the stream. In fact, literally infinite streams can exist, no problem. And they do: check the API of Stream itself, where you can rather easily make a stream that returns an infinite number of 1 values, for example.
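    For instance, IntStream.generate makes exactly such an infinite stream; as long as a later operation bounds how much is pulled, memory use stays constant. A sketch:

    ```java
    import java.util.stream.IntStream;

    public class InfiniteDemo {
        public static void main(String[] args) {
            // An endless stream of 1s. Only 5 values are ever produced,
            // because sum() stops pulling once limit(5) reports it is done.
            int total = IntStream.generate(() -> 1).limit(5).sum();
            System.out.println(total); // 5
        }
    }
    ```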

    [1] I'm oversimplifying. Streams also have the property that, depending on a few circumstances, they can be parallelized. When you involve counters, parallelizing becomes very difficult, so these trackers are a bit more complicated. If you want the full details, look at Spliterator and StreamSupport. But this oversimplified explanation is sufficient to understand that no amount of stream intermediate ops will put you at risk of running out of memory.
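    You can even poke at the pull machinery directly by grabbing a pipeline's Spliterator and advancing it by hand; each tryAdvance call pulls exactly one value from the source. A sketch (the sample values are just for illustration):

    ```java
    import java.util.Arrays;
    import java.util.Spliterator;

    public class SpliteratorDemo {
        public static void main(String[] args) {
            int[] queries = {7, 8, 9};
            Spliterator.OfInt split = Arrays.stream(queries).spliterator();

            // Pull a single value from the underlying array, on demand.
            split.tryAdvance((int v) -> System.out.println("pulled " + v)); // pulled 7
            System.out.println("remaining estimate: " + split.estimateSize());
        }
    }
    ```

    For an array-backed source like this, the spliterator is just a pointer plus an index, regardless of how large the array is.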