Tags: java, performance, jvm, garbage-collection, garbage

Why does this test take longer without garbage collection overhead?


I ran into this scenario in the process of developing a lightweight library for asynchronous messaging. Trying to get an idea of the cost of creating lots of medium-sized objects with short lifetimes, I wrote the following test:

import java.nio.ByteBuffer;
import java.util.Random;


public class MemPressureTest {
    static final int SIZE = 4096;
    static final class Bigish {
        final ByteBuffer b;


        public Bigish() {
            this(ByteBuffer.allocate(SIZE));
        }

        public Bigish(ByteBuffer b) {
            this.b = b;
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
        }
    }


    public static void main(String[] args) {
        Random random = new Random(1);
        Bigish tmp = new Bigish();
        for (int i = 0; i < 3e7; ++i) {
            tmp.fill((byte)random.nextInt(255));
        }
    }
}

On my laptop, with default GC settings, it runs in about 95 seconds:

/tmp$ time java -Xlog:gc MemPressureTest
[0.006s][info][gc] Using G1

real    1m35.552s
user    1m33.658s
sys 0m0.428s

This is where things get strange. I tweaked the program to allocate a new object for each iteration:

...
        Random random = new Random(1);
        for (int i = 0; i < 3e7; ++i) {
            Bigish tmp = new Bigish();
            tmp.fill((byte)random.nextInt(255));
        }
...

In theory, this should add some small overhead, but none of the objects should ever be promoted out of Eden. At best, I'd expect the runtimes to be close to identical. However, this test completes in ~17 seconds:

/tmp$ time java -Xlog:gc MemPressureTest
[0.007s][info][gc] Using G1
[0.090s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 23M->1M(130M) 1.304ms
[0.181s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(130M) 0.870ms
[0.247s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 76M->0M(130M) 0.844ms
[0.317s][info][gc] GC(3) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.793ms
[0.381s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.859ms
[lots of similar GC pauses, snipped for brevity]
[16.608s][info][gc] GC(482) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.765ms
[16.643s][info][gc] GC(483) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.580ms
[16.676s][info][gc] GC(484) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.841ms

real    0m16.766s
user    0m16.578s
sys 0m0.576s

I ran both versions several times, with near identical results to the above. I feel like I must be missing something very obvious. Am I going insane? What could explain this difference in performance?

=== EDIT ===

I rewrote the test using JMH as per apangin and dan1st's suggestions:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.ByteBuffer;
import java.util.Random;


public class MemPressureTest {
    static final int SIZE = 4096;

    @State(Scope.Benchmark)
    public static class Bigish {
        final ByteBuffer b;
        private Blackhole blackhole;


        @Setup(Level.Trial)
        public void setup(Blackhole blackhole) {
            this.blackhole = blackhole;
        }

        public Bigish() {
            this.b = ByteBuffer.allocate(SIZE);
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
            blackhole.consume(b);
        }
    }

    static Random random = new Random(1);


    @Benchmark
    public static void test1(Blackhole blackhole) {
        Bigish tmp = new Bigish();
        tmp.setup(blackhole);
        tmp.fill((byte)random.nextInt(255));
        blackhole.consume(tmp);
    }

    @Benchmark
    public static void test2(Bigish perm) {
        perm.fill((byte) random.nextInt(255));
    }
}
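
For reference, the results below were produced via Gradle's jmh task; an equivalent standalone launcher using the standard JMH runner API would look roughly like the sketch below (the BenchRunner class name is just an illustration, and the fork/iteration counts simply mirror the logs that follow):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Hypothetical standalone launcher; the benchmarks below were actually run via Gradle.
public class BenchRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("MemPressureTest")   // match both test1 and test2
                .forks(5)                     // 5 forks, as in the logs below
                .warmupIterations(5)          // 5 x 10 s warmup (JMH defaults to 10 s iterations)
                .measurementIterations(5)     // 5 x 10 s measurement
                .build();
        new Runner(opt).run();
    }
}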

Still, the second test is much slower:

> Task :jmh
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test1

# Run progress: 0.00% complete, ETA 00:16:40
# Fork: 1 of 5
# Warmup Iteration   1: 2183998.556 ops/s
# Warmup Iteration   2: 2281885.941 ops/s
# Warmup Iteration   3: 2239644.018 ops/s
# Warmup Iteration   4: 1608047.994 ops/s
# Warmup Iteration   5: 1992314.001 ops/s
Iteration   1: 2053657.571 ops/s
Iteration   2: 2054957.773 ops/s
Iteration   3: 2051595.233 ops/s
Iteration   4: 2054878.239 ops/s
Iteration   5: 2031111.214 ops/s

# Run progress: 10.00% complete, ETA 00:15:04
# Fork: 2 of 5
# Warmup Iteration   1: 2228594.345 ops/s
# Warmup Iteration   2: 2257983.355 ops/s
# Warmup Iteration   3: 2063130.244 ops/s
# Warmup Iteration   4: 1629084.669 ops/s
# Warmup Iteration   5: 2063018.496 ops/s
Iteration   1: 1939260.937 ops/s
Iteration   2: 1791414.018 ops/s
Iteration   3: 1914987.221 ops/s
Iteration   4: 1969484.898 ops/s
Iteration   5: 1891440.624 ops/s

# Run progress: 20.00% complete, ETA 00:13:23
# Fork: 3 of 5
# Warmup Iteration   1: 2228664.719 ops/s
# Warmup Iteration   2: 2263677.403 ops/s
# Warmup Iteration   3: 2237032.464 ops/s
# Warmup Iteration   4: 2040040.243 ops/s
# Warmup Iteration   5: 2038848.530 ops/s
Iteration   1: 2023934.952 ops/s
Iteration   2: 2041874.437 ops/s
Iteration   3: 2002858.770 ops/s
Iteration   4: 2039727.607 ops/s
Iteration   5: 2045827.670 ops/s

# Run progress: 30.00% complete, ETA 00:11:43
# Fork: 4 of 5
# Warmup Iteration   1: 2105430.688 ops/s
# Warmup Iteration   2: 2279387.762 ops/s
# Warmup Iteration   3: 2228346.691 ops/s
# Warmup Iteration   4: 1438607.183 ops/s
# Warmup Iteration   5: 2059319.745 ops/s
Iteration   1: 1112543.932 ops/s
Iteration   2: 1977077.976 ops/s
Iteration   3: 2040147.355 ops/s
Iteration   4: 1975766.032 ops/s
Iteration   5: 2003532.092 ops/s

# Run progress: 40.00% complete, ETA 00:10:02
# Fork: 5 of 5
# Warmup Iteration   1: 2240203.848 ops/s
# Warmup Iteration   2: 2245673.994 ops/s
# Warmup Iteration   3: 2096257.864 ops/s
# Warmup Iteration   4: 2046527.740 ops/s
# Warmup Iteration   5: 2050379.941 ops/s
Iteration   1: 2050691.989 ops/s
Iteration   2: 2057803.100 ops/s
Iteration   3: 2058634.766 ops/s
Iteration   4: 2060596.595 ops/s
Iteration   5: 2061282.107 ops/s


Result "com.xxx.MemPressureTest.test1":
  1972203.484 ±(99.9%) 142904.698 ops/s [Average]
  (min, avg, max) = (1112543.932, 1972203.484, 2061282.107), stdev = 190773.683
  CI (99.9%): [1829298.786, 2115108.182] (assumes normal distribution)


# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test2

# Run progress: 50.00% complete, ETA 00:08:22
# Fork: 1 of 5
# Warmup Iteration   1: 282751.407 ops/s
# Warmup Iteration   2: 283333.984 ops/s
# Warmup Iteration   3: 293785.079 ops/s
# Warmup Iteration   4: 268403.105 ops/s
# Warmup Iteration   5: 280054.277 ops/s
Iteration   1: 279093.118 ops/s
Iteration   2: 282782.996 ops/s
Iteration   3: 282688.921 ops/s
Iteration   4: 291578.963 ops/s
Iteration   5: 294835.777 ops/s

# Run progress: 60.00% complete, ETA 00:06:41
# Fork: 2 of 5
# Warmup Iteration   1: 283735.550 ops/s
# Warmup Iteration   2: 283536.547 ops/s
# Warmup Iteration   3: 294403.173 ops/s
# Warmup Iteration   4: 284161.042 ops/s
# Warmup Iteration   5: 281719.077 ops/s
Iteration   1: 276838.416 ops/s
Iteration   2: 284063.117 ops/s
Iteration   3: 282361.985 ops/s
Iteration   4: 289125.092 ops/s
Iteration   5: 294236.625 ops/s

# Run progress: 70.00% complete, ETA 00:05:01
# Fork: 3 of 5
# Warmup Iteration   1: 284567.336 ops/s
# Warmup Iteration   2: 283548.713 ops/s
# Warmup Iteration   3: 294317.511 ops/s
# Warmup Iteration   4: 283501.873 ops/s
# Warmup Iteration   5: 283691.306 ops/s
Iteration   1: 283462.749 ops/s
Iteration   2: 284120.587 ops/s
Iteration   3: 264878.952 ops/s
Iteration   4: 292681.168 ops/s
Iteration   5: 295279.759 ops/s

# Run progress: 80.00% complete, ETA 00:03:20
# Fork: 4 of 5
# Warmup Iteration   1: 284823.519 ops/s
# Warmup Iteration   2: 283913.207 ops/s
# Warmup Iteration   3: 294401.483 ops/s
# Warmup Iteration   4: 283998.027 ops/s
# Warmup Iteration   5: 283987.408 ops/s
Iteration   1: 278014.618 ops/s
Iteration   2: 283431.491 ops/s
Iteration   3: 284465.945 ops/s
Iteration   4: 293202.934 ops/s
Iteration   5: 290059.807 ops/s

# Run progress: 90.00% complete, ETA 00:01:40
# Fork: 5 of 5
# Warmup Iteration   1: 285598.809 ops/s
# Warmup Iteration   2: 284434.916 ops/s
# Warmup Iteration   3: 294355.547 ops/s
# Warmup Iteration   4: 284307.860 ops/s
# Warmup Iteration   5: 284297.362 ops/s
Iteration   1: 283676.043 ops/s
Iteration   2: 283609.750 ops/s
Iteration   3: 284575.124 ops/s
Iteration   4: 293564.269 ops/s
Iteration   5: 216267.883 ops/s


Result "com.xxx.MemPressureTest.test2":
  282755.844 ±(99.9%) 11599.112 ops/s [Average]
  (min, avg, max) = (216267.883, 282755.844, 295279.759), stdev = 15484.483
  CI (99.9%): [271156.731, 294354.956] (assumes normal distribution)

The JMH Blackhole should prevent code removal, and the fact that JMH is now in charge of running separate iterations should prevent parallelization, right? Shouldn't the Blackhole also stop the object from being confined to the stack? Also, wouldn't there be more variation between warmup iterations if HotSpot were still doing a significant amount of optimization?


Solution

  • Creating a new ByteBuffer right before filling it indeed helps the JIT compiler produce better-optimized code when you use relative put methods, and here is why.

    1. The JIT compilation unit is a method. The HotSpot JVM does not perform whole-program optimization, which is quite hard even in theory due to the dynamic nature of Java and its open-world runtime environment.
    2. When the JVM compiles the test1 method, the buffer instantiation appears in the same compilation scope as the filling:
    Bigish tmp = new Bigish();
    tmp.setup(blackhole);
    tmp.fill((byte)random.nextInt(255));
    

    The JVM knows everything about the created buffer: its exact size and its backing array. It knows the buffer has not been published yet, so no other thread can see it. Therefore, the JVM can aggressively optimize the fill loop: vectorize it using AVX instructions and unroll it to set 512 bytes at a time:

      0x000001cdf60c9ae0:   mov    %r9d,%r8d
      0x000001cdf60c9ae3:   movslq %r8d,%r9
      0x000001cdf60c9ae6:   add    %r11,%r9
      0x000001cdf60c9ae9:   vmovdqu %ymm0,0x10(%rcx,%r9,1)
      0x000001cdf60c9af0:   vmovdqu %ymm0,0x30(%rcx,%r9,1)
      0x000001cdf60c9af7:   vmovdqu %ymm0,0x50(%rcx,%r9,1)
      0x000001cdf60c9afe:   vmovdqu %ymm0,0x70(%rcx,%r9,1)
      0x000001cdf60c9b05:   vmovdqu %ymm0,0x90(%rcx,%r9,1)
      0x000001cdf60c9b0f:   vmovdqu %ymm0,0xb0(%rcx,%r9,1)
      0x000001cdf60c9b19:   vmovdqu %ymm0,0xd0(%rcx,%r9,1)
      0x000001cdf60c9b23:   vmovdqu %ymm0,0xf0(%rcx,%r9,1)
      0x000001cdf60c9b2d:   vmovdqu %ymm0,0x110(%rcx,%r9,1)
      0x000001cdf60c9b37:   vmovdqu %ymm0,0x130(%rcx,%r9,1)
      0x000001cdf60c9b41:   vmovdqu %ymm0,0x150(%rcx,%r9,1)
      0x000001cdf60c9b4b:   vmovdqu %ymm0,0x170(%rcx,%r9,1)
      0x000001cdf60c9b55:   vmovdqu %ymm0,0x190(%rcx,%r9,1)
      0x000001cdf60c9b5f:   vmovdqu %ymm0,0x1b0(%rcx,%r9,1)
      0x000001cdf60c9b69:   vmovdqu %ymm0,0x1d0(%rcx,%r9,1)
      0x000001cdf60c9b73:   vmovdqu %ymm0,0x1f0(%rcx,%r9,1)
      0x000001cdf60c9b7d:   mov    %r8d,%r9d
      0x000001cdf60c9b80:   add    $0x200,%r9d
      0x000001cdf60c9b87:   cmp    %r10d,%r9d
      0x000001cdf60c9b8a:   jl     0x000001cdf60c9ae0
    
    3. You use the relative put method. It not only sets a byte in the ByteBuffer, but also updates the position field. Note that the vectorized loop above does not update the position in memory: the JVM sets it just once after the loop, which it is allowed to do as long as nobody can observe an inconsistent state of the buffer. A simplified model of the relative put is sketched below.
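
    Conceptually, a relative put behaves like the simplified model below. This is a hedged sketch for illustration only, not the JDK source (the real HeapByteBuffer.put delegates to Buffer.nextPutIndex, visible in the disassembly further down):

    import java.nio.BufferOverflowException;

    // Simplified, hypothetical model of a heap buffer's relative put (not the JDK code):
    // every call checks the limit and advances the position field.
    final class MiniBuffer {
        private final byte[] hb;
        private final int limit;
        private int position;

        MiniBuffer(int capacity) {
            this.hb = new byte[capacity];
            this.limit = capacity;
        }

        MiniBuffer put(byte x) {
            if (position >= limit)
                throw new BufferOverflowException();
            hb[position] = x;   // store the byte in the backing array
            position++;         // observable side effect on every call
            return this;
        }
    }
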
    4. Now try to publish the ByteBuffer before filling:
    Bigish tmp = new Bigish();
    volatileField = tmp;  // publish
    tmp.setup(blackhole);
    tmp.fill((byte)random.nextInt(255));
    

    The loop optimization breaks; now the array bytes are filled one by one, and the position field is incremented accordingly.

      0x000001829b18ca5c:   nopl   0x0(%rax)
      0x000001829b18ca60:   cmp    %r11d,%esi
      0x000001829b18ca63:   jge    0x000001829b18ce34           ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
                                                                ; - java.nio.Buffer::nextPutIndex@10 (line 721)
                                                                ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                                ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                                ; - bench.MemPressureTest::test1@28 (line 47)
      0x000001829b18ca69:   mov    %esi,%ecx
      0x000001829b18ca6b:   add    %edx,%ecx                    ;*getfield position {reexecute=0 rethrow=0 return_oop=0}
                                                                ; - java.nio.Buffer::nextPutIndex@1 (line 720)
                                                                ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                                ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                                ; - bench.MemPressureTest::test1@28 (line 47)
      0x000001829b18ca6d:   mov    %esi,%eax
      0x000001829b18ca6f:   inc    %eax                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                                ; - bench.MemPressureTest$Bigish::fill@26 (line 32)
                                                                ; - bench.MemPressureTest::test1@28 (line 47)
      0x000001829b18ca71:   mov    %eax,0x18(%r10)              ;*putfield position {reexecute=0 rethrow=0 return_oop=0}
                                                                ; - java.nio.Buffer::nextPutIndex@25 (line 723)
                                                                ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                                ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                                ; - bench.MemPressureTest::test1@28 (line 47)
      0x000001829b18ca75:   cmp    %r8d,%ecx
      0x000001829b18ca78:   jae    0x000001829b18ce14
      0x000001829b18ca7e:   movslq %esi,%r9
      0x000001829b18ca81:   add    %r14,%r9
      0x000001829b18ca84:   mov    %bl,0x10(%rdi,%r9,1)         ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                                                                ; - java.nio.HeapByteBuffer::put@13 (line 209)
                                                                ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                                ; - bench.MemPressureTest::test1@28 (line 47)
      0x000001829b18ca89:   cmp    $0x1000,%eax
      0x000001829b18ca8f:   jge    0x000001829b18ca95           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                ; - bench.MemPressureTest$Bigish::fill@14 (line 32)
                                                                ; - bench.MemPressureTest::test1@28 (line 47)
      0x000001829b18ca91:   mov    %eax,%esi
      0x000001829b18ca93:   jmp    0x000001829b18ca5c
    

    That's exactly what happens in test2. Since the ByteBuffer object is external to the compilation scope, the JIT can't optimize it as freely as a local, not-yet-published object.

    5. Is it possible at all to optimize the fill loop in the case of an external buffer?

    The good news: it is possible. Just use the absolute put method instead of the relative one. In this case, the position field remains unchanged, and the JIT can easily vectorize the loop without any risk of breaking ByteBuffer invariants.

    for (int i = 0; i < SIZE; ++i) {
        b.put(i, bt);
    }
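
    Put into context, a fill method rewritten this way might look like the sketch below (an illustration, not code from the original post); since absolute puts never touch position, the b.clear() call is no longer needed to reset it:

    public void fill(byte bt) {
        // absolute puts leave position and limit untouched,
        // so the JIT can vectorize this loop even for an external, already published buffer
        for (int i = 0; i < SIZE; ++i) {
            b.put(i, bt);
        }
    }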
    

    With this change, the loop is vectorized in both cases. Even better, test2 now becomes a lot faster than test1, proving that object creation indeed has a performance overhead.

    Benchmark               Mode  Cnt      Score     Error   Units
    MemPressureTest.test1  thrpt   10   2447,370 ± 146,804  ops/ms
    MemPressureTest.test2  thrpt   10  15677,575 ± 136,075  ops/ms
    

    Conclusion

    1. The counterintuitive performance difference was caused by the JVM's inability to vectorize the fill loop when the ByteBuffer creation is not in the compilation scope.
    2. Prefer absolute get/put methods to relative ones where possible. Absolute methods are usually much faster, since they do not update the internal state of the ByteBuffer, and the JIT can apply more aggressive optimizations.
    3. Object creation indeed has an overhead, as the modified benchmark shows.