I ran into this scenario in the process of developing a lightweight library for asynchronous messaging. Trying to get an idea of the cost of creating lots of medium sized objects with short lifetimes, I wrote the following test:
import java.nio.ByteBuffer;
import java.util.Random;

public class MemPressureTest {
    static final int SIZE = 4096;

    static final class Bigish {
        final ByteBuffer b;

        public Bigish() {
            this(ByteBuffer.allocate(SIZE));
        }

        public Bigish(ByteBuffer b) {
            this.b = b;
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
        }
    }

    public static void main(String[] args) {
        Random random = new Random(1);
        Bigish tmp = new Bigish();
        for (int i = 0; i < 3e7; ++i) {
            tmp.fill((byte) random.nextInt(255));
        }
    }
}
On my laptop, with default GC settings, it runs in about 95 seconds:
/tmp$ time java -Xlog:gc MemPressureTest
[0.006s][info][gc] Using G1
real 1m35.552s
user 1m33.658s
sys 0m0.428s
This is where things get strange. I tweaked the program to allocate a new object for each iteration:
...
Random random = new Random(1);
for (int i = 0; i < 3e7; ++i) {
    Bigish tmp = new Bigish();
    tmp.fill((byte) random.nextInt(255));
}
...
In theory, this should add some small overhead, but none of the objects should ever be promoted out of Eden. At best, I'd expect the runtimes to be close to identical. However, this test completes in ~17 seconds:
/tmp$ time java -Xlog:gc MemPressureTest
[0.007s][info][gc] Using G1
[0.090s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 23M->1M(130M) 1.304ms
[0.181s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(130M) 0.870ms
[0.247s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 76M->0M(130M) 0.844ms
[0.317s][info][gc] GC(3) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.793ms
[0.381s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.859ms
[lots of similar GC pauses, snipped for brevity]
[16.608s][info][gc] GC(482) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.765ms
[16.643s][info][gc] GC(483) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.580ms
[16.676s][info][gc] GC(484) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.841ms
real 0m16.766s
user 0m16.578s
sys 0m0.576s
I ran both versions several times, with near identical results to the above. I feel like I must be missing something very obvious. Am I going insane? What could explain this difference in performance?
=== EDIT ===
I rewrote the test using JMH as per apangin and dan1st's suggestions:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.ByteBuffer;
import java.util.Random;

public class MemPressureTest {
    static final int SIZE = 4096;

    @State(Scope.Benchmark)
    public static class Bigish {
        final ByteBuffer b;
        private Blackhole blackhole;

        @Setup(Level.Trial)
        public void setup(Blackhole blackhole) {
            this.blackhole = blackhole;
        }

        public Bigish() {
            this.b = ByteBuffer.allocate(SIZE);
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
            blackhole.consume(b);
        }
    }

    static Random random = new Random(1);

    @Benchmark
    public static void test1(Blackhole blackhole) {
        Bigish tmp = new Bigish();
        tmp.setup(blackhole);
        tmp.fill((byte) random.nextInt(255));
        blackhole.consume(tmp);
    }

    @Benchmark
    public static void test2(Bigish perm) {
        perm.fill((byte) random.nextInt(255));
    }
}
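For reference, the results below come from the Gradle jmh task; an equivalent programmatic launcher (just a sketch using the standard JMH Runner API, with a hypothetical MemPressureRunner class that is not part of the original setup) would look something like this:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class MemPressureRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MemPressureTest.class.getSimpleName())
                .forks(5)                 // 5 forks per benchmark, matching the run below
                .warmupIterations(5)      // 5 warmup iterations per fork
                .measurementIterations(5) // 5 measurement iterations per fork
                .build();
        new Runner(opt).run();
    }
}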
Still, the second test is much slower:
> Task :jmh
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test1
# Run progress: 0.00% complete, ETA 00:16:40
# Fork: 1 of 5
# Warmup Iteration 1: 2183998.556 ops/s
# Warmup Iteration 2: 2281885.941 ops/s
# Warmup Iteration 3: 2239644.018 ops/s
# Warmup Iteration 4: 1608047.994 ops/s
# Warmup Iteration 5: 1992314.001 ops/s
Iteration 1: 2053657.571 ops/s
Iteration 2: 2054957.773 ops/s
Iteration 3: 2051595.233 ops/s
Iteration 4: 2054878.239 ops/s
Iteration 5: 2031111.214 ops/s
# Run progress: 10.00% complete, ETA 00:15:04
# Fork: 2 of 5
# Warmup Iteration 1: 2228594.345 ops/s
# Warmup Iteration 2: 2257983.355 ops/s
# Warmup Iteration 3: 2063130.244 ops/s
# Warmup Iteration 4: 1629084.669 ops/s
# Warmup Iteration 5: 2063018.496 ops/s
Iteration 1: 1939260.937 ops/s
Iteration 2: 1791414.018 ops/s
Iteration 3: 1914987.221 ops/s
Iteration 4: 1969484.898 ops/s
Iteration 5: 1891440.624 ops/s
# Run progress: 20.00% complete, ETA 00:13:23
# Fork: 3 of 5
# Warmup Iteration 1: 2228664.719 ops/s
# Warmup Iteration 2: 2263677.403 ops/s
# Warmup Iteration 3: 2237032.464 ops/s
# Warmup Iteration 4: 2040040.243 ops/s
# Warmup Iteration 5: 2038848.530 ops/s
Iteration 1: 2023934.952 ops/s
Iteration 2: 2041874.437 ops/s
Iteration 3: 2002858.770 ops/s
Iteration 4: 2039727.607 ops/s
Iteration 5: 2045827.670 ops/s
# Run progress: 30.00% complete, ETA 00:11:43
# Fork: 4 of 5
# Warmup Iteration 1: 2105430.688 ops/s
# Warmup Iteration 2: 2279387.762 ops/s
# Warmup Iteration 3: 2228346.691 ops/s
# Warmup Iteration 4: 1438607.183 ops/s
# Warmup Iteration 5: 2059319.745 ops/s
Iteration 1: 1112543.932 ops/s
Iteration 2: 1977077.976 ops/s
Iteration 3: 2040147.355 ops/s
Iteration 4: 1975766.032 ops/s
Iteration 5: 2003532.092 ops/s
# Run progress: 40.00% complete, ETA 00:10:02
# Fork: 5 of 5
# Warmup Iteration 1: 2240203.848 ops/s
# Warmup Iteration 2: 2245673.994 ops/s
# Warmup Iteration 3: 2096257.864 ops/s
# Warmup Iteration 4: 2046527.740 ops/s
# Warmup Iteration 5: 2050379.941 ops/s
Iteration 1: 2050691.989 ops/s
Iteration 2: 2057803.100 ops/s
Iteration 3: 2058634.766 ops/s
Iteration 4: 2060596.595 ops/s
Iteration 5: 2061282.107 ops/s
Result "com.xxx.MemPressureTest.test1":
1972203.484 ±(99.9%) 142904.698 ops/s [Average]
(min, avg, max) = (1112543.932, 1972203.484, 2061282.107), stdev = 190773.683
CI (99.9%): [1829298.786, 2115108.182] (assumes normal distribution)
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test2
# Run progress: 50.00% complete, ETA 00:08:22
# Fork: 1 of 5
# Warmup Iteration 1: 282751.407 ops/s
# Warmup Iteration 2: 283333.984 ops/s
# Warmup Iteration 3: 293785.079 ops/s
# Warmup Iteration 4: 268403.105 ops/s
# Warmup Iteration 5: 280054.277 ops/s
Iteration 1: 279093.118 ops/s
Iteration 2: 282782.996 ops/s
Iteration 3: 282688.921 ops/s
Iteration 4: 291578.963 ops/s
Iteration 5: 294835.777 ops/s
# Run progress: 60.00% complete, ETA 00:06:41
# Fork: 2 of 5
# Warmup Iteration 1: 283735.550 ops/s
# Warmup Iteration 2: 283536.547 ops/s
# Warmup Iteration 3: 294403.173 ops/s
# Warmup Iteration 4: 284161.042 ops/s
# Warmup Iteration 5: 281719.077 ops/s
Iteration 1: 276838.416 ops/s
Iteration 2: 284063.117 ops/s
Iteration 3: 282361.985 ops/s
Iteration 4: 289125.092 ops/s
Iteration 5: 294236.625 ops/s
# Run progress: 70.00% complete, ETA 00:05:01
# Fork: 3 of 5
# Warmup Iteration 1: 284567.336 ops/s
# Warmup Iteration 2: 283548.713 ops/s
# Warmup Iteration 3: 294317.511 ops/s
# Warmup Iteration 4: 283501.873 ops/s
# Warmup Iteration 5: 283691.306 ops/s
Iteration 1: 283462.749 ops/s
Iteration 2: 284120.587 ops/s
Iteration 3: 264878.952 ops/s
Iteration 4: 292681.168 ops/s
Iteration 5: 295279.759 ops/s
# Run progress: 80.00% complete, ETA 00:03:20
# Fork: 4 of 5
# Warmup Iteration 1: 284823.519 ops/s
# Warmup Iteration 2: 283913.207 ops/s
# Warmup Iteration 3: 294401.483 ops/s
# Warmup Iteration 4: 283998.027 ops/s
# Warmup Iteration 5: 283987.408 ops/s
Iteration 1: 278014.618 ops/s
Iteration 2: 283431.491 ops/s
Iteration 3: 284465.945 ops/s
Iteration 4: 293202.934 ops/s
Iteration 5: 290059.807 ops/s
# Run progress: 90.00% complete, ETA 00:01:40
# Fork: 5 of 5
# Warmup Iteration 1: 285598.809 ops/s
# Warmup Iteration 2: 284434.916 ops/s
# Warmup Iteration 3: 294355.547 ops/s
# Warmup Iteration 4: 284307.860 ops/s
# Warmup Iteration 5: 284297.362 ops/s
Iteration 1: 283676.043 ops/s
Iteration 2: 283609.750 ops/s
Iteration 3: 284575.124 ops/s
Iteration 4: 293564.269 ops/s
Iteration 5: 216267.883 ops/s
Result "com.xxx.MemPressureTest.test2":
282755.844 ±(99.9%) 11599.112 ops/s [Average]
(min, avg, max) = (216267.883, 282755.844, 295279.759), stdev = 15484.483
CI (99.9%): [271156.731, 294354.956] (assumes normal distribution)
The JMH Blackhole should prevent code removal, and the fact that JMH is now in charge of running separate iterations should prevent parallelization, right? Shouldn't the Blackhole also stop the object from being confined to the stack? And wouldn't there be more variation between warmup iterations if HotSpot were still doing a significant amount of optimization?
=== ANSWER ===
Creating a new ByteBuffer right before filling indeed helps the JIT compiler produce better optimized code when you use the relative put methods, and here is why.

In the test1 method, the buffer instantiation appears in the same compilation scope as the filling:

Bigish tmp = new Bigish();
tmp.setup(blackhole);
tmp.fill((byte) random.nextInt(255));
The JVM knows everything about the created buffer: its exact size and its backing array. It also knows the buffer has not been published yet, so no other thread can see it. Therefore the JVM can aggressively optimize the fill loop: vectorize it using AVX instructions and unroll it to store 512 bytes at a time:
0x000001cdf60c9ae0: mov %r9d,%r8d
0x000001cdf60c9ae3: movslq %r8d,%r9
0x000001cdf60c9ae6: add %r11,%r9
0x000001cdf60c9ae9: vmovdqu %ymm0,0x10(%rcx,%r9,1)
0x000001cdf60c9af0: vmovdqu %ymm0,0x30(%rcx,%r9,1)
0x000001cdf60c9af7: vmovdqu %ymm0,0x50(%rcx,%r9,1)
0x000001cdf60c9afe: vmovdqu %ymm0,0x70(%rcx,%r9,1)
0x000001cdf60c9b05: vmovdqu %ymm0,0x90(%rcx,%r9,1)
0x000001cdf60c9b0f: vmovdqu %ymm0,0xb0(%rcx,%r9,1)
0x000001cdf60c9b19: vmovdqu %ymm0,0xd0(%rcx,%r9,1)
0x000001cdf60c9b23: vmovdqu %ymm0,0xf0(%rcx,%r9,1)
0x000001cdf60c9b2d: vmovdqu %ymm0,0x110(%rcx,%r9,1)
0x000001cdf60c9b37: vmovdqu %ymm0,0x130(%rcx,%r9,1)
0x000001cdf60c9b41: vmovdqu %ymm0,0x150(%rcx,%r9,1)
0x000001cdf60c9b4b: vmovdqu %ymm0,0x170(%rcx,%r9,1)
0x000001cdf60c9b55: vmovdqu %ymm0,0x190(%rcx,%r9,1)
0x000001cdf60c9b5f: vmovdqu %ymm0,0x1b0(%rcx,%r9,1)
0x000001cdf60c9b69: vmovdqu %ymm0,0x1d0(%rcx,%r9,1)
0x000001cdf60c9b73: vmovdqu %ymm0,0x1f0(%rcx,%r9,1)
0x000001cdf60c9b7d: mov %r8d,%r9d
0x000001cdf60c9b80: add $0x200,%r9d
0x000001cdf60c9b87: cmp %r10d,%r9d
0x000001cdf60c9b8a: jl 0x000001cdf60c9ae0
Note an important detail about the relative put method: it not only stores a byte into the ByteBuffer, it also updates the position field. The vectorized loop above does not update the position in memory; the JVM sets it just once after the loop, which it is allowed to do as long as nobody can observe an inconsistent state of the buffer.

Now try publishing the object before filling it, e.g. by storing it into a volatile field added just for this experiment:

Bigish tmp = new Bigish();
volatileField = tmp; // publish
tmp.setup(blackhole);
tmp.fill((byte) random.nextInt(255));
The loop optimization breaks; now the array bytes are filled one by one, and the position field is incremented accordingly.
0x000001829b18ca5c: nopl 0x0(%rax)
0x000001829b18ca60: cmp %r11d,%esi
0x000001829b18ca63: jge 0x000001829b18ce34 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
; - java.nio.Buffer::nextPutIndex@10 (line 721)
; - java.nio.HeapByteBuffer::put@6 (line 209)
; - bench.MemPressureTest$Bigish::fill@22 (line 33)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca69: mov %esi,%ecx
0x000001829b18ca6b: add %edx,%ecx ;*getfield position {reexecute=0 rethrow=0 return_oop=0}
; - java.nio.Buffer::nextPutIndex@1 (line 720)
; - java.nio.HeapByteBuffer::put@6 (line 209)
; - bench.MemPressureTest$Bigish::fill@22 (line 33)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca6d: mov %esi,%eax
0x000001829b18ca6f: inc %eax ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - bench.MemPressureTest$Bigish::fill@26 (line 32)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca71: mov %eax,0x18(%r10) ;*putfield position {reexecute=0 rethrow=0 return_oop=0}
; - java.nio.Buffer::nextPutIndex@25 (line 723)
; - java.nio.HeapByteBuffer::put@6 (line 209)
; - bench.MemPressureTest$Bigish::fill@22 (line 33)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca75: cmp %r8d,%ecx
0x000001829b18ca78: jae 0x000001829b18ce14
0x000001829b18ca7e: movslq %esi,%r9
0x000001829b18ca81: add %r14,%r9
0x000001829b18ca84: mov %bl,0x10(%rdi,%r9,1) ;*bastore {reexecute=0 rethrow=0 return_oop=0}
; - java.nio.HeapByteBuffer::put@13 (line 209)
; - bench.MemPressureTest$Bigish::fill@22 (line 33)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca89: cmp $0x1000,%eax
0x000001829b18ca8f: jge 0x000001829b18ca95 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - bench.MemPressureTest$Bigish::fill@14 (line 32)
; - bench.MemPressureTest::test1@28 (line 47)
0x000001829b18ca91: mov %eax,%esi
0x000001829b18ca93: jmp 0x000001829b18ca5c
That's exactly what happens in test2. Since the ByteBuffer object is external to the compilation scope, the JIT can't optimize it as freely as a local, not-yet-published object.
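To make the position side effect concrete, here is a small standalone snippet (not part of the benchmark) contrasting the relative put, which advances the position, with the absolute put, which leaves it untouched:

import java.nio.ByteBuffer;

public class PutSemantics {
    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocate(4);
        b.put((byte) 1);                  // relative put: stores at index 0, advances position to 1
        System.out.println(b.position()); // prints 1
        b.put(3, (byte) 2);               // absolute put: stores at index 3, position is unchanged
        System.out.println(b.position()); // still prints 1
    }
}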
The good news: it is possible to get the optimized loop in both cases. Just use the absolute put method instead of the relative one. With the absolute put, the position field remains unchanged, and the JIT can easily vectorize the loop without any risk of breaking ByteBuffer invariants:
for (int i = 0; i < SIZE; ++i) {
    b.put(i, bt);
}
With this change, the loop is vectorized in both cases. Even better, test2 now becomes a lot faster than test1, proving that object creation does indeed have a performance overhead.
Benchmark Mode Cnt Score Error Units
MemPressureTest.test1 thrpt 10 2447,370 ± 146,804 ops/ms
MemPressureTest.test2 thrpt 10 15677,575 ± 136,075 ops/ms
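For completeness, this is roughly what the fill method looks like with the absolute put applied (a sketch of the one-line change described above):

public void fill(byte bt) {
    b.clear(); // arguably no longer needed, since absolute puts never move the position, but kept to match the original
    for (int i = 0; i < SIZE; ++i) {
        b.put(i, bt); // absolute put: does not touch the position field
    }
    blackhole.consume(b);
}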