I have a question regarding the Java Memory Model (JMM), particularly in the context of x86 architecture, which I find quite intriguing. One of the most confusing and often debated topics is the volatile modifier.
I've heard a lot of misconceptions suggesting that volatile effectively forbids the use of cached values for fields marked with this modifier. Some even claim it prohibits the use of registers. However, as far as I understand, these are oversimplified notions. I've never encountered any instructions that explicitly forbid using caches or registers for storing such fields. I'm not even sure such behavior is technically possible.
So, my question is directed at experts in x86 architecture: what actually happens under the hood? What semantics does the volatile modifier guarantee? From what I've seen, it seems to be implemented as a full memory barrier using the LOCK prefix combined with an add of 0.
Let's settle this debate once and for all.
P.S. I'm really tired of hearing false claims from my fellow programmers about volatile. They keep repeating the same story about cache usage, and I strongly feel they are terribly mistaken!
volatile: Bytecode and Machine Instructions

This article is the final piece of a broader exploration of the volatile modifier in Java. In Part 1, we examined the origins and semantics of volatile, providing a foundational understanding of its behavior. Part 2 focused on addressing misconceptions and delving into memory structures.

Now, in this concluding installment, we will analyze the low-level implementation details, including machine-level instructions and processor-specific mechanisms, rounding out the complete picture of volatile in Java. Let’s dive in.
volatile Fields

One common assumption among developers is that the volatile modifier in Java introduces specialized bytecode instructions to enforce its semantics. Let’s examine this hypothesis with a straightforward experiment.
I created a simple Java file named VolatileTest.java containing the following code:
public class VolatileTest {
private volatile long someField;
}
Here, a single private field is declared as volatile. To investigate the bytecode, I compiled the file with the Java compiler (javac) from the Oracle OpenJDK JDK 1.8.0_431 (x86) distribution and then disassembled the resulting .class file with the javap utility, using the -v and -p flags for detailed output, including private members.
I performed two compilations: one with the volatile modifier and one without it. Below are the relevant excerpts of the bytecode for the someField variable.
With volatile:
private volatile long someField;
descriptor: J
flags: ACC_PRIVATE, ACC_VOLATILE
Without volatile:
private long someField;
descriptor: J
flags: ACC_PRIVATE
The only difference is in the flags field. The volatile modifier adds the ACC_VOLATILE flag to the field’s metadata. No additional bytecode instructions are generated.
To explore further, I examined the compiled .class files using a hex editor (ImHex Hex Editor). The binary contents of the two files were nearly identical, differing only in the value of a single byte in the access_flags field, which encodes the modifiers for each field.
For the someField variable:

With volatile: 0x0042
Without volatile: 0x0002
The difference is due to the bitmask for ACC_VOLATILE, defined as 0x0040. This demonstrates that the presence of the volatile modifier merely toggles the appropriate flag in the access_flags field.
The access_flags field is a 16-bit value that encodes various field-level modifiers. Here’s a summary of the relevant flags:

| Modifier | Bit Value | Description |
|---|---|---|
| ACC_PUBLIC | 0x0001 | Field is public. |
| ACC_PRIVATE | 0x0002 | Field is private. |
| ACC_PROTECTED | 0x0004 | Field is protected. |
| ACC_STATIC | 0x0008 | Field is static. |
| ACC_FINAL | 0x0010 | Field is final. |
| ACC_VOLATILE | 0x0040 | Field is volatile. |
| ACC_TRANSIENT | 0x0080 | Field is transient. |
| ACC_SYNTHETIC | 0x1000 | Field is compiler-generated. |
| ACC_ENUM | 0x4000 | Field is part of an enum. |
The volatile keyword’s presence in the bytecode is entirely represented by the ACC_VOLATILE flag, a single bit in the access_flags field. This minimal change emphasizes that there is no "magic" at the bytecode level: the entire behavior of volatile hangs on this one bit, which the JVM uses to enforce the necessary semantics without any additional complexity or hidden mechanisms.
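The same 0x0040 bit surfaces in the reflection API: java.lang.reflect.Modifier.VOLATILE is defined with exactly that value, so the access_flags observed in the hex editor can be confirmed from within Java. A small sketch (the class and field names here are my own, not part of the article's experiment):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class AccessFlagsDemo {
    private volatile long someField;   // ACC_PRIVATE | ACC_VOLATILE = 0x0042
    private long plainField;           // ACC_PRIVATE only            = 0x0002

    public static void main(String[] args) throws Exception {
        Field v = AccessFlagsDemo.class.getDeclaredField("someField");
        Field p = AccessFlagsDemo.class.getDeclaredField("plainField");
        // getModifiers() returns the same bitmask stored in the class file.
        System.out.printf("someField modifiers: 0x%04X%n", v.getModifiers());
        System.out.printf("plainField modifiers: 0x%04X%n", p.getModifiers());
        System.out.println("someField volatile? " + Modifier.isVolatile(v.getModifiers()));
    }
}
```

Running this prints the same 0x0042 / 0x0002 pair seen in the binary comparison above.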
Before diving into the low-level machine implementation of volatile, it is essential to understand which x86 processors this discussion pertains to and how these processors are compatible with the JVM.
When Java was first released, official support was limited to 32-bit architectures, as the JVM itself—known as the Classic VM from Sun Microsystems—was initially 32-bit. Early Java did not distinguish between editions like SE, EE, or ME; this differentiation began with Java 1.2. Consequently, the first supported x86 processors were those in the Intel 80386 family, as they were the earliest 32-bit processors in the architecture.
Intel 80386 processors, though already considered outdated at the time of Java's debut, were supported by operating systems that natively ran Java, such as Windows NT 3.51, Windows 95, and Solaris x86. These operating systems ensured compatibility with the x86 architecture and the early JVM.
Interestingly, even processors as old as the Intel 8086, the first in the x86 family, could run certain versions of the JVM, albeit with significant limitations. This was made possible through the development of Java Platform, Micro Edition (Java ME), which offered a pared-down version of Java SE. Sun Microsystems developed a specialized virtual machine called K Virtual Machine (KVM) for these constrained environments. KVM required minimal resources, with some implementations running on devices with as little as 128 kilobytes of memory.
KVM's compatibility extended to both 16-bit and 32-bit processors, including those from the x86 family. According to the Oracle documentation in "J2ME Building Blocks for Mobile Devices," KVM was suitable for devices with minimal computational power:
"These devices typically contain 16- or 32-bit processors and a minimum total memory footprint of approximately 128 kilobytes."
Additionally, it was noted that KVM could work efficiently on CISC architectures such as x86:
"KVM is suitable for 16/32-bit RISC/CISC microprocessors with a total memory budget of no more than a few hundred kilobytes (potentially less than 128 kilobytes)."
Furthermore, KVM could run on native software stacks, such as RTOS (Real-Time Operating Systems), enabling dynamic and secure Java execution. For example:
"The actual role of a KVM in target devices can vary significantly. In some implementations, the KVM is used on top of an existing native software stack to give the device the ability to download and run dynamic, interactive, secure Java content on the device."
Alternatively, KVM could function as a standalone low-level system software layer:
"In other implementations, the KVM is used at a lower level to also implement the lower-level system software and applications of the device in the Java programming language."
This flexibility ensured that even early x86 processors, often embedded in devices with constrained resources, could leverage Java technologies. For instance, the Intel 80186 processor was widely used in embedded systems running RTOS and supported multitasking through software mechanisms like timer interrupts and cooperative multitasking.
Another example is the experimental implementation of the JVM for MS-DOS systems, such as the KaffePC Java VM. While this version of the JVM allowed some level of Java execution, it excluded multithreading due to the strict single-tasking nature of MS-DOS. The absence of native multithreading in such environments highlights how certain Java features, including the guarantees provided by volatile, were often simplified, significantly modified, or omitted entirely. Despite this, as we shall see, the principles underlying volatile likely remained consistent with broader architectural concepts, ensuring applicability across diverse processor environments.
Finally, let’s delve into how volatile operations are implemented at the machine level. To illustrate this, we’ll examine a simple example in which a volatile field is assigned a value. To simplify the experiment, we’ll declare the field as static (this does not influence the outcome).
public class VolatileTest {
private static volatile long someField;
public static void main(String[] args) {
someField = 5;
}
}
This code was executed with the following JVM options:
-server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=compileonly,VolatileTest.main
The test environment includes a dynamically linked hsdis library, enabling runtime disassembly of JIT-compiled code. The -Xcomp option forces the JVM to compile all methods immediately, bypassing interpretation and allowing us to analyze the final machine instructions directly. The experiment was conducted on a 32-bit JDK 1.8, but identical results were observed across other versions and vendors of the HotSpot VM.
Here is the key assembly instruction generated for the putstatic operation targeting the volatile field:
0x026e3592: lock addl $0, (%esp) ;*putstatic someField
; - VolatileTest::main@3 (line 5)
This instruction reveals the underlying mechanism for enforcing volatile semantics during writes. Let’s dissect this line and understand its components.
The LOCK Prefix

The LOCK prefix plays a crucial role in ensuring atomicity and enforcing a memory barrier. However, since LOCK is a prefix rather than an instruction in its own right, it must be paired with another operation. Here, it is combined with the addl instruction, which performs an addition.
Why Use addl with LOCK?

- The addl instruction adds 0 to the value at the memory address held in %esp. Adding 0 does not alter the memory's actual contents, making it a non-disruptive, lightweight operation.
- %esp points to the top of the thread's stack, which is local to the thread and isolated from others. The operation therefore does not impact other threads or system-wide resources.
- Combining LOCK with a no-op arithmetic operation introduces minimal performance overhead while still triggering the required barrier side effects.

Why %esp?

The %esp register (or %rsp on 64-bit systems) serves as the stack pointer, dynamically pointing to the top of the local execution stack. Since the stack is strictly local to each thread, its memory addresses are unique across threads, ensuring isolation.
The use of %esp in this context is particularly advantageous: the stack top is always a valid, writable address, it is private to the thread, and it is very likely already resident in the cache, so the locked operation stays cheap.

volatile Semantics

The LOCK prefix enforces the strongest memory ordering guarantees, preventing any instruction reordering across the barrier. This mechanism elegantly addresses the potential issues of reordering and store buffer commits, ensuring that all preceding writes are visible before any subsequent operations.
Interestingly, no memory barrier is required for volatile reads on x86. The x86 memory model already forbids Load-Load and Load-Store reorderings, which are exactly the orderings a volatile read must rule out. Thus, the hardware guarantees are sufficient, and no additional instructions are emitted.
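The one reordering x86 does permit, Store-Load, is precisely what the lock addl barrier after a volatile write suppresses. The classic litmus test for it can be sketched in Java (my own illustration, not one of the article's experiments); note that a single run can only illustrate the guarantee, never prove it:

```java
public class StoreLoadDemo {
    // Both fields are volatile, so HotSpot emits a StoreLoad barrier after
    // each write and the Java Memory Model forbids the outcome r1 == r2 == 0.
    // With plain int fields, each thread's store could still sit in its
    // store buffer when the other thread loads, making (0, 0) observable.
    static volatile int x = 0, y = 0;

    public static void main(String[] args) throws InterruptedException {
        final int[] r = new int[2];
        Thread t1 = new Thread(() -> { x = 1; r[0] = y; });
        Thread t2 = new Thread(() -> { y = 1; r[1] = x; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Possible results: (0,1), (1,0), (1,1) - never (0,0).
        System.out.println("r1=" + r[0] + " r2=" + r[1]);
    }
}
```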
Atomicity of volatile Fields

Now, let us delve into the most intriguing aspect: ensuring atomicity for writes and reads of volatile fields. For 64-bit JVMs, this issue is less critical, since even operations on 64-bit types like long and double are inherently atomic. Nonetheless, examining how write operations are implemented in machine instructions can provide deeper insights.
For simplicity, consider the following code:
public class VolatileTest {
private static volatile long someField;
public static void main(String[] args) {
someField = 10;
}
}
Here’s the generated machine code corresponding to the write operation:
0x0000019f2dc6efdb: movabsq $0x76aea4020, %rsi
; {oop(a 'java/lang/Class' = 'VolatileTest')}
0x0000019f2dc6efe5: movabsq $0xa, %rdi
0x0000019f2dc6efef: movq %rdi, 0x20(%rsp)
0x0000019f2dc6eff4: vmovsd 0x20(%rsp), %xmm0
0x0000019f2dc6effa: vmovsd %xmm0, 0x68(%rsi)
0x0000019f2dc6efff: lock addl $0, (%rsp) ;*putstatic someField
; - VolatileTest::main@3 (line 5)
At first glance, the abundance of machine instructions directly interacting with registers might seem unnecessarily complex. However, this approach reflects specific architectural constraints and optimizations. Let us dissect these instructions step by step:
movabsq $0x76aea4020, %rsi
This instruction loads the absolute address (interpreted as a 64-bit numerical value) into the general-purpose register %rsi. From the comment, we see this address points to the class metadata object (java/lang/Class) containing information about the class and its static members. Since our volatile field is static, its address is calculated relative to this metadata object.
movabsq $0xa, %rdi
Here, the immediate value 0xa (the hexadecimal representation of 10) is loaded into the %rdi register. Since direct 64-bit memory writes of immediate values are prohibited in x86-64, this intermediate step is necessary.
movq %rdi, 0x20(%rsp)
The value from %rdi is then stored on the stack at an offset of 0x20 from the current stack pointer %rsp. This transfer stages the operand for the subsequent vmovsd instruction, which reads its source from memory rather than from a general-purpose register.
vmovsd 0x20(%rsp), %xmm0
This instruction moves the value from the stack into the SIMD register %xmm0. Although designed for floating-point operations, vmovsd handles raw 64-bit bit patterns just as well. The apparent redundancy here (loading and storing via the stack) is a trade-off for leveraging AVX optimizations, which can boost performance on modern microarchitectures such as Sandy Bridge and later.
vmovsd %xmm0, 0x68(%rsi)
The value in %xmm0 is stored in memory at the address calculated relative to %rsi (offset 0x68). This is the actual write to the volatile field.
lock addl $0, (%rsp)
The lock prefix ensures atomicity by locking the cache line corresponding to the specified memory address during execution. While addl $0 appears redundant, it serves as a lightweight no-op that enforces a full memory barrier, preventing reordering and ensuring visibility across threads.
Consider the following extended code:
public class VolatileTest {
private static volatile long someField;
public static void main(String[] args) {
someField = 10;
someField = 11;
someField = 12;
}
}
For this sequence, the compiler inserts a memory barrier after each write:
0x0000029ebe499bdb: movabsq $0x76aea4070, %rsi
; {oop(a 'java/lang/Class' = 'VolatileTest')}
0x0000029ebe499be5: movabsq $0xa, %rdi
0x0000029ebe499bef: movq %rdi, 0x20(%rsp)
0x0000029ebe499bf4: vmovsd 0x20(%rsp), %xmm0
0x0000029ebe499bfa: vmovsd %xmm0, 0x68(%rsi)
0x0000029ebe499bff: lock addl $0, (%rsp) ;*putstatic someField
; - VolatileTest::main@3 (line 5)
0x0000029ebe499c04: movabsq $0xb, %rdi
0x0000029ebe499c0e: movq %rdi, 0x28(%rsp)
0x0000029ebe499c13: vmovsd 0x28(%rsp), %xmm0
0x0000029ebe499c19: vmovsd %xmm0, 0x68(%rsi)
0x0000029ebe499c1e: lock addl $0, (%rsp) ;*putstatic someField
; - VolatileTest::main@9 (line 6)
0x0000029ebe499c23: movabsq $0xc, %rdi
0x0000029ebe499c2d: movq %rdi, 0x30(%rsp)
0x0000029ebe499c32: vmovsd 0x30(%rsp), %xmm0
0x0000029ebe499c38: vmovsd %xmm0, 0x68(%rsi)
0x0000029ebe499c3d: lock addl $0, (%rsp) ;*putstatic someField
; - VolatileTest::main@15 (line 7)
- A lock addl instruction follows each write, ensuring proper visibility and preventing reordering.
- None of the three stores is coalesced or eliminated: every write to the field must become visible, as required for volatile.

In summary, the intricate sequence of operations underscores the JVM’s efforts to balance atomicity, performance, and compliance with the Java Memory Model.
When running the example code on a 32-bit JVM, the behavior differs significantly due to hardware constraints inherent to 32-bit architectures. Let’s dissect the observed assembly code:
0x02e837f0: movl $0x2f62f848, %esi
; {oop(a 'java/lang/Class' = 'VolatileTest')}
0x02e837f5: movl $0xa, %edi
0x02e837fa: movl $0, %ebx
0x02e837ff: movl %edi, 0x10(%esp)
0x02e83803: movl %ebx, 0x14(%esp)
0x02e83807: vmovsd 0x10(%esp), %xmm0
0x02e8380d: vmovsd %xmm0, 0x58(%esi)
0x02e83812: lock addl $0, (%esp) ;*putstatic someField
; - VolatileTest::main@3 (line 5)
Unlike their 64-bit counterparts, 32-bit general-purpose registers such as %esi and %edi lack the capacity to hold 64-bit values directly. As a result, long values in 32-bit environments are processed in two separate parts: the lower 32 bits ($0xa in this case) and the upper 32 bits ($0). Each part is loaded into a separate 32-bit register and later combined for further processing. This limitation inherently increases the complexity of ensuring atomic operations.
Despite the constraints of 32-bit general-purpose registers, SIMD registers such as %xmm0 offer a workaround. The vmovsd instruction loads the full 64-bit value into %xmm0 atomically. The two halves of the long value, previously placed on the stack at offsets 0x10(%esp) and 0x14(%esp), are accessed as a unified 64-bit value during this operation. This highlights the JVM’s use of modern instruction sets like AVX for both compatibility and performance on 32-bit architectures.
This is essentially the same unified approach as on 64-bit systems, but here it is driven by necessity: without 64-bit general-purpose registers, the available options are significantly reduced.
Why Not Use LOCK Selectively?

In 32-bit systems, reads and writes of 64-bit values are performed in two instructions rather than one, which inherently breaks atomicity even if each half carried the LOCK prefix. While it might seem logical to rely on LOCK with its bus-locking capabilities, it is avoided in such scenarios whenever possible due to its substantial performance impact.
To favor non-blocking mechanisms, the JVM relies on SIMD instructions involving the XMM registers. In our example, the values $0xa and $0 (the lower and upper 32-bit halves of the 64-bit long value) are first loaded into two 32-bit general-purpose registers and stored sequentially on the stack; the vmovsd instruction then transfers them as a single 64-bit value in one atomic access.
What happens if the processor lacks AVX support? By disabling AVX explicitly (-XX:UseAVX=0), we simulate an environment without AVX functionality. The resulting changes in the assembly are:
0x02da3507: movsd 0x10(%esp), %xmm0
0x02da350d: movsd %xmm0, 0x58(%esi)
This shows that the approach remains fundamentally the same. However, the vmovsd instruction is replaced with the older movsd from the SSE instruction set. While movsd lacks the performance enhancements of AVX and operates as a two-operand instruction, it serves the same purpose effectively when AVX is unavailable.
If SSE support is also disabled (-XX:UseSSE=0), the fallback mechanism relies on the Floating Point Unit (FPU):
0x02bc2449: fildll 0x10(%esp)
0x02bc244d: fistpll 0x58(%esi)
Here, the fildll and fistpll instructions load and store the value directly through the FPU register stack, bypassing the need for SIMD registers. Unlike typical FPU operations involving 80-bit extended precision, this pair keeps the value a raw 64-bit integer, avoiding unnecessary conversions.
For processors such as the Intel 80486SX or 80386 without integrated coprocessors, the situation becomes even more challenging. These processors lack instructions like CMPXCHG8B (introduced with the Intel Pentium) and thus any native 64-bit atomicity mechanism. In such cases, ensuring atomicity requires software-based solutions, such as OS-level mutex locks, which are significantly heavier and less efficient.
Finally, let’s examine the behavior during a read operation, such as when retrieving a value for display. The following assembly demonstrates the process:
0x02e62346: fildll 0x58(%ecx)
0x02e62349: fistpll 0x18(%esp) ;*getstatic someField
; - VolatileTest::main@9 (line 7)
0x02e6234d: movl 0x18(%esp), %edi
0x02e62351: movl 0x1c(%esp), %ecx
0x02e62355: movl %edi, (%esp)
0x02e62358: movl %ecx, 4(%esp)
0x02e6235c: movl %esi, %ecx ;*invokevirtual println
; - VolatileTest::main@12 (line 7)
The read operation essentially mirrors the write process in reverse. The value is loaded from memory (e.g., 0x58(%ecx)) onto the FPU stack (ST0), then safely written to the thread's stack. Since the stack is inherently thread-local, this intermediate step ensures that any further operations on the value are thread-safe.
All experiments and observations in this article were conducted using the following hardware and software configuration:

Operating System

Processor

Java Development Kit (JDK)

Two versions of Oracle JDK 1.8.0_431 were used during the experiments: the 32-bit (x86) and 64-bit (x64) builds.

JVM Settings

The following JVM options were applied:

-server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=compileonly,VolatileTest.main

Tools

- The javap disassembler with the -v and -p flags.
- The ImHex hex editor.
- The hsdis disassembly plugin for HotSpot.

This comprehensive exploration highlights the JVM's remarkable adaptability in enforcing volatile semantics across a range of architectures and processor capabilities. From AVX and SSE to FPU-based fallbacks, each approach balances performance, hardware limitations, and atomicity.
Thank you for accompanying me on this deep dive into volatile. This analysis has answered many questions and broadened my understanding of low-level JVM implementations. I hope it has been equally insightful for you!