java multithreading x86 volatile java-memory-model

Understanding the volatile Modifier in the Context of x86 Architecture and the Java Memory Model (JMM)


I have a question regarding the Java Memory Model (JMM), particularly in the context of x86 architecture, which I find quite intriguing. One of the most confusing and often debated topics is the volatile modifier.

I've heard a lot of misconceptions suggesting that volatile effectively forbids the use of cached values for fields marked with this modifier. Some even claim it prohibits the use of registers. However, as far as I understand, these are oversimplified notions. I've never encountered any instructions that explicitly forbid using caches or registers for storing such fields. I'm not even sure if such behavior is technically possible.

So, my question is directed at experts in x86 architecture: What actually happens under the hood? What semantics does the volatile modifier guarantee? From what I've seen, it seems to implement a full memory barrier using the LOCK prefix combined with the add 0 instruction.

Let's settle this debate once and for all.

P.S. I'm really tired of hearing false claims from my fellow programmers about volatile. They keep repeating the same story about cache usage, and I strongly feel they are terribly mistaken!

I have researched the Java Memory Model (JMM) and the use of the volatile modifier. I expected to find clear explanations on how volatile works in the context of x86 architecture, specifically regarding its impact on caching and register usage. However, I encountered conflicting information and misconceptions. I am seeking clarification from experts to understand the true semantics and behavior of volatile on x86 systems.


Solution

  • Low-Level Implementation of volatile: Bytecode and Machine Instructions

    This article represents the final piece of a broader exploration into the volatile modifier in Java. In Part 1, we examined the origins and semantics of volatile, providing a foundational understanding of its behavior. Part 2 focused on addressing misconceptions and delving into memory structures.

    Now, in this conclusive installment, we will analyze the low-level implementation details, including machine-level instructions and processor-specific mechanisms, rounding out the complete picture of volatile in Java. Let’s dive in.


    Exploring the Bytecode for volatile Fields

    One common assumption among developers is that the volatile modifier in Java introduces specialized bytecode instructions to enforce its semantics. Let’s examine this hypothesis with a straightforward experiment.

    Experimental Setup

    I created a simple Java file named VolatileTest.java containing the following code:

    public class VolatileTest {
        private volatile long someField;
    }
    

    Here, a single private field is declared as volatile. To investigate the bytecode, I compiled the file with the Java compiler (javac) from the Oracle JDK 1.8.0_431 (x86) distribution and then disassembled the resulting .class file with the javap utility, passing -v for verbose output and -p to include private members.

    Comparing Results

    I performed two compilations: one with the volatile modifier and one without it. Below are the relevant excerpts of the bytecode for the someField variable:

    With volatile:

      private volatile long someField;
        descriptor: J
        flags: ACC_PRIVATE, ACC_VOLATILE
    

    Without volatile:

      private long someField;
        descriptor: J
        flags: ACC_PRIVATE
    

    The only difference is in the flags field. The volatile modifier adds the ACC_VOLATILE flag to the field’s metadata. No additional bytecode instructions are generated.

    Hexadecimal Analysis

    To explore further, I examined the compiled .class files using a hex editor (ImHex Hex Editor). The binary contents of the two files were nearly identical, differing only in the value of a single byte in the access_flags field, which encodes the modifiers for each field.

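    For the someField variable, the access_flags value is 0x0042 with volatile (ACC_PRIVATE | ACC_VOLATILE, that is 0x0002 | 0x0040) and 0x0002 without it.

    The difference is due to the bitmask for ACC_VOLATILE, defined as 0x0040. This demonstrates that the presence of the volatile modifier merely toggles the appropriate flag in the access_flags field.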
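The same two values can be observed at run time through reflection. The following is a minimal sketch, not part of the original experiment: the class AccessFlagsDemo and its nested Holder class are hypothetical names introduced only for illustration, and the sketch assumes that Field.getModifiers() reports the same bits that javap shows for these two fields.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class AccessFlagsDemo {
    // Hypothetical holder that mirrors the article's VolatileTest class.
    static final class Holder {
        private volatile long someField;
        private long plainField;
    }

    // Returns the modifier bits of a field declared in Holder.
    static int modifiersOf(String fieldName) {
        try {
            Field f = Holder.class.getDeclaredField(fieldName);
            return f.getModifiers();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.printf("someField:  0x%04x%n", modifiersOf("someField"));
        System.out.printf("plainField: 0x%04x%n", modifiersOf("plainField"));
        // Modifier.VOLATILE is the same 0x0040 mask as ACC_VOLATILE.
        System.out.printf("VOLATILE:   0x%04x%n", Modifier.VOLATILE);
    }
}
```

The two results differ only in the 0x0040 bit, which matches the single-byte difference seen in the hex editor.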

    Modifiers and Flags

    The access_flags field is a 16-bit value that encodes various field-level modifiers. Here’s a summary of relevant flags:

    Modifier        Bit Value   Description
    ACC_PUBLIC      0x0001      Field is public.
    ACC_PRIVATE     0x0002      Field is private.
    ACC_PROTECTED   0x0004      Field is protected.
    ACC_STATIC      0x0008      Field is static.
    ACC_FINAL       0x0010      Field is final.
    ACC_VOLATILE    0x0040      Field is volatile.
    ACC_TRANSIENT   0x0080      Field is transient.
    ACC_SYNTHETIC   0x1000      Field is compiler-generated.
    ACC_ENUM        0x4000      Field is part of an enum.
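To make the encoding concrete, here is a small sketch that decodes an access_flags value into the flag names from the table. FieldFlagDecoder is a hypothetical helper written for this article, not a JDK class; the masks are copied from the table above.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldFlagDecoder {
    // Names and masks copied from the table of field access flags.
    private static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_PRIVATE", "ACC_PROTECTED", "ACC_STATIC", "ACC_FINAL",
        "ACC_VOLATILE", "ACC_TRANSIENT", "ACC_SYNTHETIC", "ACC_ENUM"
    };
    private static final int[] MASKS = {
        0x0001, 0x0002, 0x0004, 0x0008, 0x0010,
        0x0040, 0x0080, 0x1000, 0x4000
    };

    // Lists every flag whose bit is set in the given 16-bit value.
    static List<String> decode(int accessFlags) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((accessFlags & MASKS[i]) != 0) {
                result.add(NAMES[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0042)); // value seen with volatile
        System.out.println(decode(0x0002)); // value seen without volatile
    }
}
```

Decoding 0x0042 yields ACC_PRIVATE and ACC_VOLATILE, while 0x0002 yields ACC_PRIVATE alone.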

    Implications

    At the class-file level, the volatile keyword is represented entirely by the ACC_VOLATILE flag, a single bit in the access_flags field. There is no "magic" in the bytecode: code that accesses the field uses the same getfield, putfield, getstatic, and putstatic instructions as for a non-volatile field. The JVM reads this bit when it resolves the field and enforces the required semantics itself, in the interpreter and in the JIT compilers.


    x86 Processors and JVM Compatibility

    Before diving into the low-level machine implementation of volatile, it is essential to understand which x86 processors this discussion pertains to and how these processors are compatible with the JVM.

    Early JVM and x86 Support

    When Java was first released, official support was limited to 32-bit architectures, as the JVM itself—known as the Classic VM from Sun Microsystems—was initially 32-bit. Early Java did not distinguish between editions like SE, EE, or ME; this differentiation began with Java 1.2. Consequently, the first supported x86 processors were those in the Intel 80386 family, as they were the earliest 32-bit processors in the architecture.

    Intel 80386 processors, though already considered outdated at the time of Java's debut, were supported by operating systems that natively ran Java, such as Windows NT 3.51, Windows 95, and Solaris x86. These operating systems ensured compatibility with the x86 architecture and the early JVM.

    Compatibility with Older x86 Processors

    Interestingly, even processors as old as the Intel 8086, the first in the x86 family, could run certain versions of the JVM, albeit with significant limitations. This was made possible through the development of Java Platform, Micro Edition (Java ME), which offered a pared-down version of Java SE. Sun Microsystems developed a specialized virtual machine called K Virtual Machine (KVM) for these constrained environments. KVM required minimal resources, with some implementations running on devices with as little as 128 kilobytes of memory.

    KVM's compatibility extended to both 16-bit and 32-bit processors, including those from the x86 family. According to the Oracle documentation in "J2ME Building Blocks for Mobile Devices," KVM was suitable for devices with minimal computational power:

    "These devices typically contain 16- or 32-bit processors and a minimum total memory footprint of approximately 128 kilobytes."

    Additionally, it was noted that KVM could work efficiently on CISC architectures such as x86:

    "KVM is suitable for 16/32-bit RISC/CISC microprocessors with a total memory budget of no more than a few hundred kilobytes (potentially less than 128 kilobytes)."

    Furthermore, KVM could run on native software stacks, such as RTOS (Real-Time Operating Systems), enabling dynamic and secure Java execution. For example:

    "The actual role of a KVM in target devices can vary significantly. In some implementations, the KVM is used on top of an existing native software stack to give the device the ability to download and run dynamic, interactive, secure Java content on the device."

    Alternatively, KVM could function as a standalone low-level system software layer:

    "In other implementations, the KVM is used at a lower level to also implement the lower-level system software and applications of the device in the Java programming language."

    This flexibility ensured that even early x86 processors, often embedded in devices with constrained resources, could leverage Java technologies. For instance, the Intel 80186 processor was widely used in embedded systems running RTOS and supported multitasking through software mechanisms like timer interrupts and cooperative multitasking.

    Another example is the experimental implementation of the JVM for MS-DOS systems, such as the KaffePC Java VM. While this version of the JVM allowed for some level of Java execution, it excluded multithreading due to the strict single-tasking nature of MS-DOS. Without threads there is nothing for volatile to synchronize, which is why its guarantees in such environments were often simplified, significantly modified, or omitted entirely.

    Implications for volatile

    While volatile semantics were often simplified or omitted in these constrained environments, the core principles likely remained consistent with modern implementations. As our exploration will show, the fundamental ideas behind volatile behavior are deeply rooted in universal architectural concepts, making them applicable across diverse x86 processors.


    Low-Level Solution to Reordering and Store Buffer Commit Issues

    Finally, let’s delve into how volatile operations are implemented at the machine level. To illustrate this, we’ll examine a simple example where a volatile field is assigned a value. To simplify the experiment, we’ll declare the field as static (this does not influence the outcome).

    public class VolatileTest {
        private static volatile long someField;
    
        public static void main(String[] args) {
            someField = 5;
        }
    }
    

    This code was executed with the following JVM options: -server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=compileonly,VolatileTest.main

    Experimental Setup and Execution Context

    The test environment includes a dynamically linked hsdis library, enabling runtime disassembly of JIT-compiled code. The -Xcomp option forces the JVM to compile all code immediately, bypassing interpretation and allowing us to directly analyze the final machine instructions. The experiment was conducted on a 32-bit JDK 1.8, but identical results were observed across other versions and vendors of the HotSpot VM.

    Here is the key assembly instruction generated for the putstatic operation targeting the volatile field:

    0x026e3592: lock addl $0, (%esp)  ;*putstatic someField
                                      ; - VolatileTest::main@3 (line 5)
    

    This instruction reveals the underlying mechanism for enforcing the volatile semantics during writes. Let’s dissect this line and understand its components.

    Breaking Down the LOCK Prefix

    The LOCK prefix plays a crucial role in ensuring atomicity and enforcing a memory barrier. However, since LOCK is a prefix and not an instruction by itself, it must be paired with a read-modify-write operation on memory. Here, it is combined with addl, which adds the constant 0 to the word at the top of the stack: the memory contents stay unchanged, so the only effect that matters is the barrier.

    Why Use addl with LOCK?

    Key Properties of the Stack and %esp

    The %esp register (or %rsp in 64-bit systems) serves as the stack pointer, dynamically pointing to the top of the local execution stack. Since the stack is strictly local to each thread, its memory addresses are unique across threads, ensuring isolation.

    The use of %esp in this context is particularly advantageous:

    1. Isolation: Stack memory is inherently private to a thread, preventing cross-thread interference.
    2. Dynamic Adaptation: The stack pointer updates automatically as the stack grows or shrinks, so the barrier always has a valid, writable target and no dedicated dummy location has to be allocated.
    3. Safety: The memory being "locked" is not shared, and the top of the stack is normally already in the executing core's L1 cache, so the locked operation avoids contention with other cores.

    Achieving the volatile Semantics

    The LOCK prefix ensures:

    1. Atomicity: The read-modify-write executes as one indivisible operation; no other thread or processor can access the target location until it completes.
    2. Memory Barrier Semantics: A locked instruction acts as a full fence on x86. Loads and stores that precede it in program order complete before any load or store that follows it, so these accesses cannot be reordered across the barrier.

    However, the mechanism does not enforce a complete draining of the store buffer in all cases. Only the stores that precede the barrier in program order (PO) are guaranteed to be committed to the coherent cache (L1d). This means the draining process is partial: it applies only to the stores that must be visible to subsequent operations as mandated by the memory model. Stores prepared for later commits (but not preceding the barrier) remain in the buffer until their turn in PO arrives.

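    This nuanced behavior explains why the LOCK prefix does not block all instructions: it orders memory accesses around the locked operation rather than halting everything that follows.

    In summary, the LOCK prefix provides targeted control over memory ordering and visibility, ensuring that the stores preceding the barrier in program order are committed to the coherent cache before any subsequent load or store takes effect.

    This mechanism helps address issues related to reordering and store buffer visibility but operates selectively, without enforcing a complete halt on all subsequent operations.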
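The reordering that this barrier rules out can be illustrated at the Java level with the classic store-then-load test. The sketch below is an illustration under stated assumptions, not the article's own experiment: StoreLoadLitmus and its fields are hypothetical names. Each thread writes its own volatile flag and then reads the other one. Without a Store-Load barrier, both threads could read 0; with volatile fields, the Java Memory Model forbids that outcome.

```java
final class StoreLoadLitmus {
    static volatile int x, y; // the two flags, written by different threads
    static int r1, r2;        // what each thread observed

    // Runs the test repeatedly and counts the forbidden (0, 0) outcome.
    static int countBothZero(int iterations) {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++) {
            x = 0;
            y = 0;
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start();
            b.start();
            try {
                a.join();
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (r1 == 0 && r2 == 0) {
                bothZero++;
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println("Forbidden (0, 0) outcomes: " + countBothZero(2000));
    }
}
```

A run that reports zero does not prove anything by itself; the guarantee comes from the memory model, and on x86 HotSpot it is the lock addl after each volatile store that implements it. Removing volatile from x and y makes the (0, 0) outcome legal, although it may still be rare in practice.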

    A Note on Read Operations

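    Interestingly, no memory barrier instruction is required for volatile reads on x86 architectures. A volatile read needs acquire semantics: later loads and stores must not be reordered before it (Load-Load and Load-Store ordering). The x86 memory model already prohibits both of these reorderings; the only reordering it permits is a later load passing an earlier store (Store-Load), which is handled on the write side. The JIT compiler still must not reorder or cache the access itself, but it has no extra instruction to emit.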
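What the read-side guarantee means for Java code can be shown with a simple publication pattern. This is a minimal sketch with hypothetical names (PublicationDemo, payload, ready): a plain field is written before a volatile flag, and the reader accesses the plain field only after it has seen the flag.

```java
final class PublicationDemo {
    static int payload;            // plain field, published via the flag
    static volatile boolean ready; // volatile flag

    // Publishes 42 from the calling thread and returns what the reader saw.
    static int publishAndRead() {
        payload = 0;
        ready = false;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {
                // Spin: every iteration performs a fresh volatile read.
            }
            seen[0] = payload; // ordered after the volatile read of ready
        });
        reader.start();
        payload = 42;  // plain store
        ready = true;  // volatile store, followed by the barrier on x86
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("Reader saw payload = " + publishAndRead());
    }
}
```

Because the read of ready is volatile, the JIT may not hoist it out of the loop, and the read of payload may not move ahead of it. On x86 the hardware already preserves that order, so the compiled loop contains ordinary loads only.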


    Atomicity of Writes and Reads in volatile Fields

    Now, let us delve into the most intriguing aspect: ensuring atomicity for writes and reads of volatile fields. For 64-bit JVMs, this issue is less critical, since naturally aligned 64-bit loads and stores, including those of long and double values, are atomic on x86-64. Nonetheless, examining how write operations are typically implemented in machine instructions can provide deeper insights.

    Code Example

    For simplicity, consider the following code:

    public class VolatileTest {
        private static volatile long someField;
    
        public static void main(String[] args) {
            someField = 10;
        }
    }
    

    Assembly Analysis of Write Operations

    Here’s the generated machine code corresponding to the write operation:

    0x0000019f2dc6efdb: movabsq       $0x76aea4020, %rsi
                                                ;   {oop(a 'java/lang/Class' = 'VolatileTest')}
    0x0000019f2dc6efe5: movabsq       $0xa, %rdi
    0x0000019f2dc6efef: movq          %rdi, 0x20(%rsp)
    0x0000019f2dc6eff4: vmovsd        0x20(%rsp), %xmm0
    0x0000019f2dc6effa: vmovsd        %xmm0, 0x68(%rsi)
    0x0000019f2dc6efff: lock addl     $0, (%rsp)  ;*putstatic someField
                                                ; - VolatileTest::main@3 (line 5)
    

    At first glance, the abundance of machine instructions directly interacting with registers might seem unnecessarily complex. However, this approach reflects specific architectural constraints and optimizations. Let us dissect these instructions step by step:

    Detailed Breakdown of Instructions

    1. movabsq $0x76aea4020, %rsi

      This instruction loads a 64-bit immediate, here an absolute address, into the general-purpose register %rsi. From the comment, we see that this address is an oop referring to the java.lang.Class instance (the class mirror) of VolatileTest. Since our volatile field is static, it is addressed at a fixed offset from this object.

      • movabsq is the form of mov that carries a full 64-bit immediate, so any address in the 64-bit address space can be encoded directly in the instruction.
    2. movabsq $0xa, %rdi

      Here, the immediate value 0xa (hexadecimal representation of 10) is loaded into the %rdi register. x86-64 has no instruction that stores a full 64-bit immediate directly to memory (a mov to memory accepts at most a sign-extended 32-bit immediate), so constants are normally staged in a register first.

    3. movq %rdi, 0x20(%rsp)

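      The value from %rdi is then stored on the stack at an offset of 0x20 from the current stack pointer %rsp. The stack slot serves as a staging area from which the next instruction loads the value into a SIMD register. This round trip is a choice of the code generator rather than an architectural requirement, since x86-64 can also move data directly between general-purpose and XMM registers with movq.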

    4. vmovsd 0x20(%rsp), %xmm0

      This instruction moves the value from the stack into the SIMD register %xmm0. Although movsd is designed for scalar double-precision values, it copies the 64 bits unchanged, so it works equally well for the bit pattern of a long. The v prefix denotes the VEX (AVX) encoding of the instruction, which HotSpot emits when AVX is available. The detour through the stack and %xmm0 appears to mirror the approach that is a necessity on 32-bit systems, discussed later in this article.

    5. vmovsd %xmm0, 0x68(%rsi)

      The value in %xmm0 is stored in memory at the address calculated relative to %rsi (0x68 offset). This represents the actual write operation to the volatile field.

    6. lock addl $0, (%rsp)

      The lock prefix makes the read-modify-write atomic by holding the cache line of the target address exclusively for the duration of the instruction. While addl $0 changes nothing in memory, a locked instruction acts as a full memory barrier on x86, preventing reordering and ensuring that the preceding store becomes visible to other threads. HotSpot uses this idiom in place of mfence because it is typically cheaper.

    Multiple Writes and Memory Barriers

    Consider the following extended code:

    public class VolatileTest {
        private static volatile long someField;
    
        public static void main(String[] args) {
            someField = 10;
            someField = 11;
            someField = 12;
        }
    }
    

    For this sequence, the compiler inserts a memory barrier after each write:

    0x0000029ebe499bdb: movabsq       $0x76aea4070, %rsi
                                                ;   {oop(a 'java/lang/Class' = 'VolatileTest')}
    0x0000029ebe499be5: movabsq       $0xa, %rdi
    0x0000029ebe499bef: movq          %rdi, 0x20(%rsp)
    0x0000029ebe499bf4: vmovsd        0x20(%rsp), %xmm0
    0x0000029ebe499bfa: vmovsd        %xmm0, 0x68(%rsi)
    0x0000029ebe499bff: lock addl     $0, (%rsp)  ;*putstatic someField
                                                ; - VolatileTest::main@3 (line 5)
    
    0x0000029ebe499c04: movabsq       $0xb, %rdi
    0x0000029ebe499c0e: movq          %rdi, 0x28(%rsp)
    0x0000029ebe499c13: vmovsd        0x28(%rsp), %xmm0
    0x0000029ebe499c19: vmovsd        %xmm0, 0x68(%rsi)
    0x0000029ebe499c1e: lock addl     $0, (%rsp)  ;*putstatic someField
                                                ; - VolatileTest::main@9 (line 6)
    
    0x0000029ebe499c23: movabsq       $0xc, %rdi
    0x0000029ebe499c2d: movq          %rdi, 0x30(%rsp)
    0x0000029ebe499c32: vmovsd        0x30(%rsp), %xmm0
    0x0000029ebe499c38: vmovsd        %xmm0, 0x68(%rsi)
    0x0000029ebe499c3d: lock addl     $0, (%rsp)  ;*putstatic someField
                                                ; - VolatileTest::main@15 (line 7)
    

    Observations

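    Each of the three stores is followed by its own lock addl: the JIT neither merges the writes nor drops the earlier ones, even though only the last value survives. The address of the class mirror in %rsi is loaded once and reused, while each constant is staged in a separate stack slot (0x20, 0x28, 0x30).

    In summary, the intricate sequence of operations underscores the JVM’s efforts to balance atomicity, performance, and compliance with the Java Memory Model.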


    For 32-Bit Systems: A Unique Challenge

    When running the example code on a 32-bit JVM, the behavior differs significantly due to hardware constraints inherent to 32-bit architectures. Let’s dissect the observed assembly code:

    0x02e837f0: movl        $0x2f62f848, %esi
                                            ;   {oop(a 'java/lang/Class' = 'VolatileTest')}
    0x02e837f5: movl        $0xa, %edi
    0x02e837fa: movl        $0, %ebx
    0x02e837ff: movl        %edi, 0x10(%esp)
    0x02e83803: movl        %ebx, 0x14(%esp)
    0x02e83807: vmovsd      0x10(%esp), %xmm0
    0x02e8380d: vmovsd      %xmm0, 0x58(%esi)
    0x02e83812: lock addl   $0, (%esp)  ;*putstatic someField
                                            ; - VolatileTest::main@3 (line 5)
    

    Register Constraints in 32-Bit Systems

    Unlike their 64-bit counterparts, 32-bit general-purpose registers such as %esi and %edi lack the capacity to directly handle 64-bit values. As a result, long values in 32-bit environments are processed in two separate parts: the lower 32 bits ($0xa in this case) and the upper 32 bits ($0). Each part is loaded into a separate 32-bit register and later combined for further processing. This limitation inherently increases the complexity of ensuring atomic operations.
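The split into two halves is easy to reproduce in Java. The following sketch uses a hypothetical helper class, LongHalves, to show how a long decomposes into the same low and high 32-bit words that appear in the listing, and what a value looks like when the halves come from two different writes.

```java
final class LongHalves {
    // Lower 32 bits, as loaded into %edi in the listing.
    static int low(long v) {
        return (int) v;
    }

    // Upper 32 bits, as loaded into %ebx in the listing.
    static int high(long v) {
        return (int) (v >>> 32);
    }

    // Reassembles a 64-bit value from its two halves.
    static long combine(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long v = 10L;
        System.out.printf("low = 0x%x, high = 0x%x%n", low(v), high(v));

        // A torn value: high half of one write (-1L), low half of another (0L).
        long torn = combine(high(-1L), low(0L));
        System.out.printf("torn = 0x%016x%n", torn);
    }
}
```

The torn value is neither of the two values that were actually written, which is exactly the outcome that an atomic 64-bit store has to rule out.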

    Atomicity Using SIMD Registers

    Despite the constraints of 32-bit general-purpose registers, SIMD registers such as %xmm0 offer a workaround. The two halves of the long value, previously placed on the stack at offsets 0x10(%esp) and 0x14(%esp), are read as one 64-bit value by the first vmovsd. The second vmovsd then writes all 64 bits to the field at 0x58(%esi) in a single store, and it is this single aligned store that makes the write atomic: no other thread can observe one half updated without the other.

    Here we see the same store-through-XMM pattern as on 64-bit systems, but driven by necessity: without 64-bit general-purpose registers, no single integer move can write all 64 bits of the field at once.

    Why Use LOCK Selectively?

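    With general-purpose registers alone, a 32-bit system has to read or write a 64-bit value with two separate 32-bit instructions, and another thread may observe the state between them. The LOCK prefix cannot repair this: it applies to a single read-modify-write instruction, not to a pair of plain moves. A locked 64-bit operation such as lock cmpxchg8b could be used instead, but locked instructions carry a substantial performance cost, so they are avoided whenever a cheaper alternative exists.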

    To keep the fast path non-blocking, the JIT relies on SIMD instructions that operate on XMM registers. In our example, two movl instructions load the values $0xa and $0 (the lower and upper 32-bit halves of the 64-bit long value) into two different 32-bit registers. These are then stored to adjacent stack slots, read back as one 64-bit value with vmovsd, and written to the field with a second vmovsd as a single 64-bit store.
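The guarantee that this instruction sequence implements can be checked from Java. The sketch below is a hedged illustration with hypothetical names (VolatileLongTearing, shared): a writer alternates between all-zero and all-one bit patterns, and a reader counts any value that is neither of them. For a volatile long the Java Language Specification requires reads and writes to be atomic, so the count must stay at zero on any conforming JVM, 32-bit ones included.

```java
final class VolatileLongTearing {
    static volatile long shared;

    // Counts reads that return a mix of two different writes.
    static long countTornReads(int writes) {
        shared = 0L;
        Thread writer = new Thread(() -> {
            for (int i = 0; i < writes; i++) {
                // Alternate between 0xFFFFFFFFFFFFFFFF and 0x0000000000000000.
                shared = (i & 1) == 0 ? -1L : 0L;
            }
        });
        writer.start();
        long torn = 0;
        while (writer.isAlive()) {
            long v = shared; // a single volatile 64-bit read
            if (v != 0L && v != -1L) {
                torn++;
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("Torn reads: " + countTornReads(5_000_000));
    }
}
```

Dropping the volatile modifier removes the guarantee: a non-volatile long may legally be written as two 32-bit halves, and on a 32-bit JVM such a test can then report torn values.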

    Simulating the Absence of AVX

    What happens if the processor lacks AVX support? By disabling AVX explicitly (-XX:UseAVX=0), we simulate an environment without AVX functionality. The resulting changes in the assembly are:

    0x02da3507: movsd       0x10(%esp), %xmm0 
    0x02da350d: movsd       %xmm0, 0x58(%esi)
    

    This highlights that the approach remains fundamentally the same. However, the vmovsd instruction is replaced with the older movsd from the SSE2 instruction set. movsd uses the legacy SSE encoding instead of the VEX encoding, but it moves the same 64 bits and serves the same purpose when AVX is unavailable.

    When SSE is Unavailable

    If SSE support is also disabled (-XX:UseSSE=0), the fallback mechanism relies on the Floating Point Unit (FPU):

    0x02bc2449: fildll      0x10(%esp)
    0x02bc244d: fistpll     0x58(%esi)
    

    Here, the fildll and fistpll instructions load the value onto the FPU register stack and store it back to memory, bypassing the need for SIMD registers. fildll converts the 64-bit integer to the FPU's 80-bit extended-precision format, whose 64-bit significand can represent every 64-bit integer exactly, and fistpll converts it back. The round trip is therefore lossless, and each instruction accesses memory as a single 64-bit operand.
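Java has no access to the 80-bit extended format, but the importance of the wide significand can be shown by contrast. The sketch below, using the hypothetical class SignificandDemo, passes a long through double, whose significand has only 53 bits. This is not what the FPU fallback does; it only illustrates why a 64-bit integer cannot be routed through an ordinary 64-bit floating-point value without risking changes.

```java
final class SignificandDemo {
    // Round-trips a long through double (53-bit significand).
    static long viaDouble(long v) {
        return (long) (double) v;
    }

    public static void main(String[] args) {
        long small = 10L;
        long large = (1L << 53) + 1; // needs 54 significant bits

        System.out.println(viaDouble(small) == small); // small values survive
        System.out.println(viaDouble(large) == large); // low bit is rounded away
    }
}
```

A format with a 64-bit significand has room for all bits of a long, which is why the fildll and fistpll pair can serve as a plain 64-bit copy.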

    The Challenge of Systems Without an FPU

    For processors such as the Intel 80486SX or 80386 without integrated coprocessors, the situation becomes even more challenging. These processors lack native instructions like CMPXCHG8B (introduced in the Intel Pentium series) and 64-bit atomicity mechanisms. In such cases, ensuring atomicity requires software-based solutions, such as OS-level mutex locks, which are significantly heavier and less efficient.
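A software-based solution of this kind can be sketched in Java itself. The class LockedLong below is hypothetical and is not taken from any JVM implementation; it only shows the idea of guarding two 32-bit halves with a mutex so that readers never see a half-written value.

```java
final class LockedLong {
    private int high; // upper 32 bits, guarded by this object's monitor
    private int low;  // lower 32 bits, guarded by this object's monitor

    // Writes both halves while holding the lock.
    synchronized void set(long v) {
        high = (int) (v >>> 32);
        low = (int) v;
    }

    // Reads both halves while holding the same lock.
    synchronized long get() {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        LockedLong cell = new LockedLong();
        cell.set(0x123456789ABCDEF0L);
        System.out.printf("0x%016x%n", cell.get());
    }
}
```

Every access now pays for acquiring and releasing a lock, which illustrates why a single 64-bit move through an XMM or FPU register is preferred wherever the hardware offers one.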

    Analyzing Reads from Volatile Fields

    Finally, let’s examine the behavior during a read operation, such as when retrieving a value for display. The following assembly demonstrates the process:

    0x02e62346: fildll      0x58(%ecx)
    0x02e62349: fistpll     0x18(%esp)  ;*getstatic someField
                                        ; - VolatileTest::main@9 (line 7)
    
    0x02e6234d: movl        0x18(%esp), %edi
    0x02e62351: movl        0x1c(%esp), %ecx
    0x02e62355: movl        %edi, (%esp)
    0x02e62358: movl        %ecx, 4(%esp)
    0x02e6235c: movl        %esi, %ecx  ;*invokevirtual println
                                        ; - VolatileTest::main@12 (line 7)
    

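    The read operation essentially mirrors the write process but in reverse, shown here in its FPU-based form. The value is loaded from memory (0x58(%ecx)) into ST0 with a single 64-bit access and then written to the stack. Since the stack is thread-local, the copy can afterwards be split into two 32-bit halves with ordinary movl instructions without any risk of another thread modifying it in between. Note that no lock-prefixed instruction follows the read, in line with the earlier observation that volatile reads need no barrier on x86.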


    Test Environment Details

    All experiments and observations in this article were conducted using the following hardware and software configuration:

    Operating System

    Processor

    Java Development Kit (JDK)

    Two versions of Oracle JDK 1.8.0_431 were used during the experiments:

    1. 64-bit JDK:
      • Java HotSpot™ 64-Bit Server VM (build 25.431-b10, mixed mode).
    2. 32-bit JDK:
      • Java HotSpot™ Client VM (build 25.431-b10, mixed mode, sharing).

    JVM Settings

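    The following JVM options were applied: -server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=compileonly,VolatileTest.main, with -XX:UseAVX=0 and -XX:UseSSE=0 added for the experiments without AVX and SSE.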

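    Tools

    The tools mentioned throughout the article were used: javac and javap from the JDK, the ImHex hex editor, and the hsdis disassembler library.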


    Conclusion and Acknowledgment

    This comprehensive exploration highlights the JVM's remarkable adaptability in enforcing volatile semantics across a range of architectures and processor capabilities. From AVX and SSE to FPU-based fallbacks, each approach balances performance, hardware limitations, and atomicity.

    Thank you for accompanying me on this deep dive into volatile. This analysis has answered many questions and broadened my understanding of low-level JVM implementations. I hope it has been equally insightful for you!