java, multithreading, performance, low-latency, hft

Low CPU usage polling architecture between 2 JVMs


Server Environment

About the application:

Architecture:

Problem

Question:

Is there a more sophisticated way to keep polling for a value in the memory-mapped file that involves minimal overhead, minimal delay, and minimal CPU utilization? Note that every microsecond of delay degrades performance.

Code Snippet

The code snippet for Module B (the endless while-loop that polls and reads from the memory-mapped file) is below:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import sun.nio.ch.DirectBuffer;

    FileChannel fc_pointer = new RandomAccessFile(file, "rw").getChannel();
    MappedByteBuffer mem_file_pointer = fc_pointer.map(FileChannel.MapMode.READ_ONLY, 0, bufferSize);
    long address_file_pointer = ((DirectBuffer) mem_file_pointer).address();

    int last_read_value = unsafe.getInt(address_file_pointer);
    while (true)
    {
        int value_from_memory_mapped_file = unsafe.getInt(address_file_pointer);

        if (value_from_memory_mapped_file != last_read_value)
        {
            // do some operation...
            break; // exit the routine
        }
        // otherwise keep spinning
    } // end of while

Solution

    1. A highly loaded CPU is the real cost of the lowest possible latency. In a practical architecture that uses lock-free signaling, you should run no more than a couple of Consumer-Producer pairs of threads per CPU socket. One pair eats one or two cores almost completely (one core per thread, unless both threads are pinned to a single Intel core with Hyper-Threading enabled); that's why in most cases you have to think about horizontal scalability when you build an ultra-low-latency server system for many clients. BTW, don't forget to use "taskset" to pin each process to a specific core before performance tests, and to disable power management.

    2. There is a well-known trick where you park a Consumer after a long period of spinning with no result. But you have to spend some time parking and then unparking the thread. This introduces a moment of sporadic latency increase, of course, but the CPU core is free while the thread is parked. See, for example: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (8.4.4 Synchronization for Longer Periods). A nice illustration of different wait strategies for Java can also be found here: https://github.com/LMAX-Exchange/disruptor/wiki/Getting-Started (Alternative Wait Strategies)
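    The spin-then-park idea above can be sketched as follows. This is a minimal illustration, not code from the original system: an AtomicInteger stands in for the shared memory-mapped value, and the class name, method names, and SPIN_LIMIT are all illustrative (you would tune the limit per workload).

    ```java
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.LockSupport;

    // Spin-then-park wait strategy: busy-spin for a bounded number of
    // iterations, then fall back to parking so the core is released.
    public class SpinThenParkWaiter {
        private static final int SPIN_LIMIT = 10_000; // illustrative; tune per workload

        // Blocks until the shared value differs from lastSeen; returns the new value.
        static int awaitChange(AtomicInteger shared, int lastSeen) {
            int spins = 0;
            for (;;) {
                int v = shared.get();
                if (v != lastSeen) {
                    return v;                          // change observed
                }
                if (++spins < SPIN_LIMIT) {
                    Thread.onSpinWait();               // hot spin (Java 9+ hint)
                } else {
                    LockSupport.parkNanos(1_000L);     // back off ~1 µs; frees the core
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            AtomicInteger shared = new AtomicInteger(0);
            Thread producer = new Thread(() -> {
                try { Thread.sleep(50); } catch (InterruptedException ignored) {}
                shared.set(42);                        // publish a new value
            });
            producer.start();
            System.out.println("observed " + awaitChange(shared, 0));
            producer.join();
        }
    }
    ```

    The parkNanos fallback is where the sporadic latency spike lives: a parked thread must be rescheduled before it can observe the change, but while parked it burns no cycles.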

    3. If you are talking about milliseconds (ms), not microseconds (µs), you can just try TCP socket communication over loopback. It adds about 10 µs to pass a small amount of data from Producer to Consumer, and it is a blocking technique. Named pipes have better latency characteristics than sockets, but they are really non-blocking and you have to build something like a spin loop again. Memory-mapped files + the intrinsic Unsafe.getXXX (which compiles to a single x86 MOV) remain the best IPC technique in terms of both latency and throughput, since reading and writing require no system calls.
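    A minimal sketch of the blocking loopback approach (class and method names are illustrative, the port is chosen automatically, and the payload is a single int): the Consumer blocks inside readInt(), so it burns no CPU while waiting, at the cost of the kernel round-trip per message.

    ```java
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.InetAddress;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Blocking TCP-over-loopback IPC: one producer thread writes an int,
    // the consumer blocking-reads it back.
    public class LoopbackIpcDemo {
        static int sendAndReceive(int value) throws Exception {
            try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
                int port = server.getLocalPort();
                Thread producer = new Thread(() -> {
                    try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
                        s.setTcpNoDelay(true);         // don't let Nagle buffer the small write
                        DataOutputStream out = new DataOutputStream(s.getOutputStream());
                        out.writeInt(value);
                        out.flush();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                producer.start();
                try (Socket consumer = server.accept();
                     DataInputStream in = new DataInputStream(consumer.getInputStream())) {
                    int received = in.readInt();       // blocks in the kernel; no spinning
                    producer.join();
                    return received;
                }
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println("received " + sendAndReceive(42));
        }
    }
    ```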

    4. If you are still going to use lock-free code, memory-mapped files, and direct access via Unsafe, don't forget about appropriate memory barriers for both the Producer and the Consumer. For example, use "unsafe.getIntVolatile" instead of the first "unsafe.getInt" if you are not sure your code will always run on recent x86.
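    A minimal sketch of such a volatile off-heap access, assuming sun.misc.Unsafe is still reachable via reflection on your JVM; here an allocateMemory address stands in for the mapped file's address, and the class and method names are illustrative:

    ```java
    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    // Volatile off-heap store/load via Unsafe. getIntVolatile has acquire
    // semantics, so the JIT cannot hoist it out of a polling loop, and on
    // weakly ordered architectures the required load barrier is emitted.
    public class VolatileReadDemo {
        static Unsafe getUnsafe() throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }

        static int writeAndReadVolatile(int value) throws Exception {
            Unsafe unsafe = getUnsafe();
            long address = unsafe.allocateMemory(4);   // stand-in for the mapped file address
            try {
                unsafe.putIntVolatile(null, address, value); // producer side: store with barrier
                return unsafe.getIntVolatile(null, address); // consumer side: load with barrier
            } finally {
                unsafe.freeMemory(address);
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println("read " + writeAndReadVolatile(7));
        }
    }
    ```

    A plain getInt in the polling loop risks being hoisted by the JIT into a register read that never sees the Producer's update; the volatile variant rules that out regardless of the target architecture.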

    5. If you see unexpected CPU utilization, which should be no more than 30-40% (2 utilized cores on a 6-core CPU) per Producer-Consumer pair, use standard tools to check what is running on the other cores and the overall system performance. If you see intensive I/O associated with your mapped file, make sure it is mapped to a tmpfs file system to prevent real disk I/O. Check memory bus load and L3 cache misses for the "fattest" processes because, as we know, CPU time = (CPU execution clock cycles + memory stall cycles) × clock cycle time.

    And finally, a quite similar and interesting open-source project with a good example of how to use memory-mapped files: http://openhft.net/products/chronicle-queue/