Tags: java, asynchronous, nio, java-io

What's the benefit of Async File NIO in Java?


According to the documentation of AsynchronousFileChannel and AsynchronousChannelGroup, async NIO uses a dedicated thread pool where "IO events are handled". I couldn't find any clear statement of what "handling" means in this context, but according to this, I'm pretty sure that at the end of the day blocking occurs on those dedicated threads. To narrow things down, I'm using Linux, and based on Alex Yursha's answer, there is no such thing as non-blocking file IO on it; only Windows supports it at some level.

My question is: what is the benefit of using async NIO over synchronous IO running on a dedicated thread pool that I create myself? Considering the added complexity, in what scenario would it still be worth implementing?


Solution

  • It's mostly about handrolling your buffer sizes. In that way, you can save a lot of memory, but only if you're trying to handle a lot (many thousands) of simultaneous connections.

    First, some simplifications and caveats: as a running example, imagine we're writing a chat server where 1000 users are connected and every message a user sends is broadcast to all of them.

    The synchronous model

    In the synchronous model, life is relatively simple: We'll make 2001 threads: 1000 threads that each read from one connected user's socket, 1000 threads that each push outgoing data to one connected user (taking messages off that user's send queue), and 1 thread that accepts new connections.

    Each individual moving piece is easily programmed. Some tactical use of a single java.util.concurrent datatype, or even some basic synchronized() blocks will ensure we don't run into any race conditions. I envision maybe 1 page of code for each piece.
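
    For a sense of scale, here's a minimal sketch of those per-user pieces under my assumptions about the chat protocol (User, readMessage and writeMessage are placeholders, not anything from the JDK); a per-user BlockingQueue is the single java.util.concurrent datatype doing all the coordination:

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch only: 'User' stands in for whatever wraps one connection.
    class User {
        final BlockingQueue<String> outbox = new LinkedBlockingQueue<>();
        String readMessage() throws java.io.IOException { /* blocking socket read */ return ""; }
        void writeMessage(String msg) throws java.io.IOException { /* blocking socket write */ }
    }

    class SyncChatLoops {
        // One of these runs per user: block on their socket, then fan the message out to everyone.
        static Runnable receiver(User me, List<User> everyone) {
            return () -> {
                try {
                    while (true) {
                        String msg = me.readMessage();              // blocks until a full message arrives
                        for (User u : everyone) u.outbox.put(msg);  // queue it for every sender thread
                    }
                } catch (Exception e) { /* connection gone; clean up */ }
            };
        }

        // One of these also runs per user: block until there is something to send, then send it.
        static Runnable sender(User me) {
            return () -> {
                try {
                    while (true) me.writeMessage(me.outbox.take()); // take() blocks while the queue is empty
                } catch (Exception e) { /* connection gone; clean up */ }
            };
        }
    }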

    But, we do have 2001 threads. Each thread has a stack. In JVMs, each thread gets the same size stack; by default you configure how large these stacks are with the -Xss parameter (edit: unless you explicitly define the stack size when you create a thread - you should absolutely be doing this if you're going to spin off 2000 threads like in this example, and make them as small as you can reasonably get away with). You can make them as small as, say, 128k, but even then that's still 128k * 2001 = ~256MB just for the stacks (edit: these days you can make them a lot smaller, maybe even 32k - this synchronous model would work totally fine; you need a lot more threads before it becomes untenable). And we haven't covered any of the heap (all those strings that people are sending back and forth, stuck in send queues), the app itself, or the JVM basics.
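
    To make that edit note concrete: the JVM lets you request a per-thread stack size via the four-argument Thread constructor (the size is only a hint that the VM may round up or ignore; the handler and thread name below are made up for illustration):

    class SmallStackThreads {
        public static void main(String[] args) {
            Runnable handler = () -> { /* per-connection read/write loop would go here */ };
            // The 4-arg constructor takes a stack-size hint in bytes;
            // the JVM may round it up or ignore it entirely.
            Thread t = new Thread(null, handler, "user-connection-42", 64 * 1024);
            t.start();
        }
    }
    // Alternatively, set the default for every thread at launch: java -Xss128k MyChatServer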

    Under the hood, what's going to happen on a CPU with, say, 16 cores, is that there are 2001 threads and each thread has its own set of conditions that would result in it waking up. For the receivers it's data coming in over the pipe; for the senders it's either the network card indicating it is ready to send another packet (in case it's waiting to push data down the line), or an obj.wait() call getting notified (the threads that receive text from the users add that string to the queues of each of the 1000 senders and then notify them all).

    That's a lot of context switching: A thread wakes up, sees Joe: Hello, everybody, good morning! in the buffer, turns that into a packet, blits it to the memory buffer of the network card (this is all extremely fast, it's just CPU and memory interacting), and falls back asleep. The CPU core then moves on and finds another thread that is ready to do some work.

    CPU cores have on-core caches; in fact, there's a hierarchy: main RAM, then L3 cache, L2 cache, and the on-core cache. A CPU can no longer really operate directly on RAM in modern architectures: if it needs to read or write memory on a page that isn't in one of these caches, the infrastructure around the chip has to notice that, and the CPU will just freeze for a while until that infrastructure copies the relevant page of RAM into one of the caches.

    Every time a core switches, it is highly likely that it needs to load a new page, and that can take many hundreds of cycles where the CPU is twiddling its thumbs. A badly written scheduler would cause a lot more of this than is needed. If you read about advantages of NIO, often 'those context switches are expensive!' comes up - this is more or less what they are talking about (but, spoiler alert: The async model also suffers from this!)

    The async model

    In the synchronous model, the job of figuring out which of the 1000 connected users is ready for stuff to happen is 'stuck' in threads waiting on events; the OS is juggling those 2001 threads and will wake one up when there's stuff for it to do.

    In the async model we switch it up: We still have threads, but far fewer (one to two for each core is a good idea). That's far fewer threads than connected users: Each thread is responsible for ALL the connections, instead of only for 1 connection. That means each thread will do the job of checking which of the connected users have stuff to do (their network pipe has data to read, or is ready for us to push more data down the wire to them).

    The difference is in what the thread asks the OS: in the synchronous model a thread asks "block me until there is data on this one connection", while in the async model a thread asks "here are all the connections I care about; tell me which of them have something for me to do".
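
    As a rough sketch of that second question in Java code - using the selector-based flavour of NIO rather than the AsynchronousChannel API, purely because it keeps the example short - the heart of each async thread looks something like this:

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;

    class AsyncEventLoop {
        // One async thread: ask the OS which of the registered connections are ready,
        // handle exactly those, then go back to asking.
        static void run(Selector selector) throws IOException {
            while (true) {
                selector.select();                        // blocks until at least one connection is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();                          // the selected-keys set must be cleared by hand
                    if (key.isReadable()) {
                        // read the bytes that are available and update this connection's tracker object
                    }
                    if (key.isValid() && key.isWritable()) {
                        // push more queued bytes down this connection
                    }
                }
            }
        }
    }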

    There is no inherent speed or design advantage to either model - we're just shifting the job around between app and OS.

    One advantage often touted for NIO is that you don't need to 'worry' about race conditions, synchronizing, or concurrency-safe data structures. This is a commonly repeated falsehood: CPUs have many cores, so if your non-blocking app only ever makes one thread, the vast majority of your CPU will just sit there idle doing nothing, which is highly inefficient.

    The great upside here is: Hey, only 16 threads. That's 128k * 16 = 2MB of stack space. That's in stark contrast to the 256MB that the sync model took! However, a different thing now happens: In the synchronous model, a lot of state info about a connection is 'stuck' in that stack. For example, if I write this:

    Let's assume the protocol is: client sends 1 int, it's the # of bytes in the message, and then that many bytes, which is the message, UTF-8 encoded.

    // synchronous code: 'input' is a DataInputStream wrapped around this user's socket stream
    int size = input.readInt();                       // blocks until the 4 length bytes have arrived
    byte[] buffer = new byte[size];
    int pos = 0;
    while (pos < size) {
        int r = input.read(buffer, pos, size - pos);  // blocks until at least one byte arrives
        if (r == -1) throw new IOException("Client hung up");
        pos += r;
    }
    sendMessage(username + ": " + new String(buffer, StandardCharsets.UTF_8));
    

    When running this, the thread is most likely going to end up blocking on that read call to the input stream, as it involves talking to the network card and moving bytes from its memory buffers into this process's buffers. Whilst it's frozen, the pointer to that byte array, the size variable, r, etcetera are all sitting on the stack.

    In the async model, it doesn't work that way. In the async model, data is handed to you: you get whatever happens to be there, and you must handle it right then, because if you don't, that data is gone.

    So, in the async model you get, say, half of the Hello everybody, good morning! message. You get the bytes that represent Hello eve and that's it. For that matter, you already got the total byte length of this message and need to remember that, as well as the half you have received so far. You need to explicitly make an object and store this stuff somewhere.

    Here's the key point: With the synchronous model, a lot of your state info is in stacks. In the async model, you make the data structures to store this state yourself.

    And because you make these yourself, they can be dynamically sized and generally far smaller: you need ~4 bytes to store size, another 8 or so for a pointer to the byte array, a handful for the username pointer, and that's about it. That's orders of magnitude less than the 128k that a stack takes to store that stuff.
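
    A minimal sketch of such a hand-rolled tracker object for the length-prefixed protocol above (the class and field names are my own, not any library's API):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Per-connection state for the async model: everything the synchronous model
    // kept implicitly on a thread's stack now lives in this small object.
    class ConnectionState {
        final String username;
        ByteBuffer lengthBytes = ByteBuffer.allocate(4); // the 4-byte length prefix, possibly partial
        ByteBuffer message;                              // allocated once the length is known

        ConnectionState(String username) { this.username = username; }

        // Feed whatever bytes just arrived; returns the decoded message once complete, else null.
        String accept(ByteBuffer incoming) {
            if (lengthBytes.hasRemaining()) {
                while (lengthBytes.hasRemaining() && incoming.hasRemaining()) lengthBytes.put(incoming.get());
                if (lengthBytes.hasRemaining()) return null;         // still waiting for the full length prefix
                lengthBytes.flip();
                message = ByteBuffer.allocate(lengthBytes.getInt()); // size the buffer exactly to this message
            }
            while (message.hasRemaining() && incoming.hasRemaining()) message.put(incoming.get());
            if (message.hasRemaining()) return null;                 // still waiting for the rest of the message
            // (a real implementation would reset here to be ready for the next message)
            return new String(message.array(), StandardCharsets.UTF_8);
        }
    }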

    Now, another theoretical benefit is that you don't get the context switch - instead of the CPU and OS having to swap to another thread when a read() call has no data left to give you because the network card is waiting for data, it's now the thread's job to go: Okay, no problem - I shall move on to another context object.

    But that's a red herring - it doesn't matter if the OS is juggling 1000 context concepts (1000 threads), or if your application is juggling 1000 context concepts (these 'tracker' objects). It's still 1000 connections and everybody chatting away, so every time your thread moves on to check another context object and fill its byte array with more data, most likely it's still a cache miss and the CPU is still going to twiddle its thumbs for hundreds of cycles whilst the hardware infrastructure pulls the appropriate page from main RAM into the caches. So that part is not nearly as relevant, though the fact that the context objects are smaller is going to reduce cache misses somewhat.

    That gets us back to: The primary benefit is that you get to handroll those buffers, and in so doing, you can both make them far smaller, and size them dynamically.

    The downsides of async

    There's a reason we have garbage collected languages. There is a reason we don't write all our code in assembler. Carefully managing all these finicky details by hand is usually not worth it. And so it is here: Often that benefit is not worth it. But just like GFX drivers and kernel cores have a ton of machine code, and drivers tend to be written in hand-managed memory environments, there are cases where careful management of those buffers is very much worth it.

    The cost is high, though.

    Imagine a theoretical programming language with the following properties: every function is coloured either blue or red; a red function can call both red and blue functions, but a blue function may never call a red one; and you often can't tell what colour a function is without digging into its implementation and everything it depends on.

    This seems like an utterly boneheaded disaster of a language, no? But that's exactly the world you live in when writing async code!

    The problem is: within async code, you cannot call a blocking function, because if it blocks, hey, that's one of only 16 threads now blocked, and that immediately means 1/16th of your CPU is doing nothing. If all 16 threads end up in that blocking part, the CPU is literally doing nothing at all and everything is frozen. You just can't do it.

    There is a ton of stuff that blocks: opening files, even touching a class never touched before (that class needs to be loaded from the jar on disk, verified, and linked), so much as looking at a database, doing a quick network check, and sometimes even asking for the current time will do it. Even logging at debug level might do it (if that ends up writing to disk, voila - blocking operation).

    Do you know of any logging framework that either promises to fire up a separate thread to process logs onto disk, or goes out of its way to document if it blocks or not? I don't know of any, either.

    So, methods that block are red, your async handlers are blue. Tada - that's why async is so incredibly difficult to truly get right.
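
    The usual escape hatch, if a 'blue' async handler really must call a 'red' blocking method, is to hand that call to a separate pool that exists purely for blocking work, so none of the 16 async threads ever stalls. A sketch under that assumption, where blockingLookup() stands in for any blocking call (a JDBC query, say):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;

    class BlockingEscapeHatch {
        // A pool that is allowed to block; it is separate from the async event-loop threads.
        private static final ExecutorService BLOCKING_POOL = Executors.newFixedThreadPool(50);

        // Stand-in for any red/blocking call (JDBC query, file read, ...).
        static String blockingLookup(String key) { return "value-for-" + key; }

        // Called from an async handler: hand the blocking work to the other pool and
        // register a callback; the async thread returns to its event loop immediately.
        static void lookupAsync(String key, Consumer<String> onDone) {
            CompletableFuture
                .supplyAsync(() -> blockingLookup(key), BLOCKING_POOL)
                .thenAccept(onDone); // runs once the blocking call finishes
        }
    }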

    The executive summary

    Writing async code well is a real pain due to the coloured functions issue. It's also not on its face faster - in fact, it's usually slower. Async can win big if you want to run many thousands of operations simultaneously and the amount of storage required to track the relevant state data for each individual operation is small, because you get to handroll that buffer instead of being forced into relying on 1 stack per thread.

    If you have some money left over, well, a developer salary buys you a lot of sticks of RAM, so usually the right option is to go with threads and just opt for a box with a lot of RAM if you want to handle many simultaneous connections.

    Note that sites like YouTube, Facebook, etc. effectively take the 'toss money at RAM' solution - they shard their product so that many simple and cheap computers work together to serve up a website. Don't knock it.

    An example where async can really shine is the chat app I've described in this answer. Another is, say, receiving a short message where all you do is hash it, encrypt the hash, and respond with it (to hash, you don't need to remember all the bytes flowing in; you can just toss each byte into the hasher, which has constant memory load, and when the bytes have all arrived, voila, you have your hash). You're looking for little state per operation and not much CPU power either, relative to the speed at which the data is provided.
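
    That 'toss each byte into the hasher' approach maps directly onto java.security.MessageDigest, whose internal state stays constant-sized no matter how many bytes you feed it; a small sketch (the chunking and the choice of SHA-256 are arbitrary):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Base64;

    class StreamingHash {
        public static void main(String[] args) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");

            // In the async model you would call update() from your handler each time a few
            // more bytes of the message arrive; the digest's internal state stays tiny
            // regardless of how large the overall message is.
            digest.update("Hello eve".getBytes(StandardCharsets.UTF_8));
            digest.update("rybody, good morning!".getBytes(StandardCharsets.UTF_8));

            byte[] hash = digest.digest();   // finish once the sender says the message is complete
            System.out.println(Base64.getEncoder().encodeToString(hash));
        }
    }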

    Some bad examples are: a system where you need to do a bunch of DB queries (you'd need an async way to talk to your DB, and in general DBs are bad at trying to run 1000 queries simultaneously), or a bitcoin mining operation (the mining itself is the bottleneck; there's no point trying to handle thousands of connections simultaneously on one machine).