What's the benefit of Async File NIO in Java?

According to the documentation of AsynchronousFileChannel and AsynchronousChannelGroup, async NIO uses a dedicated thread pool where "IO events are handled". I couldn't find any clear statement about what "handling" means in this context, but according to this, I'm pretty sure that at the end of the day, blocking occurs on those dedicated threads. To narrow things down, I'm using Linux, and based on Alex Yursha's answer, there is no such thing as non-blocking IO for files on it; only Windows supports it to some degree.

My question is: what is the benefit of using async NIO versus synchronous IO running on a dedicated thread pool that I create myself? Considering the introduced complexity, what would be a scenario in which it would still be worth implementing?

Fusiform answered 10/7, 2020 at 20:54 Comment(13)
Setting aside the fact that NIO could be implemented more efficiently by the JDK in the future or on some platforms, there is still the aspect of "what is the benefit of providing a facility as part of the standard library when I and everyone else could just implement the same thing independently?". Unless you can do better than the standard library, it would be a waste of time. Even if you can do better, it would need to be quite a bit better in order to justify the effort.Pareto
@Pareto I didn't mean to reinvent the wheel, but for example, encrypting a file via Streams and offloading it to a thread pool seems 100 times easier than implementing encryption using async callbacks. The future JDK improvement is a valid argument, but right now what is the benefit?Fusiform
That is a very odd argument. I would expect that streaming library to take care of efficient use of NIO under the hood already, and that should be 100 times easier than manually messing around with a thread pool. If encrypting a file via streams takes more than a couple of lines of code then I would start looking for another library.Pareto
Now that I am thinking about how I would actually do that, no such library comes to mind. Could be that I have been in Scala-land for too long. Maybe there is a hole to be filled. But my point was that NIO is not something to use directly, but something that libraries like Netty or Reactive Streams would use internally.Pareto
After Googling for a while, I only found this in the Reactor library: projectreactor.io/docs/netty/release/api/reactor/netty/… Creates a Flux of buffers being read from a file path using NIO via Netty. So yeah, unless you are already using this Reactor library this is a bit too much effort.Pareto
Akka Streams also does this: doc.akka.io/docs/akka/current/stream/…Pareto
@Pareto staying with the encryption example, with AsynchronousFileChannel it took me like 150 lines of code, messing with cipher update, cipher doFinal, maintaining read/write position, etc, while it's two lines with CipherOutputStream. I'm using Reactor + Spring, but sometimes their implementations are surprising as well, like FilePart's "reactive" transferTo method: github.com/spring-projects/spring-framework/blob/…Fusiform
"Akka Streams also does this". Well, maybe not. The documentation talks about having a dedicated pool for blocking IO, so probably no NIO in there either :-/ Could be that the pragmatic approach of not doing either, but just have straight blocking IO on the main thread is good enough? How hard can you hit that disk before it becomes a bottleneck for parallelism that you cannot get past anyway?Pareto
This is the loop I got into a few weeks ago; I hope I didn't ruin your life as well. :D Thank you for reading and commenting this much about it.Fusiform
Start with whatever is the easiest to program, understand, debug. Focus on the application logic. Most likely it will be perfectly fine performance-wise. If not, only then go in and revisit.Pareto
As Knuth is quoted as saying “The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” - Read more about premature optimization here.Ellen
I benchmarked it myself some time ago and found that Async File NIO has no obvious performance advantages but adds quite a bit of complexity. See results here.Rohde
This is a duplicate of non-blocking IO vs async IO and implementation in JavaPaquette

It's mostly about handrolling your buffer sizes. In that way, you can save a lot of memory, but only if you're trying to handle a lot (many thousands) of simultaneous connections.

First some simplifications and caveats:

  • I'm going to assume a non-boneheaded scheduler. There are some OSes that just do a really poor job of juggling thousands of threads. There is no inherent reason an OS should fall over when a user process fires up 1000 full threads, but some do anyway. NIO can help there, but that's a bit of an unfair comparison - usually you should just upgrade your OS. Pretty much any Linux, and I believe Windows 10, definitely don't have issues with this many threads; but some old Linux port on an ARM hack, or something like Windows 7 - those might cause problems.

  • I'm going to assume you're using NIO to deal with incoming TCP/IP connections (e.g. a web server, or IRC server, something like that). The same principles apply if you're trying to read 1000 files simultaneously, but note that you do need to think about where the bottleneck lies. For example, reading 1000 files simultaneously from a single disk is a pointless exercise - that just slows things down as you're making life harder for the disk (this counts double if it's a spinning disk). For networking, especially if you're on a fast pipe, the bottleneck is not the pipe or your network card, which makes 'handle 1000s of connections simultaneously' a good example. In fact, I'm going to use as example a chat server where 1000 people all connect to one giant chatroom. The job is to receive text messages from anybody connected and send them out to everybody.

The synchronous model

In the synchronous model, life is relatively simple: We'll make 2001 threads:

  • 1 thread to listen for new incoming TCP connections on a socket. This thread will create the 2 'handler' threads and go back to listening for new connections.
  • per user a thread that reads from the socket until it sees an enter symbol. If it sees this, it will take all text received so far, and notify all 1000 'sender' threads with this new string that needs to be sent out.
  • per user a thread that will send out the strings in a buffer of 'text messages to send out'. If there's nothing left to send it will wait until a new message is delivered to it.

Each individual moving piece is easily programmed. Some tactical use of a single java.util.concurrent datatype, or even some basic synchronized() blocks will ensure we don't run into any race conditions. I envision maybe 1 page of code for each piece.
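
As a rough illustration (not from the original answer; names like handleReads/handleWrites and the port number are made up), the acceptor piece of this synchronous model could look something like this:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Thread 1 of 2001: accept new connections and spin up the two per-user handler threads.
public class ChatAcceptor {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(6667)) {
            while (true) {
                Socket client = server.accept();                // blocks until someone connects
                new Thread(() -> handleReads(client)).start();  // per-user 'receiver' thread
                new Thread(() -> handleWrites(client)).start(); // per-user 'sender' thread
            }
        }
    }

    static void handleReads(Socket client)  { /* read until a newline, then notify all senders */ }
    static void handleWrites(Socket client) { /* drain this user's outgoing message queue */ }
}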

But, we do have 2001 threads. Each thread has a stack. In JVMs, each thread gets the same size stack [EDITED]unless you explicitly define the stack size when you create a thread - which you should absolutely do if you're going to spin off 2000 threads like in this example, making them as small as you can reasonably get away with[/EDITED]; by default you configure how large these stacks are with the -Xss parameter. You can make them as small as, say, 128k, but even then that's still 128k * 2001 = ~256MB just for the stacks ([EDIT]these days you can make them a lot smaller, maybe even 32k - this synchronous model would work totally fine; you need a lot more threads before it becomes untenable![/EDIT]), and we haven't covered any of the heap (all those strings that people are sending back and forth, stuck in send queues), or the app itself, or the JVM basics.
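
For illustration, a sketch of both knobs (the per-thread stack size is a hint the JVM may round up to a platform minimum; the class name is made up):

public class SmallStackThreads {
    public static void main(String[] args) {
        // Per-thread stack size is the last constructor argument, in bytes.
        Thread reader = new Thread(null, () -> { /* read from one user's socket */ }, "reader-42", 64 * 1024);
        reader.start();
        // Or set the default for every thread at JVM launch instead:
        //   java -Xss128k ChatServer
    }
}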

Under the hood, what's going to happen to the CPU, which has, say, 16 cores, is that there are 2001 threads and each thread has its own set of conditions that would result in it waking up. For the receivers it's data coming in over the pipe; for the senders it's either the network card indicating it is ready to send another packet (in case it's waiting to push data down the line), or waiting for an obj.wait() call to get notified (the threads that receive text from the users would add that string to all the queues of each of the 1000 senders and then notify them all).

That's a lot of context switching: A thread wakes up, sees Joe: Hello, everybody, good morning! in the buffer, turns that into a packet, blits it to the memory buffer of the network card (this is all extremely fast, it's just CPU and memory interacting), and will fall back asleep, for example. The CPU core will then move on and find another thread that is ready to do some work.

CPU cores have on-core caches; in fact, there's a hierarchy: main RAM, then L3 cache, L2 cache, and the on-core cache. A CPU cannot really operate directly on RAM anymore in modern architectures; it needs the infrastructure around the chip to notice that it wants to read or write memory on a page that isn't in one of these caches, and then the CPU just freezes for a while until that infrastructure can copy that page of RAM into one of the caches.

Every time a core switches, it is highly likely that it needs to load a new page, and that can take many hundreds of cycles where the CPU is twiddling its thumbs. A badly written scheduler would cause a lot more of this than is needed. If you read about advantages of NIO, often 'those context switches are expensive!' comes up - this is more or less what they are talking about (but, spoiler alert: The async model also suffers from this!)

The async model

In the synchronous model, the job of figuring out which of the 1000 connected users is ready for stuff to happen is 'stuck' in threads waiting on events; the OS is juggling those 1000 threads and will wake up threads when there's stuff to do.

In the async model we switch it up: We still have threads, but far fewer (one to two for each core is a good idea). That's far fewer threads than connected users: Each thread is responsible for ALL the connections, instead of only for 1 connection. That means each thread will do the job of checking which of the connected users have stuff to do (their network pipe has data to read, or is ready for us to push more data down the wire to them).

The difference is in what the thread asks the OS:

  • [synchronous] Okay, I want to go to sleep until this one connection sends data to me.
  • [async] Okay, I want to go to sleep until one of these thousand connections sends data to me, or one where I've registered that I have more data to send has its network buffer clear up, or the socket listener has a new user connecting.

There is no inherent speed or design advantage to either model - we're just shifting the job around between app and OS.
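
To make that "one thread asks about all thousand connections" idea concrete, here is a rough sketch of a single-threaded java.nio Selector loop (error handling and the actual message logic are left out; this is an illustration, not code from the answer):

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class ChatSelectorLoop {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(6667));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                        // sleep until ANY registered channel has something to do
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {             // the socket listener has a new user connecting
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    // read whatever bytes are available right now; it may be half a message
                } else if (key.isWritable()) {
                    // the network buffer has room again; push more queued data down the wire
                }
            }
        }
    }
}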

One advantage often touted for NIO is that you don't need to 'worry' about race conditions, synchronizing, or concurrency-safe data structures. This is a commonly repeated falsehood: CPUs have many cores, so if your non-blocking app only ever makes one thread, the vast majority of your CPU will just sit there idle, doing nothing - which is highly inefficient.

The great upside here is: Hey, only 16 threads. That's 128k * 16 = 2MB of stack space. That's in stark contrast to the 256MB that the sync model took! However, a different thing now happens: In the synchronous model, a lot of state info about a connection is 'stuck' in that stack. For example, if I write this:

Let's assume the protocol is: client sends 1 int, it's the # of bytes in the message, and then that many bytes, which is the message, UTF-8 encoded.

// synchronous code - 'input' is assumed to be a DataInputStream wrapping this user's socket stream
int size = input.readInt();                    // blocks until the 4 length bytes have arrived
byte[] buffer = new byte[size];
int pos = 0;
while (pos < size) {                           // keep reading until the whole message is in
    int r = input.read(buffer, pos, size - pos);
    if (r == -1) throw new IOException("Client hung up");
    pos += r;
}
sendMessage(username + ": " + new String(buffer, StandardCharsets.UTF_8));

When running this, the thread is most likely going to end up blocking on that read call to the input stream, as that will involve talking to the network card and moving some bytes from its memory buffers into this process's buffers to get the job done. Whilst it's frozen, the pointer to that byte array, the size variable, r, etcetera are all on the stack.

In the async model, it doesn't work that way: you get handed whatever data happens to be there, and you must handle it then and there, because if you don't, that data is gone.

So, in the async model you get, say, half of the Hello everybody, good morning! message. You get the bytes that represent Hello eve and that's it. For that matter, you got the total byte length of this message already and need to remember that, as well as the half you received so far. You need to explicitly make an object and store this stuff somewhere.

Here's the key point: With the synchronous model, a lot of your state info is in stacks. In the async model, you make the data structures to store this state yourself.

And because you make these yourself, they can be dynamically sized, and generally far smaller: You just need ~4 bytes to store size, another 8 or so for a pointer to the byte array, a handful for the username pointer and that's about it. That's orders of magnitude less than the 128k that stack is taking to store that stuff.
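
A sketch of what such a hand-rolled per-connection 'tracker' object might look like (the field names are illustrative):

// One of these per connected user: all the state that, in the synchronous model,
// implicitly lives on that user's 128k thread stack.
class ConnectionState {
    String username;
    int size = -1;      // total message length, once the 4 length bytes have arrived
    byte[] buffer;      // the partially received message, e.g. the bytes of "Hello eve"
    int pos;            // how many of 'size' bytes have arrived so far

    boolean complete() {
        return size != -1 && pos == size;
    }
}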

Now, another theoretical benefit is that you don't get the context switch - instead of the CPU and OS having to swap to another thread when a read() call has no data left to give you because the network card is waiting for data, it's now the thread's job to go: Okay, no problem - I shall move on to another context object.

But that's a red herring - it doesn't matter if the OS is juggling 1000 context concepts (1000 threads), or if your application is juggling 1000 context concepts (these 'tracker' objects). It's still 1000 connections and everybody chatting away, so every time your thread moves on to check another context object and fill its byte array with more data, most likely it's still a cache miss and the CPU is still going to twiddle its thumbs for hundreds of cycles whilst the hardware infrastructure pulls the appropriate page from main RAM into the caches. So that part is not nearly as relevant, though the fact that the context objects are smaller is going to reduce cache misses somewhat.

That gets us back to: The primary benefit is that you get to handroll those buffers, and in so doing, you can both make them far smaller, and size them dynamically.

The downsides of async

There's a reason we have garbage collected languages. There is a reason we don't write all our code in assembler. Carefully managing all these finicky details by hand is usually not worth it. And so it is here: often that benefit is not worth it. But just like GFX drivers and OS kernels have a ton of machine code, and drivers tend to be written in hand-managed memory environments, there are cases where careful management of those buffers is very much worth it.

The cost is high, though.

Imagine a theoretical programming language with the following properties:

  • Each function is either red or blue.
  • A red function can call blue or red functions, no problem.
  • A blue function can also call both, but if a blue function calls a red function, you have a bug that is almost impossible to test for but will kill your performance on realistic loads. Blue can call red functions only by going out of their way to define both the call and the response to the result of the call separately and injecting this pair into a queue.
  • Functions tend not to document their colour.
  • Some system functions are red.
  • Your function must be blue.

This seems like an utterly boneheaded disaster of a language, no? But that's exactly the world you live in when writing async code!

The problem is: within async code, you cannot call a blocking function, because if it blocks, hey, that's one of only 16 threads now blocked, and that immediately means 1/16th of your CPU capacity is doing nothing. If all 16 threads end up in that blocking part, the CPU is literally doing nothing at all and everything is frozen. You just can't do it.
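
The "define the call and the response separately and inject the pair into a queue" escape hatch from the list above looks roughly like this with CompletableFuture (the pool size and method names are made up for illustration):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OffloadToBlockingPool {
    static final ExecutorService blockingPool = Executors.newFixedThreadPool(32);

    // Called from an async handler: never block here; ship the red (blocking) call elsewhere.
    static void onMessage(String msg) {
        CompletableFuture
            .supplyAsync(() -> lookUpInDatabase(msg), blockingPool) // the blocking 'call'
            .thenAccept(OffloadToBlockingPool::sendReply);          // the separately defined 'response'
    }

    static String lookUpInDatabase(String msg) { return "row for " + msg; } // stand-in for a blocking call
    static void sendReply(String reply) { /* hand the result back to the event loop */ }
}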

There is a ton of stuff that blocks: Opening files, even touching a class never touched before (that class needs to be loaded from the jar from disk, verified, and linked), so much as looking at a database, doing a quick network check, sometimes asking for the current time will do it. Even logging at debug level might do it (if that ends up writing to disk, voila - blocking operation).

Do you know of any logging framework that either promises to fire up a separate thread to process logs onto disk, or goes out of its way to document if it blocks or not? I don't know of any, either.

So, methods that block are red, your async handlers are blue. Tada - that's why async is so incredibly difficult to truly get right.

The executive summary

Writing async code well is a real pain due to the coloured functions issue. It's also not on its face faster - in fact, it's usually slower. Async can win big if you want to run many thousands of operations simultaneously and the amount of storage required to track the relevant state data for each individual operation is small, because you get to handroll that buffer instead of being forced into relying on 1 stack per thread.

If you have some money left over, well, a developer salary buys you a lot of sticks of RAM, so usually the right option is to go with threads and just opt for a box with a lot of RAM if you want to handle many simultaneous connections.

Note that sites like youtube, facebook, etc effectively take the 'toss money at RAM' solution - they shard their product so that many simple and cheap computers work together to serve up a website. Don't knock it.

An example where async can really shine is the chat app I've described in this answer. Another is, say, receiving a short message and doing nothing but hashing it, encrypting the hash, and responding with it (to hash, you don't need to remember all the bytes flowing in; you can just toss each byte into the hasher, which has a constant memory load, and when the bytes are all sent, voila, you have your hash). You're looking for little state per operation and not much CPU power either, relative to the speed at which the data is provided.
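
That hash-and-respond case stays at a constant memory load because MessageDigest can be fed incrementally; a minimal sketch:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StreamingHash {
    private final MessageDigest digest;

    StreamingHash() throws NoSuchAlgorithmException {
        digest = MessageDigest.getInstance("SHA-256");
    }

    // Called from the async read handler with whatever chunk just arrived -
    // nothing needs to be kept around beyond the digest's small fixed-size state.
    void onChunk(byte[] chunk, int length) {
        digest.update(chunk, 0, length);
    }

    // Called once the peer signals that the message is complete.
    byte[] finish() {
        return digest.digest();
    }
}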

Some bad examples are: a system where you need to do a bunch of DB queries (you'd need an async way to talk to your DB, and in general DBs are bad at trying to run 1000 queries simultaneously), or a bitcoin mining operation (the mining itself is the bottleneck; there's no point trying to handle thousands of connections simultaneously on one machine).

Arabeila answered 29/10, 2020 at 14:39 Comment(2)
When Loom ends up in the mainstream, this paradigm of asynchronous programming might become entirely irrelevant, one way or the other. +1 stillLatterll
The red/blue function metaphor could be confusing in the context of Java. It originally relates to languages with async functions, such as C# and Python. All the asynchronous and synchronous programming constructs available in Java are fairly composable with each other.Averett

According to the javadoc https://docs.oracle.com/javase/7/docs/api/java/nio/channels/AsynchronousChannelGroup.html#withThreadPool(java.util.concurrent.ExecutorService), this thread pool is used to run completion handlers (https://docs.oracle.com/javase/7/docs/api/java/nio/channels/CompletionHandler.html) - the handlers that process IO events, not the actual IO. As I understand it, Node.js has a single event loop; in Java we have the ability to run the callbacks concurrently on the threads of this thread pool.
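
For illustration, a minimal sketch of where that pool comes into play with AsynchronousFileChannel (the pool runs the CompletionHandler callbacks; the file name and pool size are made up):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncReadExample {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // completion handlers run on these threads
        AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                Path.of("data.bin"), Set.of(StandardOpenOption.READ), pool);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        channel.read(buffer, 0, buffer, new CompletionHandler<Integer, ByteBuffer>() {
            @Override
            public void completed(Integer bytesRead, ByteBuffer buf) {
                System.out.println("Read " + bytesRead + " bytes on " + Thread.currentThread().getName());
            }

            @Override
            public void failed(Throwable exc, ByteBuffer buf) {
                exc.printStackTrace();
            }
        });
        // ... eventually close the channel and shut down the pool
    }
}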

Elocution answered 29/8, 2023 at 14:47 Comment(0)
