Is there any point in async file IO?

Asked 5/1, 2022 at 20:39 Answered 14/3, 2023 at 15:16

Solved asynchronous rust async-await concurrency rust-tokio

Async runtimes for Rust like tokio provide "asyncified" copies of many standard functions, including some file IO ones, which work by basically just summoning the corresponding blocking task (on a new thread?). Examples of such functions are tokio::fs::create_dir_all, tokio::fs::read_dir, tokio::fs::read, ...

What's the advantage of all these functions? Why should I prefer them over the standard blocking functions in an async context? If I'm .awaiting their results, is there any gain at all?

An example would be an async web route that returns the contents of some file based on the query (using Rocket):

#[get("/foo/<query>")]
async fn foo(query: &str) -> Option<String> {
    let file_path = // interpret `query` somehow and find the target file
    tokio::fs::read_to_string(file_path).await.ok()
    // ^ Why not just `std::fs::read_to_string(file_path).ok()`?
}

I understand the gains of async/.await for socket IO or delayed tasks (with thread sleep), but in this case it seems pointless to me. If anything, it makes more complex tasks much more difficult to solve in code (working with streams when searching for a file in a list of directories, for example).

Slavery answered 5/1, 2022 at 20:39 Comment(9)

Yes, because if there are any other concurrent tasks, .await allows them to run while the other thread with the file IO on it is blocked. Using blocking calls in an async task will block the thread, thus blocking all tasks that are running on the same thread. – Aten 5/1, 2022 at 20:52

Doesn't .await only allow other tasks to run if the task being awaited returned the Pending state last time it was polled by the executor? From what I understood from the async book, this is only returned when the task is not ready to produce any output yet, which is never the case for file IO (because the task is always ready to return some bytes, unlike, say, socket IO, whose tasks are often just waiting for new connections). If this is correct then file IO tasks never return Pending, and are almost the same as their blocking counterparts. – Slavery 5/1, 2022 at 20:58

It looks like your question might be answered by the answers of Why does Future::select choose the future with a longer sleep period first?. If not, please edit your question to explain the differences. Otherwise, we can mark this question as already answered. – Cookhouse 5/1, 2022 at 21:4

std::fs::read_to_string will block the thread, preventing Tokio (or any other async runtime) from running any other asynchronous tasks. If you don't want to do other things while performing IO, you don't need to use async in the first place. – Cookhouse 5/1, 2022 at 21:6

"which is never the case for file IO - because the task is always ready to return some bytes" - this is categorically false, while the data may be available immediately (particularly if the file has been read recently and still in the OS caches), most file reads will take some non-negligible amount of time, time which can be better spent doing other stuff instead of waiting around. Consider the timing of hard drive spin-up, network drives, virtual filesystems, and large reads. – Pirogue 5/1, 2022 at 21:14

because the task is always ready to return some bytes... - This is only true if you naively attempt to query the file descriptor using epoll() or the like. The fact that the file descriptor is "ready" doesn't mean that reading from it won't block for the duration it takes to read the contents from disk (or, in case of network-mounted file system, from the network, which may take an arbitrary time or never finish). The async versions of std::fs operations resolve that by off-loading the blocking operations to a separate thread and arranging for wakeup when the whole operation is complete. – Isoleucine 5/1, 2022 at 21:33

@Pirogue Thanks for correcting me, I didn't know about such delays. Does this mean that thread pools are generally a better choice than async/await when writing a mostly file-I/O-bound web-server though (say, a file synchronization server)? If I now understand this correctly, async just shifts the I/O tasks to other ("non-http-server"?) threads to keep accepting new connections. If this is correct, then it doesn't seem much different from processing requests in a thread pool. And if so, when should I prefer async (when will it give some real significant advantage)? Should this be a post edit? – Slavery 7/1, 2022 at 16:35

Honestly, for a file server where you're just shuffling bytes around 90% of the time, it probably doesn't matter. – Pirogue 7/1, 2022 at 19:5

FWIW, on Linux, reads from disk files always block, and disk files always report as ready when polled through the select/poll interfaces, which is what most async libraries use. So async won't really help with disk file IO unless the library uses the aio interface or performs the read on another thread. – Earthenware 7/1, 2022 at 20:26

I guess you're reading small files on a local filesystem with a pretty fast drive. If that's the case, there may be little point in using the async version of these functions.

If half of your HTTP requests need to read from the filesystem, then you might start noticing that your runtime spends a substantial amount of time waiting for blocking IO. This really depends on the nature of your application. Maybe you have one thread? Maybe you have many?

However, there are edge-case scenarios where the filesystem can be slow enough to be a really big problem. Here are two extreme corner cases:

A network-mounted filesystem (e.g. NFS, IPFS). There can be multiple network round-trips under create_dir_all. While that call is blocking, your service is basically unresponsive.
Slow hard drive. Spinning disks. Or even reading from a CD-ROM drive. Sure, you won't run your web server from a CD-ROM, but a tool that compares whether two magnetic tapes (yes, physical tapes are still used for backups) are identical would suffer greatly if some underlying library is doing blocking IO.

Now, if you're writing a library that exposes an async API, you can't make assumptions about the underlying filesystem or its backing hardware, and should use non-blocking IO.
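To make the cost of a blocking call concrete, here is a minimal std-only sketch. It is an illustration under stated assumptions, not a benchmark: `slow_read`, `time_until_free` and the 100 ms `SLOW_IO` latency are made-up stand-ins for a slow disk or network mount. It measures how long the calling thread is tied up when the slow call runs inline versus when it is handed to another thread.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Assumed latency of the slow filesystem call, chosen only for illustration.
const SLOW_IO: Duration = Duration::from_millis(100);

/// Stand-in for a blocking read from a slow disk or a network mount.
fn slow_read() -> String {
    thread::sleep(SLOW_IO);
    String::from("file contents")
}

/// Returns how long the calling thread was tied up before it could
/// move on to its next piece of work.
pub fn time_until_free(offload: bool) -> Duration {
    let start = Instant::now();
    let worker = if offload {
        // The slow call runs on another thread; this thread continues at once.
        Some(thread::spawn(slow_read))
    } else {
        // The slow call runs right here; nothing else can happen meanwhile.
        let _contents = slow_read();
        None
    };
    let busy = start.elapsed();
    // Wait for the worker only after the measurement has been taken.
    if let Some(handle) = worker {
        handle.join().expect("worker thread panicked");
    }
    busy
}
```

Calling `time_until_free(false)` reports at least the full `SLOW_IO` latency, while `time_until_free(true)` reports only the cost of starting a thread. In an async runtime, the thread that gets tied up is the one that would otherwise be polling other tasks.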

Perfuse answered 14/3, 2023 at 15:16 Comment(0)

The difference between tokio::fs::read_to_string and std::fs::read_to_string is that the Tokio function will offload the file IO call to the spawn_blocking threadpool, whereas the std::fs::read_to_string call will not do that.

This is important because if you don't offload the file IO to a separate thread, then you are blocking the runtime from making progress, which means that other tasks in the runtime will be unable to execute for the duration of your file IO operation. See the link for a more in-depth explanation.
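The offloading idea can be sketched with the standard library alone. This is not Tokio's actual implementation (Tokio uses its own blocking thread pool), and `ReadFile`, `ThreadWaker` and `block_on` are hypothetical names introduced only for this example. The point it shows is that the future returns Pending while a worker thread performs the blocking read, and the worker wakes the task once the result is stored.

```rust
use std::fs;
use std::future::Future;
use std::io;
use std::path::PathBuf;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// State shared between the future and its worker thread.
#[derive(Default)]
struct Shared {
    result: Option<io::Result<String>>,
    waker: Option<Waker>,
}

/// Hypothetical future that resolves once a worker thread has read the file.
pub struct ReadFile {
    shared: Arc<Mutex<Shared>>,
}

impl ReadFile {
    pub fn new(path: impl Into<PathBuf>) -> Self {
        let path = path.into();
        let shared = Arc::new(Mutex::new(Shared::default()));
        let worker_state = Arc::clone(&shared);
        thread::spawn(move || {
            // The blocking call happens here, off the polling thread.
            let result = fs::read_to_string(&path);
            let waker = {
                let mut state = worker_state.lock().unwrap();
                state.result = Some(result);
                state.waker.take()
            };
            // Wake the task after releasing the lock.
            if let Some(waker) = waker {
                waker.wake();
            }
        });
        ReadFile { shared }
    }
}

impl Future for ReadFile {
    type Output = io::Result<String>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let mut state = self.shared.lock().unwrap();
        match state.result.take() {
            Some(result) => Poll::Ready(result),
            None => {
                // Not finished yet: remember whom to wake, then yield.
                state.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Waker that unparks the thread driving the future.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Tiny single-future executor, only here so the sketch runs without a runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            Poll::Pending => thread::park(),
        }
    }
}
```

With these definitions, `block_on(ReadFile::new("some/path.txt"))` yields an `io::Result<String>`. In a Tokio application you would not write this by hand: `tokio::fs::read_to_string(path).await` or `tokio::task::spawn_blocking` give you the same effect.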

Firehouse answered 9/1, 2022 at 11:40 Comment(0)

