How can I asyncio schedule a filesystem stat operation?

Converting some code to asyncio, I'd like to give control back to the asyncio.BaseEventLoop as quickly as possible. That means avoiding blocking waits.

Without asyncio I'd use os.stat() or pathlib.Path.stat() to obtain e.g. the filesize. Is there a way to do this efficiently with asyncio?

Can I just wrap the stat() call so it becomes a future, similar to what's described here?

Sagittate answered 24/6, 2016 at 7:55 Comment(2)
You mean: you want a non-blocking os.stat() so other coroutines can run during it? – Mouse
@Julien: Yes, I think so ;-) To have the main code run in parallel, I would be forced to use threads instead of asyncio, wouldn't I? – Sagittate

os.stat() translates to a stat syscall:

$ strace python3 -c 'import os; os.stat("/")'
[...]
stat("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[...]

which is blocking, and there's no way to get a non-blocking stat syscall.

asyncio provides non-blocking I/O by using non-blocking system calls, which already exist (see man fcntl, with its O_NONBLOCK flag, or ioctl). So asyncio is not making syscalls asynchronous; it exposes already non-blocking syscalls in a nice way.
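
To illustrate what that flag does, here is a minimal sketch using a pipe (asyncio sets this up for you on its sockets and pipes; you don't do it by hand):

import fcntl
import os

r, w = os.pipe()
# Set the read end to non-blocking: the same O_NONBLOCK flag the
# event loop relies on for its file descriptors.
flags = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    os.read(r, 1)  # nothing has been written yet
except BlockingIOError:
    print("read returns immediately instead of waiting")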

It's still possible to use the nice ThreadPoolExecutor abstraction (via loop.run_in_executor()) to run your blocking stat calls in a pool of threads.
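
A minimal sketch of that approach (the path is just an example; passing None to run_in_executor uses the loop's default thread pool):

import asyncio
import os

async def async_stat(path):
    loop = asyncio.get_event_loop()
    # os.stat blocks, so run it in a worker thread; the event loop
    # stays free to run other coroutines while the syscall executes.
    return await loop.run_in_executor(None, os.stat, path)

loop = asyncio.get_event_loop()
stat_result = loop.run_until_complete(async_stat("/"))
print(stat_result.st_size)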

But you may first consider some other factors:

  • According to strace -T, stat is fast: stat("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000007>, probably faster than starting and synchronizing threads.
  • stat is probably I/O-bound in many cases, so using more CPUs won't help.
  • Doing parallel I/O may turn nice sequential access into random access; a physical hard drive may be slower in this context.

But there are also cases where your stats may be faster with a thread pool, for example if you're hitting a distributed file system.

You may also take a look at functools.lru_cache: if you're doing multiple stats on the same file or directory, and you're sure it has not changed, caching the result avoids a syscall.
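
A minimal sketch of that caching idea (only correct as long as the files really don't change):

import functools
import os

@functools.lru_cache(maxsize=None)
def cached_stat(path):
    # Results go stale if the file changes, so only use this when
    # you know the file is not modified while the cache is alive.
    return os.stat(path)

cached_stat("/")  # performs the stat syscall
cached_stat("/")  # served from the cache, no syscall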

To conclude, "keep it simple": os.stat() is the efficient way to get a file size.

Mouse answered 24/6, 2016 at 20:58 Comment(2)
Thanks for your good thoughts and observations on this matter! I feared (or hoped?) as much. I have potentially millions of stats, and I indeed do not want to parallelize those (the filesystem is a highly parallelized NAS, but on an NFS mount). The idea was to queue/pool the stats, but be able to do other Python-based bookkeeping in parallel, so as not to wait for code execution even after the stats. Caching the stats is not an option, because it's for a tool that's supposed to check for differences on the filesystem (compare to rsync or zfs' scrub mechanism). – Sagittate
For now I'll leave the stats as they are. Later I might compare this to pushing them into one separate thread which communicates with the main thread via queues. It might still be faster to just do the stats inline with the rest of the code. Thanks again! – Sagittate
