Using select/poll/kqueue/kevent to watch a directory for new files
M

6

7

In my app I need to watch a directory for new files. The amount of traffic is very large and there are going to be a minimum of hundreds of new files per second appearing. Currently I'm using a busy loop with this kind of idea:

import os
import time

while True:
    time.sleep(0.2)
    if os.listdir('.'):
        pass  # do stuff

After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead.

I'm trying to use one of the available classes in select to poll my directory, but I'm not sure if it actually works, or if I'm just doing it wrong.

I get an fd for my directory with:

fd = os.open('.', os.O_DIRECT)

I've then tried several methods to see when the directory changes. As an example, one of the things I tried was:

poll = select.poll()
poll.register(fd, select.POLLIN)

poll.poll()  # returns [(fd, 1)], meaning 'ready to read'

os.read(fd, 4096)  # returns mostly gibberish, but I can see the names of the files/folders in the directory

poll.poll()  # returns [(fd, 1)] again

os.read(fd, 4096)  # empty string - no more data

Why is poll() acting like there is more information to read? I assumed that it would only do that if something had changed in the directory.

Is what I'm trying to do here even possible?

If not, is there a better alternative to a while True: loop that checks for changes?

Meli answered 22/7, 2009 at 14:13 Comment(0)
K
1

After running profiling I'm seeing a lot of time spent in the sleep, and I'm wondering if I should change this to use polling instead.

It looks like you are already polling synchronously by checking the state at regular intervals. Don't worry about the time "spent" in sleep; it doesn't consume CPU time. Sleeping just hands control to the operating system, which wakes the process up after the requested timeout.
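One way to check this for yourself is to compare wall-clock time with CPU time across a sleep. This is only a minimal sketch: the helper name measure_sleep is made up for illustration, and the exact numbers will vary by machine.

```python
import time


def measure_sleep(seconds):
    """Return (wall_seconds, cpu_seconds) consumed while sleeping."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process only
    time.sleep(seconds)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = measure_sleep(0.2)
print('wall: %.3fs  cpu: %.3fs' % (wall, cpu))
```

The wall-clock figure should be close to the requested 0.2 seconds while the CPU figure stays near zero, which is why a profiler that reports wall time makes the sleep look expensive.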

You could consider an asynchronous event loop using a library that listens for the filesystem change notifications provided by the operating system, but first consider whether it gives you any real benefit in this particular situation.

Kildare answered 24/7, 2009 at 23:24 Comment(0)
A
8

FreeBSD, and thus Mac OS X, provides an analog of inotify called kqueue. Type man 2 kqueue on a FreeBSD machine for more information. For kqueue on FreeBSD, PyKQueue is available at http://people.freebsd.org/~dwhite/PyKQueue/; unfortunately it is not actively maintained, so your mileage may vary.

Aspidistra answered 22/9, 2009 at 7:24 Comment(1)
Ah, thanks for this. At the time of writing, all the SO questions about watching a directory don't give OS X answers. - Schrimsher
B
4

Why not use a Python wrapper for one of the libraries for monitoring file changes, like gamin or inotify (search for pyinotify; I'm only allowed to post one hyperlink as a new user...)? That should be faster, and the low-level work is already done for you at the C level, using kernel interfaces.

Blount answered 24/7, 2009 at 15:24 Comment(2)
I'm using BSD, so inotify isn't usable, and it looks like gamin isn't either. - Meli
The gamin docs say it's usable on FreeBSD but uses a less optimal polling solution. It may still be faster than anything else, though. - Blount
P
2

If your system has select.kqueue(), it's a really good way to solve this. For example:

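import os
import select

dn = '/tmp'
kq = select.kqueue()
# A read-only descriptor is enough for watching; os.O_DIRECT is not
# defined on every platform (for example, macOS).
fd = os.open(dn, os.O_RDONLY)

# Snapshot of the directory contents to compare against later.
last = set(os.listdir(dn))

# Ask for a vnode event whenever the directory is written to, which
# happens when entries are added or removed. KQ_EV_CLEAR resets the
# event state after each delivery.
kevent = select.kevent(fd, filter=select.KQ_FILTER_VNODE,
    flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
    fflags=select.KQ_NOTE_WRITE)

while True:
    # Blocks (no timeout) until at least one event is available.
    if kq.control([kevent], 1):
        this = set(os.listdir(dn))

        added = list(this.difference(last))
        if added:
            print('  added: %s' % ' '.join(added))

        removed = list(last.difference(this))
        if removed:
            print('removed: %s' % ' '.join(removed))

        last = this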
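
The snapshot comparison in that loop does not depend on kqueue, so it can be pulled out and tried on any platform. The following is only a sketch of that idea: the helper name diff_listing and the file names are hypothetical, and it uses a temporary directory rather than a real watch.

```python
import os
import tempfile


def diff_listing(last, this):
    """Return (added, removed) names between two directory snapshots."""
    return sorted(this - last), sorted(last - this)


# Demonstration in a throwaway directory (no kqueue involved).
with tempfile.TemporaryDirectory() as dn:
    last = set(os.listdir(dn))

    # Simulate two new files arriving.
    open(os.path.join(dn, 'a.txt'), 'w').close()
    open(os.path.join(dn, 'b.txt'), 'w').close()
    this = set(os.listdir(dn))
    added, removed = diff_listing(last, this)
    print('  added: %s' % ' '.join(added))

    # Simulate one file being consumed and deleted.
    os.remove(os.path.join(dn, 'a.txt'))
    after = set(os.listdir(dn))
    added2, removed2 = diff_listing(this, after)
    print('removed: %s' % ' '.join(removed2))
```

Each comparison lists the whole directory again, so with hundreds of new files per second it may be worth processing and removing files promptly to keep the listing small.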
Preceptive answered 26/3, 2020 at 22:19 Comment(0)
C
0

You might want to have a look at select.kqueue. I've not used it, but I believe kqueue is the right interface for this under BSD: you can monitor files and directories and be called back when, and only when, they change.

Cottrell answered 22/9, 2009 at 12:58 Comment(0)
H
0

I've written a library and a shell tool that should handle this for you.

http://github.com/gorakhargosh/watchdog

Although kqueue is a very heavyweight way to monitor directories, I'd appreciate it if you could test it and check for any performance problems you might encounter. Patches are also welcome.

HTH.

Hognut answered 15/12, 2010 at 9:26 Comment(0)
