Why exactly does ePoll scale better than Poll?

Short question, but it's difficult for me to understand.

Why exactly does ePoll scale better than Poll?

Frankforter answered 21/3, 2011 at 21:28 Comment(0)

The poll system call needs to copy your list of file descriptors to the kernel each time it is called. With epoll, that copy happens only once, when you register a descriptor with epoll_ctl, and not on every call to epoll_wait.
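As a minimal C sketch of that difference (error handling omitted; wait_with_poll, register_with_epoll and socks[] are made-up names for illustration, with socks[] assumed to hold already-opened descriptors):

    #include <poll.h>
    #include <sys/epoll.h>

    /* poll: the whole pollfd array crosses the user/kernel boundary
       on every single call. */
    void wait_with_poll(struct pollfd *fds, nfds_t nfds)
    {
        poll(fds, nfds, -1);   /* copies nfds entries in, and the results back out */
    }

    /* epoll: each descriptor is handed to the kernel once, at registration
       time; epoll_wait itself carries no descriptor list at all. */
    int register_with_epoll(const int *socks, int count)
    {
        int epfd = epoll_create1(0);   /* error handling omitted */
        for (int i = 0; i < count; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = socks[i] };
            epoll_ctl(epfd, EPOLL_CTL_ADD, socks[i], &ev);   /* one-time copy */
        }
        return epfd;
    }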

Also, epoll_wait is O(1) with respect to the number of descriptors watched¹, which means it does not matter whether you wait on one descriptor or on 5,000 or 50,000 descriptors. poll, while being more efficient than select, still has to walk over the whole list every time (i.e. it is O(N) with respect to the number of descriptors).

And lastly, in addition to the "normal" (level-triggered) mode, epoll can work in "edge-triggered" mode, which means the kernel does not need to keep track of how much data you have read after readiness has been signalled. This mode is more difficult to grasp, but somewhat more efficient.
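A rough sketch of what edge-triggered mode implies for the application, assuming epfd is an existing epoll instance and sock is a non-blocking socket (add_edge_triggered and drain are hypothetical helper names, not part of the API):

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    /* Register 'sock' (assumed non-blocking) in edge-triggered mode. */
    void add_edge_triggered(int epfd, int sock)
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sock };
        epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
    }

    /* After an EPOLLIN notification the application drains the socket itself:
       the kernel will not re-announce data it has already signalled, so read
       until EAGAIN/EWOULDBLOCK before calling epoll_wait again. */
    void drain(int sock)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(sock, buf, sizeof buf);
            if (n > 0)
                continue;        /* hand the n bytes to the application */
            if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
                break;           /* fully drained; wait for the next edge */
            break;               /* n == 0 (EOF) or a real error */
        }
    }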


¹ As correctly pointed out by David Schwartz, epoll_wait is of course still O(N) with respect to the events that occur. There is hardly a way it could be any different, with any interface: if N events happen on a watched descriptor, then the application needs to get N notifications and do N "things" in order to react to what is happening.
This is again slightly, but not fundamentally, different in edge-triggered mode, where you actually get M events with M ≤ N. When the same event (say, POLLIN) happens several times, you will probably get fewer notifications, possibly only a single one. However, this does not change the big-O characterization as such.

However, the cost of epoll_wait is independent of the number of descriptors watched. Under the assumption that it is used in the intended, "normal" way (that is, many descriptors, few events), this is what really matters, and here it is indeed O(1).

As an analogy, think of a hash table. A hash table accesses its content in O(1), but one could argue that computing the hash is actually O(N) with respect to the key length. That is technically correct, and there probably exist cases where it is a problem; for most people, however, it simply does not matter.

Valuator answered 21/3, 2011 at 21:31 Comment(5)
Thank you for the answer. But if, with epoll, I only need to copy the file descriptors once, how can I change the file descriptor list from user space? – Frankforter
You add/remove/change descriptors to/from/in the set using epoll_ctl. This changes kernel structures on your request (e.g. it adds the eventpoll structure to a file descriptor's wait queue). When you call epoll_wait, it just blocks. When something happens on a file descriptor, it wakes that descriptor's wait queue (in reality it is more complicated than that; there is another queue inside the eventpoll struct so several threads can block on the same fd, but basically that's it). So it is really a "targeted notify" rather than a "searching poll", which is why it is O(1) too. – Valuator
Actually, epoll_wait is O(N), not O(1). The number of discovered sockets is O(N) with epoll_wait, just as it is with poll. Each of those discovered sockets has to be copied into the application's epoll_event structures. If you double the number of sockets you poll on, you double the number of sockets expected to have activity. – Massicot
@DavidSchwartz: You are in some way correct; epoll_wait is O(N) and O(1) at the same time. But what's important is that it is O(1) for the "normal" scenario it was designed for. The assumption is that you watch many (dozens, hundreds, thousands of) descriptors and few events (usually fewer than 5) actually happen at a time. epoll cannot do "magic"; it is not (and cannot be) O(1) with respect to the events that happen - if N events occur, then something must be done for N events. However, if N ≈ 1, that is practically as good as O(1). What matters more is ... – Valuator
... that it is O(1) with respect to the number of descriptors. Because if you have N ≈ 10000, then O(1) and O(N) are not the same thing at all (quite obviously). Here, it really, really matters. Your objection is correct insofar as if one uses the epoll interface in a way that is contrary to its intended use and expects "magic" to happen, one may be disappointed. In such a case, plain old poll may be both easier and just as fast. I've made mention of that in this lengthy answer that talks about epoll among others (ca. in the middle). – Valuator

While Damon's reason is correct for the unusual case where you never block on a socket, in typical real-world programs, the reason is completely different. A typical program looks like this:

1) Do all the work we can do now.

2) Check if any network connections need service, blocking if there's nothing to do.

3) Service any network connections discovered.

4) Go to step 1.

Typically, because you just did all the work you could, when you come around to step 2 there is no work for you to do, so you'll have to wait a bit. Now, imagine there are 800 sockets you are interested in. The kernel has to put you on the wait queue for each of those 800 sockets. And, a split second later, when data arrives on one of those 800 sockets, the kernel has to remove you from those 800 wait queues. Placing a task on a wait queue requires creating a 'thunk' to link that task to that wait queue. No good optimizations are possible because the kernel has no idea which 800 sockets you'll be waiting for.

With epoll, the epoll socket itself has a wait queue, and the process is only put on that one single wait queue. A thunk is needed to link each of the 800 connections to the epoll wait queue, but that thunk is persistent. You create it by adding a socket to an epoll set, and it remains there until you remove the socket from the set.

When there's activity on the socket, the kernel handles it in the task that detects the activity. When you wait, the kernel already knows if there's a detected event and the kernel only has to put you on that one wait queue. When you wake, it only has to remove you from that one queue.

So it's not so much the copying that's the killer with select or poll, it's the fact that the kernel has to manipulate a massive number of wait queues on each blocking operation.
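A minimal sketch of the loop from steps 1-4 above, written against epoll and assuming epfd has already been populated with EPOLL_CTL_ADD; do_pending_work() and service() stand in for hypothetical application helpers:

    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void do_pending_work(void);   /* hypothetical: step 1 */
    void service(int fd);         /* hypothetical: step 3 */

    /* Blocking inside epoll_wait() parks the task on the single wait queue
       of the epoll descriptor, regardless of how many sockets were
       registered with epoll_ctl() beforehand. */
    void event_loop(int epfd)
    {
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            do_pending_work();                                  /* step 1 */

            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);   /* step 2: block */

            for (int i = 0; i < n; i++)                         /* step 3 */
                service(events[i].data.fd);
        }                                                       /* step 4: repeat */
    }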

Massicot answered 17/1, 2012 at 7:41 Comment(3)
Thanks for the nice write-up. A quick question: given the use case of 4,000 processes on the system, each handling about five file descriptors (FDs), would there be an improvement in overall system performance if epoll were used in lieu of poll? Reading your write-up on the kernel overhead involved with select/poll, I'm guessing there would be. – Turnbuckle
It just occurred to me that the epoll socket will be different for each of the processes, so there would be no single wait queue for the FDs of all the processes. Would it matter whether I used poll or epoll in this scenario? – Turnbuckle
Each process would have its own epoll descriptor with its own wait queue. That would mean that each time a process had to block waiting for I/O, it would only have to be placed on one wait queue instead of about five, and when woken up, removed from only one wait queue instead of about five. Whether that is significant would depend on the I/O load, but it would certainly be better. – Massicot