Should I use epoll or just blocking recv in threads?

I'm trying to write a scalable custom web server. Here's what I have so far:

The main loop and request interpreter are in Cython. The main loop accepts connections and assigns each socket to one of the processes in the pool (it has to be processes, since threads won't get any benefit from multi-core hardware because of the GIL).

Each process has a thread pool. The process assigns the socket to a thread. The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.

Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?

PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.

Kiyokokiyoshi answered 9/9, 2011 at 6:58 Comment(0)

If you're already using multiple threads, epoll doesn't offer you much additional benefit.

The point of epoll is that a single thread can listen for activity on many file descriptors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without spawning additional threads. Threads are relatively cheap (compared to spawning processes), but each one does carry some overhead (after all, each has to maintain its own call stack).
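To make that concrete, here is a minimal sketch of one thread watching several sockets at once. It uses Python's standard `selectors` module, which picks epoll on Linux and falls back to other mechanisms elsewhere, so it is an illustration of the idea rather than a direct use of the epoll API. The names `serve_ready_once` and `demo` are made up for this example, and the socket pairs stand in for real accepted connections.

```python
import selectors
import socket


def serve_ready_once(sel, timeout=1.0):
    """Wait for readiness events and echo back what each ready socket received.

    Returns the number of sockets handled in this pass.
    """
    handled = 0
    for key, _events in sel.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)    # does not block: the selector reported it readable
        if data:
            conn.sendall(data)    # echo; fine for tiny payloads in a sketch
        else:
            sel.unregister(conn)  # an empty read means the peer closed
            conn.close()
        handled += 1
    return handled


def demo():
    """One thread, three connections, two of which have data waiting."""
    sel = selectors.DefaultSelector()  # epoll-backed on Linux
    pairs = [socket.socketpair() for _ in range(3)]
    for server_side, _client in pairs:
        server_side.setblocking(False)
        sel.register(server_side, selectors.EVENT_READ)

    pairs[0][1].sendall(b"hello")
    pairs[2][1].sendall(b"world")

    handled = 0
    for _ in range(10):                # bounded number of passes
        handled += serve_ready_once(sel)
        if handled >= 2:
            break

    replies = [pairs[0][1].recv(4096), pairs[2][1].recv(4096)]

    sel.close()
    for server_side, client in pairs:
        server_side.close()
        client.close()
    return handled, replies
```

The idle connection costs nothing here beyond its registration, whereas in a thread-per-connection design it would still hold a thread blocked in recv.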

If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread count. Whether that's something you care about is up to you. In general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads won't matter, but if you want each worker to handle thousands of open connections, that overhead can become significant (and that's where epoll shines).

But...

What you're describing sounds suspiciously like you're basically reinventing the wheel - your:

  1. main loop and request interpreter
  2. pool of processes

sounds almost exactly like:

  1. nginx (or any other load balancer/reverse proxy)
  2. A pre-forking tornado app

Tornado is a single-threaded Python web server module that uses epoll, and it has built-in support for pre-forking (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the tech created to power FriendFeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for new real-time updates.

If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.


(P.S. I approve of your avatar. <3)

Forwhy answered 9/9, 2011 at 7:3 Comment(5)
Thanks. And yes, I do have a tendency to re-build already existing software under the idea that, since I wrote it, it's easier to modify later. Plus, after all these years I still haven't got the hang of using the build systems all those OS projects use (i.e., anything more complicated than a Makefile). - Kiyokokiyoshi
Just try to keep in mind en.wikipedia.org/wiki/Not_Invented_Here and en.wikipedia.org/wiki/YAGNI - a lot of the time, you won't need to modify it later (or if you do, it's probably the wrong answer for the problem). - Forwhy
As for the "build systems" problem - until you get to really huge infrastructures (at which point, you have a person or team specifically to build packages for you), consider just using the pre-built versions available via package managers in most *nix distros. For instance, sudo apt-get install nginx python-tornado will get you working copies of both nginx and tornado on any modern Ubuntu install. As mentioned at YAGNI above, you probably don't really need to customize the build options for things like servers in most cases. - Forwhy
(And, in the few cases where you do wind up running into software that you need to build from scratch, well... that's exactly what StackOverflow and ServerFault are there to help you with if you get stuck! :)) - Forwhy
@Amber, since you said that spawning new threads adds overhead, do you have any advice for this link: #52024948 - Outstretched

The epoll function (and the other functions in the same family, poll and select) allows you to write single-threaded networking code that manages multiple network connections. Since there is no threading, there is no need for synchronisation as would be required in a multi-threaded program (which can be difficult to get right).

On the other hand, you'll need to have an explicit state machine for each connection. In a threaded program, this state machine is implicit.
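As a rough illustration of what "explicit state machine" means here, the sketch below keeps the parsing progress of one connection in an object, so an event loop can feed it whatever bytes arrive and resume later. The `Connection` class, its states, and the simplified HTTP-like parsing are all assumptions made for this example, not part of any library.

```python
from enum import Enum, auto


class State(Enum):
    READING_HEADERS = auto()
    READING_BODY = auto()
    DONE = auto()


class Connection:
    """Hypothetical per-connection parser state for an event-driven server."""

    def __init__(self):
        self.state = State.READING_HEADERS
        self.buffer = b""
        self.body_length = 0
        self.body = b""

    def feed(self, chunk):
        """Consume one chunk from a non-blocking recv and advance the state."""
        self.buffer += chunk
        if self.state is State.READING_HEADERS and b"\r\n\r\n" in self.buffer:
            head, self.buffer = self.buffer.split(b"\r\n\r\n", 1)
            # Skip the request line, then look for Content-Length.
            for line in head.split(b"\r\n")[1:]:
                name, _, value = line.partition(b":")
                if name.strip().lower() == b"content-length":
                    self.body_length = int(value.strip())
            self.state = State.READING_BODY
        if self.state is State.READING_BODY and len(self.buffer) >= self.body_length:
            self.body = self.buffer[:self.body_length]
            self.state = State.DONE
        return self.state
```

Feeding it `b"POST / HTTP/1.1\r\nContent-Le"`, then `b"ngth: 5\r\n\r\nhe"`, then `b"llo"` walks it through all three states. In a threaded server the same progress would simply be "where the thread currently is" in a straight-line function.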

Those functions just offer another way to multiplex multiple connections in a process. Sometimes it is easier not to use threads; other times you're already using threads, and thus it is easier just to use blocking sockets (which release the GIL in Python).
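For comparison, the threaded approach with blocking sockets might look roughly like the sketch below. The handler is a plain loop, and its position in that loop is the implicit state. `handle_connection` and `demo` are hypothetical names, the upper-casing stands in for real request handling, and a socket pair stands in for an accepted connection.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def handle_connection(conn):
    """Runs in a worker thread: block on recv until the peer stops sending."""
    received = b""
    with conn:
        while True:
            chunk = conn.recv(4096)      # blocks only this thread; the GIL is released while waiting
            if not chunk:
                break                    # peer closed its sending side
            received += chunk
            conn.sendall(chunk.upper())  # placeholder for real request handling
    return received


def demo():
    server_side, client = socket.socketpair()
    with ThreadPoolExecutor(max_workers=4) as pool:
        future = pool.submit(handle_connection, server_side)
        client.sendall(b"ping")
        reply = client.recv(4096)
        client.shutdown(socket.SHUT_WR)  # lets the worker's recv return b""
        received = future.result(timeout=5)
    client.close()
    return reply, received
```

Each open connection occupies one pool thread for as long as it stays open, which is the cost that event-driven multiplexing avoids.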

Stome answered 9/9, 2011 at 7:4 Comment(0)
