Hacker News

Why doesn't NGINX implement work stealing? Wouldn't that help?


You can control the queue, but not the entries. You can pass the whole queue/socket around via sendmsg(), but not individual entries in the queue. This is hard to solve well in user space.


So, what would a non-userspace solution look like? An HTTP keepalive + pipelining + HTTP2 implementation in the kernel that forwards the demuxed messages as separate packets onto a specially-typed socket, such that a prefork daemon can accept(2) individual HTTP request packets?


There are a few problems outlined in the article. I was referring to this one:

"epoll() seem to have LIFO behavior which can result in uneven load balancing across workers (in case of accept()ing from a shared socket)"

Which is unrelated to keepalives and pipelining and best addressed in the kernel.


You can accept a socket and pass it to worker processes/threads using whatever scheme you like.


There are a few problems mentioned in the article, but the main one is before that. It's how to spread the load of the accept() on a single shared socket across more than one worker in a balanced way.


Right, they use EPOLLEXCLUSIVE, then complain when no other waiters are woken up. Thundering herd is a problem when you're using hundreds of threads. If you have a smaller number of waiters it's irrelevant. Moreover, under load only a few waiters, if any, will be sleeping on the queue when a new connection comes in, so there won't be any herd at all.

The round-robin patches seem to have stalled precisely because everyone is bickering over solutions to problems that they've partly created for themselves. They've gone down the rabbit hole and disappeared.

If you want strongly fair scheduling and latency guarantees, just dequeue the sockets and assign them however you want. You introduce a small amount of latency, but you'd get similar latency by enforcing round-robin behavior in the kernel. LIFO is the effective behavior precisely because it's faster for the already running process to dequeue the next event than to park the running process and fire up a sleeping process.

Everybody talks about Windows IOCP, but Windows does precisely this same thing: a pool of threads with a simple scheduling scheme that uses similar asynchronous polling interfaces--just not interfaces exposed to user processes. I suppose it's a PITA to implement this in user space compared to the prepackaged solution Windows provides, but then again the Unix/Linux model is to make it easy to implement these solutions yourself, as opposed to providing one solution without exposing the underlying interfaces. Remember, scheduling, locking, and IPC primitives are much faster on Linux than on Windows. That's not coincidental.

If you're not prepared to roll your own--if you want the vendor to do all the work for you--don't use Linux. The best features (and feature additions) in the Unix universe aren't oriented toward specific production problems, but toward interfaces that make it easier to roll your own solutions. Adding more flags to epoll long ago passed the point of diminishing returns.



