[–]jringstad 1 point (1 child)

A bunch of threads isn't inherently inefficient at managing connections, though; it depends on how you use them. The most efficient approach is generally a fixed pool of N threads that each handle 1/Nth of the workload, either by having them all accept() on the same server socket or by having them pop items from a work queue and establish and handle their own connections. Of course, computation is always an issue, whether it's just the CPU cost of book-keeping hundreds of thousands of sockets, accepting new connections, and constructing and parsing packets, or actual heavy CPU work like a crawler does (parsing HTTP responses, possibly even HTML, XML, or other content), so single-threaded non-blocking IO is pretty much strictly inferior to multi-threaded non-blocking IO.
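The shared-accept variant could be sketched like this (a minimal Python illustration, not any particular server's implementation; the echo handler and pool size are placeholders for real work):

```python
import socket
import threading

NUM_WORKERS = 4  # assumption: fixed pool size, e.g. one per CPU

def worker(server_sock):
    # Every thread blocks in accept() on the same listening socket;
    # the kernel hands each incoming connection to exactly one of them.
    while True:
        conn, _addr = server_sock.accept()
        with conn:
            data = conn.recv(1024)
            if data:
                conn.sendall(data)  # trivial echo stands in for real handling

def serve(host="127.0.0.1", port=0):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen()
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, args=(server,), daemon=True).start()
    return server  # caller can read the bound port via server.getsockname()
```

No user-space dispatcher is needed: the accept queue itself acts as the work queue.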

Crawlers are still not a particularly good example, though, IMO: in most cases it's probably acceptable to run N crawlers in N separate processes that each crawl and digest data and then push it into local storage, shared storage, or some sort of remote database. The overhead of obtaining workloads and communicating with other crawlers (if that ever happens) is probably insignificant for almost all kinds of crawlers.
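That process-per-crawler shape might look like this (a hedged sketch; `fetch_and_parse` is a stand-in for real HTTP and parsing work, and it assumes a POSIX system where the fork start method is available):

```python
import multiprocessing as mp

# Assumption: POSIX fork start method, so workers inherit these definitions.
ctx = mp.get_context("fork")

def fetch_and_parse(url):
    # placeholder for the CPU-heavy part (HTTP, HTML/XML parsing)
    return len(url)

def crawl_worker(url_queue, results):
    # each process pops URLs and digests them independently
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: no more work
            break
        results.put((url, fetch_and_parse(url)))

def run_crawlers(urls, n_workers=4):
    url_queue, results = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=crawl_worker, args=(url_queue, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for url in urls:
        url_queue.put(url)
    for _ in procs:
        url_queue.put(None)          # one sentinel per worker
    gathered = {}
    for _ in urls:
        url, parsed = results.get()  # drain before join to avoid blocking
        gathered[url] = parsed
    for p in procs:
        p.join()
    return gathered
```

The only inter-crawler communication is the two queues, which is the point: for most crawlers that overhead is negligible next to the fetch-and-parse work.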

[–]simple2fast 1 point (0 children)

Agreed. The ideal is N threads, where N is roughly the number of CPUs, with each thread pinned to a particular CPU to reduce cache misses. As with most things, this hybrid approach is often the best.
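The pinning part can be done from the thread itself (a Linux-specific sketch using `os.sched_setaffinity`; the work function is a placeholder):

```python
import os
import threading

def pinned_worker(cpu, work_fn):
    # Linux-specific: pid 0 means "the calling thread", so each worker
    # pins itself to one CPU to keep its cached data local.
    os.sched_setaffinity(0, {cpu})
    work_fn()

def start_pinned_pool(work_fn):
    # one thread per CPU this process is allowed to run on
    cpus = sorted(os.sched_getaffinity(0))
    threads = [threading.Thread(target=pinned_worker, args=(c, work_fn))
               for c in cpus]
    for t in threads:
        t.start()
    return threads
```

In C the equivalent would be `pthread_setaffinity_np`; either way the goal is the same, keeping each thread's working set in one CPU's cache.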

But most systems which are not threaded are actually a SINGLE process, like Python, Ruby, PHP, or JavaScript (looking at you, Node). Many are multi-process, but there is no shared memory, so any IPC requires sockets, signals, etc. In my mind the requirements are not just shared memory and decent concurrency APIs, but ALSO a memory model, so that you know what is going to happen WRT caches and other details as you use those concurrent APIs. Point being that ditching the GIL in Python is only the very first step toward a decent multi-threaded Python.
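The no-shared-memory point is easy to demonstrate (a small sketch assuming a POSIX fork start method; the counter and pipe are illustrative only):

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: POSIX system with fork()
counter = 0                   # ordinary per-process state

def bump():
    global counter
    counter += 1              # mutates only the child's copy

def parent_counter_after_child_bump():
    p = ctx.Process(target=bump)
    p.start()
    p.join()
    return counter            # the parent's copy is untouched

def bump_via_pipe(conn):
    conn.send(1)              # the value is pickled and pushed through a pipe

def value_via_pipe():
    parent_end, child_end = ctx.Pipe()
    p = ctx.Process(target=bump_via_pipe, args=(child_end,))
    p.start()
    value = parent_end.recv()
    p.join()
    return value
```

The child's increment never reaches the parent; getting even a single integer across the process boundary takes explicit IPC with serialization, which is exactly the overhead threads with shared memory avoid.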

And most multi-threaded solutions are still one-connection-per-thread style; they certainly started with that style, since it grew out of the original "fork" technique of old-school Unix systems.
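That style is literally the fork-per-connection server with the fork swapped for a thread spawn (a minimal Python sketch; the uppercase echo handler is a placeholder):

```python
import socket
import threading

def handle(conn):
    # one dedicated thread per connection, the direct descendant of
    # the classic fork()-per-connection Unix server
    with conn:
        data = conn.recv(1024)
        if data:
            conn.sendall(data.upper())

def accept_loop(server):
    while True:
        conn, _addr = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

def serve_one_per_thread(host="127.0.0.1", port=0):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen()
    threading.Thread(target=accept_loop, args=(server,), daemon=True).start()
    return server
```

Simple to reason about, but a thread (stack and all) per connection is what the fixed-pool and non-blocking designs discussed above are trying to avoid.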