No, that's not the proper way of doing it. You create worker threads in a separate process that receive data from node.js, encode it, and send it back. You don't fork on every request.
Where is he suggesting forking on every request? My reading is that he's suggesting the main process handles web requests and a single other process handles the encoding.
The suggested approach is to separate the I/O-bound task of receiving uploads and serving downloads from the compute-bound task of video encoding.
I'm assuming that's done with something like child_process.fork, creating a video-encode queue separate from the main event loop.
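For illustration, a rough sketch of that shape in Python (since the later comments in this thread are Python-centric; in Node the equivalent move would be child_process.fork as above). encode_video here is a hypothetical placeholder, not anyone's actual code:

    import multiprocessing as mp

    def encode_video(path):
        # placeholder for the real CPU-bound work (e.g. shelling out to ffmpeg)
        print("encoding", path)

    def encoder_worker(jobs):
        # long-lived worker: started once, pulls jobs until it sees the sentinel
        for path in iter(jobs.get, None):
            encode_video(path)

    if __name__ == "__main__":
        jobs = mp.Queue()
        worker = mp.Process(target=encoder_worker, args=(jobs,))
        worker.start()                 # spawned once, not per request

        # the web-facing side just enqueues work and stays free for I/O
        for upload in ["a.mp4", "b.mp4", "c.mp4"]:
            jobs.put(upload)

        jobs.put(None)                 # sentinel: tell the worker to exit
        worker.join()

The request-handling side only enqueues filenames, so it stays free to do I/O while the single long-lived worker grinds through the encodes.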
Once you're doing that, why not just use threads and get rid of the IPC?
That's kind of the point of the Node.js bashing: once you work around all its pitfalls, you're right back where you started, except now you're writing your app in a language unsuited for the purpose.
Node solves the problem of needing to write evented servers in JavaScript. Beyond that I can't see much advantage in it vs. existing languages. If I wrote something called "Node.NET" which was a JScript wrapper around completion ports and went around telling everyone that this was the future of webdev... what do you think the reaction would be?
fork+exec is even more expensive than fork. (Although in the case of video encoding the overhead is negligible. Problems with Node.js are more likely to show up when an event sometimes takes 100-1000 ms; that's slow enough to hurt response times of other requests, but maybe not worth farming out to another process.)
Now writing a high-throughput server is simply a matter of writing a high-throughput worker to generate Fibonacci numbers and connecting it reliably through interprocess communication to your node.js shim.
Thank god node.js was there to save me all that work.
1. Process creation and termination are heavy operations, especially when done in a tight loop.
2. You don't want to fork on every request; that leaves you very vulnerable to fork bombs.
3. In the worker-thread model, you already have the worker threads spawned and ready for crunching, which reduces system load because you aren't forking on every request (see the sketch below).
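A minimal sketch of point 3, using Python's multiprocessing.Pool purely as an illustration (any pre-spawned pool behaves the same way): the workers are created once and then reused for every job, so nothing is forked per request.

    import multiprocessing as mp

    def handle_request(n):
        # stand-in for the per-request CPU-bound work
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:      # 4 workers created once, up front
            # thousands of "requests" reuse the same 4 processes;
            # no fork happens inside this loop
            results = pool.map(handle_request, range(1000))
        print(len(results), "requests handled by 4 long-lived workers")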
From an OS point of view, both of those tests are doing almost exactly the same thing: make a system call, have the kernel spawn a new unit of execution (whether it's a thread or a process), wait for the child to be scheduled, have it terminate, and wait for the operating system to notify the parent of the termination. There's a little extra bookkeeping in this example for the child process, but not much at all thanks to copy-on-write. If you were to do something dumb in the child, like memsetting a 1MB buffer to 1, I assume the child process would be far slower due to page faults.
There are certainly advantages and disadvantages to both threads and processes, but it's not really a fair comparison to claim that processes are as fast as or faster than threads just because you can spawn them at a certain rate. The performance cost of separate processes is something you pay gradually, every time you have to take a page fault and copy a 4KB page.
The post you link to spawns a new thread and tests that speed. When you have a pool, it takes roughly zero milliseconds to tell it to do something. All you have is whatever locking is required to synchronize between the enqueueing side (the request processor) and the worker threads, which may even just be a volatile.
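A rough way to see the difference (numbers are machine-dependent and this is only a sketch): time handing 100 jobs to a pool that already exists versus creating 100 fresh threads.

    import time
    import threading
    from concurrent.futures import ThreadPoolExecutor

    def work():
        pass

    pool = ThreadPoolExecutor(max_workers=4)    # created once, up front

    start = time.time()
    futures = [pool.submit(work) for _ in range(100)]   # dispatch to existing workers
    for f in futures:
        f.result()
    print("dispatching 100 jobs to an existing pool: %.4fs" % (time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=work) for _ in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("creating 100 new threads:                 %.4fs" % (time.time() - start))

    pool.shutdown()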
Then again, as someone else said, the amount of time it takes to spawn a process to encode a video relative to the amount of time it takes to do the encode is probably trivial, and you would benefit from having the process isolation in case something goes wonky.
You would probably write some sort of "process gate," though in a distributed architecture I'd do this with some sort of distributed work queue. I did this in .NET many years ago for a similar service: http://blog.jdconley.com/2007/09/asyncify-your-code.html
I suspect those numbers have nothing to do with the cost of the system creating a thread or a process, but instead are artifacts of how Python handles threads.
When Python forks a process using the multiprocessing module, that process can execute concurrently with the parent process. On a multicore machine, they can run simultaneously.
When Python spawns a thread, the thread and the parent process cannot execute Python code concurrently. They both need to grab the Global Interpreter Lock (GIL). Whoever holds it can execute; whoever does not must wait.
So, I suspect that what we are seeing is that even though the new processes/threads have very little work to do, the processes can exit faster because they don't have to wait for the parent process to give up the GIL. This is a misguided experiment.
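A small sketch of the effect being described (illustrative only): pure-Python CPU work in two threads takes about as long as it would serially because of the GIL, while two processes can actually run on two cores.

    import time
    import threading
    import multiprocessing

    def burn():
        # pure-Python CPU work, so the GIL is held the whole time
        sum(i * i for i in range(2000000))

    def timed(label, cls):
        start = time.time()
        workers = [cls(target=burn) for _ in range(2)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print("%s: %.2fs" % (label, time.time() - start))

    if __name__ == "__main__":
        timed("2 threads  ", threading.Thread)
        timed("2 processes", multiprocessing.Process)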
Yes, I suspect that you're right; I also thought it might be due to the GIL.
I reran this test with Python 2.7 and it no longer appears to be true:
Spawning 100 children with Thread took 0.03s
Spawning 100 children with Process took 0.28s
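(For reference, a test of this kind presumably looks something like the sketch below; this is a reconstruction, not the original code.)

    import time
    import threading
    import multiprocessing

    def child():
        pass

    def spawn_100(cls):
        # spawn 100 do-nothing children, wait for all of them, and time it
        start = time.time()
        children = [cls(target=child) for _ in range(100)]
        for c in children:
            c.start()
        for c in children:
            c.join()
        return time.time() - start

    if __name__ == "__main__":
        print("Spawning 100 children with Thread took %.2fs" % spawn_100(threading.Thread))
        print("Spawning 100 children with Process took %.2fs" % spawn_100(multiprocessing.Process))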
I'm not sure to what extent the GIL was improved in 2.7, but it's possible that it was never the cause to begin with.
Regardless, I don't think it's a misguided experiment; it was an objective observation. It shows that things aren't so black and white, depending on your toolchain.
I think the experiment is misguided for several reasons. One, process/thread creation time is negligible in practice: the general approach is to create worker threads/processes that live for the lifetime of the program and then farm work out to them as needed. This separates handing out work from its actual execution.
Two, threads don't buy you parallelism in Python, unless the majority of the work is being done in C modules.
Finally, this test is really just testing the multiprocessing and threading packages provided by Python. I say it's misguided because, from the way the author talks about it, I don't think he understands the difference between those abstractions and OS threads and processes. (Which are, of course, abstractions as well.) I suspect the Python overhead will be more than the difference in cost between creating OS-level threads and processes.