I have run a similar benchmark; it is actually the MVars that kill you on memory. If you can omit them, you can spawn way more than 3 million threads.
(I have not had enough coffee today, so I can't say how useful this setup would be. I have done threads-without-MVars for a TCP server, but my OS refused to give me more than 300,000 pipes. Disappointing.)
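
A minimal sketch of the threads-without-MVars idea, not the benchmark mentioned above: every thread bumps one shared `IORef` counter instead of owning its own MVar, then stays parked so that all of them are alive at once. The name `spawnAndCount`, the thread count, and the polling interval are assumptions made up for illustration.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Hypothetical helper: fork n threads that share a single IORef counter
-- (no per-thread MVar) and return the counter once every thread has run.
spawnAndCount :: Int -> IO Int
spawnAndCount n = do
  counter <- newIORef (0 :: Int)
  replicateM_ n $ forkIO $ do
    -- Atomically record that this thread has started.
    atomicModifyIORef' counter (\c -> (c + 1, ()))
    -- Keep the thread alive by sleeping in a loop; it is killed when
    -- the program exits.
    forever (threadDelay 60000000)
  waitFor counter
  where
    -- Poll the counter instead of blocking on an MVar.
    waitFor ref = do
      done <- readIORef ref
      if done >= n
        then return done
        else threadDelay 1000 >> waitFor ref

main :: IO ()
main = spawnAndCount 100000 >>= print
```

Polling is cruder than blocking on an MVar, but it keeps the per-thread cost down to the thread itself. To try larger counts, raise the argument in `main` and watch the process's memory use.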