At least in what I do, I find 80% of my parallelism needs covered by pool.map/pool.imap_unordered. Of the remaining 20%, another 80% can be solved by communicating through queues or channels (though admittedly this is smoother in Erlang or Rust than in Python).
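For the common case, a minimal sketch of the pool.map pattern (the function and names here are illustrative, not from any particular codebase; the worker must be a top-level function so it can be pickled):

```python
from multiprocessing import Pool

def square(n):
    # Stand-in for a real CPU-bound task
    return n * n

def run_parallel(values, workers=4):
    # pool.map preserves input order; swap in pool.imap_unordered
    # to consume results as they finish instead.
    with Pool(processes=workers) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(run_parallel(range(8)))
```

The `if __name__ == "__main__":` guard matters on platforms that spawn rather than fork, since child processes re-import the module.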
Of course that's not true for everything, and depending on the domain, tree dataflows can also be great. I remember them being very popular in GPGPU tasks because synchronization is very costly there.
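The appeal of the tree shape is that each round's pairwise combines are independent, so you get log2(n) rounds with no synchronization inside a round. A rough CPU-side sketch (names are mine; a real GPGPU version would do this per-warp or per-block):

```python
from multiprocessing import Pool
from operator import add

def _combine(args):
    # Top-level helper so it can be pickled for the pool
    fn, pair = args
    return fn(*pair) if len(pair) == 2 else pair[0]

def tree_reduce(fn, values, workers=4):
    # Pairwise-combine in log2(n) rounds; pairs within a round are
    # independent, so the only sync point is between rounds.
    with Pool(processes=workers) as pool:
        while len(values) > 1:
            pairs = [tuple(values[i:i + 2]) for i in range(0, len(values), 2)]
            values = pool.map(_combine, [(fn, p) for p in pairs])
    return values[0]

if __name__ == "__main__":
    print(tree_reduce(add, list(range(16))))
```

Note `fn` also has to be picklable (operator.add is; a lambda isn't).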