Pretty much that. It's not just the context switching; it's that it can be tricky to come up with a synchronization strategy that doesn't confound the CPU scheduler's efforts to keep the pipeline full.
It also depends on your environment. That second app I mentioned ran on physical hardware shared with many other applications. In that kind of environment, you can end up in a sort of "double your CPU cores, double your cache misses" situation. And the performance story ends up not just being about one little module; it's about the entire system. There can be a sort of performance prisoner's dilemma, where trying to individually maximize the performance of every single piece in isolation actually results in slower overall performance.