It can be solved with speaker segmentation/embedding models, although the results aren't perfect. One thing we do in Hyprnote is ship a Descript-like transcript editor that lets you easily edit and assign speakers. Once we integrate a speaker diarization model with that, I think we'll be in good shape.
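
For anyone curious what the embedding side roughly looks like, here is a minimal sketch. To be clear, this is not Hyprnote's code: it assumes some embedding model has already turned each speech segment into a vector, and `assign_speakers` and the `threshold` value are made-up names and numbers for illustration. Each segment joins the most similar known speaker by cosine similarity, or starts a new speaker if nothing is close enough.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign_speakers(embeddings, threshold=0.7):
    """Greedy clustering of per-segment speaker embeddings.

    `threshold` is a hypothetical cutoff; a real system would tune it
    for the embedding model in use.
    """
    centroids = []  # running sum of embeddings per speaker
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # Nothing similar enough: treat this segment as a new speaker.
            centroids.append(list(emb))
            best = len(centroids) - 1
        else:
            # Fold the segment into the matched speaker's running sum.
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
        labels.append(best)
    return labels


# Toy 2-D "embeddings" for five segments from two speakers.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [1.0, 0.05]]
print(assign_speakers(segments))  # [0, 0, 1, 1, 0]
```

A greedy pass like this mislabels segments when voices are similar or segments are short, which is why a manual edit/assign step on top is still useful.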
If you are interested, you can join our Discord and follow updates. :) https://hyprnote.com/discord