It's highly dependent on both workload and instruction set.

I'm sure it could be automatically optimized in theory, even without the solution being AI-complete, but I don't think we have any idea how to do it right now.

No, not unless you're reflashing an FPGA. You'd have better luck sharing subcores for threadlets, I think.