
Sure, CUDA has a lot of highly optimized utilities baked in (cuDNN and the like), and maybe more importantly, implementors have a lot of experience with it. But afaict everyone is working on their own HAL/compiler rather than using CUDA directly to implement the actual models; CUDA is just one backend of the HAL/framework. You could probably port any of these frameworks to a new hardware platform with a few man-years' worth of work, imo, if you can spare the manpower.

I think nobody has had the time to port any of these architectures away from CUDA because:

* the leaders want to maintain their lead and everyone else needs to catch up asap, so no time to waste,
* progress was _super_ fast, so doubly no time to waste,
* there was/is plenty of money, which buys some perceived value in maintaining the lead or catching up.

But imo:

1. progress has slowed a bit, so maybe there's time to explore alternatives,
2. Nvidia GPUs are pretty hard to come by, so switching vendors may actually be a competitive advantage (if performance/price pans out and you can actually buy the hardware now rather than later).

In terms of ML "compilers"/frameworks, afaik there's:

* Google JAX/TensorFlow XLA/MLIR
* OpenAI Triton
* Meta Glow
* Apple's PyTorch+Metal fork
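To illustrate the hardware-abstraction point with the first of these: in JAX, user code never touches CUDA directly; `jax.jit` hands the traced computation to XLA, which lowers it to whatever backend is available (CPU, CUDA/ROCm GPU, or TPU). A minimal sketch (real `jax.jit`/`jax.numpy` API; the function and values are just an example):

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, compiled by XLA for the current backend
def scaled_sum(x):
    # Same source compiles to CPU, GPU, or TPU code unchanged.
    return jnp.sum(x * 2.0)

x = jnp.arange(4.0)          # [0., 1., 2., 3.]
result = scaled_sum(x)       # 2*(0+1+2+3) = 12.0
print(jax.default_backend(), float(result))
```

Nothing in that snippet mentions the accelerator; switching vendors is (in principle) a matter of swapping the XLA backend, which is exactly the kind of portability the comment above is pointing at.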


