
Thank you – you are right that these are very important topics, and we also had to put in a lot of work at Cruise to scale training beyond a single node. We had training jobs running over dozens of GPU nodes for many days. For example, we had a dedicated team to optimize streaming of training data into PyTorch dataloaders. This naturally requires more infrastructure, as well as many features around fault tolerance, checkpointing, warm restarts, etc.
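
To make the dataloader point concrete, here is a minimal sketch of the streaming pattern (a generic PyTorch IterableDataset that reads shards one at a time; the shard URIs and the reader are placeholders, not Cruise's actual pipeline):

    import torch
    from torch.utils.data import IterableDataset, DataLoader, get_worker_info

    class ShardStream(IterableDataset):
        # Streams samples shard by shard so the full dataset never has to
        # sit on local disk.
        def __init__(self, shard_uris):
            self.shard_uris = shard_uris

        def _read_shard(self, uri):
            # Placeholder reader; in practice this would pull from GCS/S3
            # and decode records.
            for _ in range(1000):
                yield torch.randn(64), 0

        def __iter__(self):
            info = get_worker_info()
            # Split the shard list across dataloader workers so they do not
            # duplicate work.
            uris = (self.shard_uris if info is None
                    else self.shard_uris[info.id::info.num_workers])
            for uri in uris:
                yield from self._read_shard(uri)

    loader = DataLoader(
        ShardStream([f"gs://bucket/shard-{i:05d}" for i in range(128)]),
        batch_size=32, num_workers=4)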

We are a very new framework (launched publicly July 1st :-), so there is much work to be done to cover many more example use cases.

What we have found powerful about this plain-function approach is that users can submit jobs to remote platforms (e.g. Spark, Google Dataflow, etc.) and use heterogeneous resources (e.g. standard nodes to launch third-party jobs, then GPU nodes for training, etc.). So whatever "cloud provider X's data solutions" you have to use, if it has a Python API to submit and wait for jobs, you should be fine.
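
As a rough illustration of what the plain-function pattern looks like from the user's side (submit_job and the resources hints below are hypothetical stand-ins, not our actual API):

    def submit_job(fn, *args, resources=None, **kwargs):
        # Local stub: a real backend would schedule `fn` on a node matching
        # `resources` (Kubernetes, Spark, Dataflow, ...) and block until the
        # remote job finishes, returning its result.
        return fn(*args, **kwargs)

    def preprocess(raw_path: str, out_path: str) -> str:
        # Runs on a cheap standard node: e.g. submits a Spark/Dataflow job via
        # the provider's Python API and waits for it to complete.
        return out_path

    def train(data_path: str, ckpt_path: str) -> str:
        # Runs on GPU nodes; this is where the multi-day training happens.
        return ckpt_path

    def pipeline():
        features = submit_job(preprocess, "gs://raw", "gs://features",
                              resources={"cpu": 2})
        ckpt = submit_job(train, features, "gs://checkpoints/run-1",
                          resources={"gpus": 8, "nodes": 4})
        return ckpt

The point is that each step is an ordinary Python function; only the submission layer decides where it runs, so a mix of standard nodes, GPU nodes, or a third-party data service can back the same pipeline.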
