While there is most likely going to be some bias in the training of those kinds of models, we can also hope that transfer learning from other, non-driving videos will at least help generate something close enough to the very real but unusual situations you are mentioning. We could imagine an LLM serving as some kind of fuzzer to create a large variety of prompts for the world model, which, as we can see in the article, seems pretty capable of generating fictional scenarios when asked to.
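Something like this, very roughly (the LLMClient protocol and the scenario categories are made up for illustration; a real pipeline would probably also ask the LLM itself to propose the rare combinations):

    import Foundation

    // Hypothetical sketch of an "LLM as fuzzer": expand random seed conditions into
    // varied scenario prompts that a driving world model could then render.
    protocol LLMClient {
        func complete(_ prompt: String) async throws -> String
    }

    struct ScenarioFuzzer {
        let llm: LLMClient
        let weathers = ["dense fog", "black ice", "low sun glare", "heavy rain"]
        let hazards = ["a mattress falling off a truck", "a deer crossing", "a stalled car in the fast lane"]

        /// Ask the LLM to turn a random (weather, hazard) pair into a detailed
        /// prompt for the world model.
        func nextScenarioPrompt() async throws -> String {
            let weather = weathers.randomElement()!
            let hazard = hazards.randomElement()!
            let metaPrompt = """
            Write a concise driving-scenario description for a video world model.
            Conditions: \(weather). Unusual event: \(hazard).
            Include road type, traffic density, and how the event unfolds over ~10 seconds.
            """
            return try await llm.complete(metaPrompt)
        }
    }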
As always though, the devil lies in the details: is an LLM-based generation pipeline good enough? What even is the definition of "good enough"? Even with good prompts, will the world model output something sufficiently close to reality that it can be used as a good virtual driving environment for further training / testing of autonomous cars? Or do the kinds of limitations you mentioned still mean subtle but dangerous imprecisions will slip through and skew the data distribution too much for this to be a truly viable approach?
My personal feeling is that we will land somewhere in between: I think approaches like this one will be very useful, but I also don't think the current state of AI models means we can get something 100% reliable out of this.
The question is: is 100% reliability a realistic goal? Human drivers are definitely not 100% reliable. If we come up with a solution 10x more reliable than the best human drivers, one that maybe also comes with some hard proof that it cannot exhibit certain classes of catastrophic failure modes (probably via verified-code approaches that, for instance, guarantee that even if the NN output is invalid the car won't attempt maneuvers outside a verifiably safe envelope), then I feel like the public and regulators would be much more inclined to authorize full autonomy.
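To make the "safe envelope" idea concrete, a minimal sketch (all types and limits are invented for illustration, not taken from any real system): the learned policy proposes a command, and a small, separately auditable layer clamps it to pre-verified bounds no matter what the network produced.

    // The envelope layer is deliberately tiny and pure, so it could be formally
    // verified or exhaustively tested independently of the neural network.
    struct Control {
        var steeringAngle: Double  // radians
        var acceleration: Double   // m/s^2 (negative = braking)
    }

    struct SafeEnvelope {
        let maxSteering: Double
        let maxAccel: Double
        let maxDecel: Double

        /// Whatever the network proposes, the returned command always lies
        /// inside the envelope.
        func clamp(_ proposed: Control) -> Control {
            Control(
                steeringAngle: min(max(proposed.steeringAngle, -maxSteering), maxSteering),
                acceleration: min(max(proposed.acceleration, -maxDecel), maxAccel)
            )
        }
    }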
I haven't read anything about this, but I would also suppose long-distance human intervention cannot work for truly critical situations where you need a very quick reaction, whereas it would be more appropriate in situations where the car has stopped and is stuck, not knowing what to do. Probably just stating the obvious here, but indeed this seems very different from an RC-car kind of situation.
It’s not for that. It’s for things like: the car drove into a protest area and people are surrounding it. Or the police blocked off an intersection and the car is temporarily stuck, with people making otherwise illegal U-turns or driving the wrong way down a one-way road to get out of it.
Several of the config files in Xcode projects are not publicly documented, as far as I remember, and personally I have preferred to tell my agents not to modify them, out of fear they might break something that would be hard to fix. I don't know how agentic programming will work in Xcode, but I would expect it to handle those files in a safer way, so that's another case where it might have an advantage.
Your workflow looks very interesting, especially the describe_ui part. Are you already able to do this today?
I've been using Tuist plus Codex CLI and VS Code, only using Xcode for running and debugging. I would love to get rid of Xcode entirely. The Tuist-plus-Xcode project is a tiny shim; the rest of the app is in SPM packages.
Tuist generates the Xcode project from a simple Swift configuration file similar to Package.swift. I don't know why Apple can't just throw away the proprietary and ugly Xcode project files and provide a sane build system that would work in any IDE, on the command line, in CI, and now with AI agents.
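For anyone who hasn't seen one, this is roughly what such a Tuist manifest (Project.swift) looks like; the exact ProjectDescription API changes between Tuist versions, so treat it as a sketch rather than something to copy verbatim:

    import ProjectDescription

    // Illustrative only: names, bundle id, and paths are placeholders, and the
    // Target initializer differs between Tuist major versions.
    let project = Project(
        name: "MyApp",
        targets: [
            Target(
                name: "MyApp",
                platform: .iOS,
                product: .app,
                bundleId: "com.example.myapp",
                sources: ["Sources/**"],
                dependencies: [
                    // The shim target just depends on SPM packages where the
                    // actual app code lives.
                    .package(product: "MyFeature")
                ]
            )
        ]
    )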
They could open up the Instruments format while they're at it, for the same reason. Do they really gain anything by making it proprietary?
It’s a cloud service, so you can call out to it from anywhere you want. Just don’t ship your credentials in the app itself, and instead authenticate via a server you control.
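A minimal sketch of that pattern (the URL, endpoint, and header names are placeholders, not a real API): the app authenticates to a backend you control with the user's own session token, and the backend holds the actual API key and forwards the request to the cloud service.

    import Foundation

    // The app never embeds the provider's API key; it only talks to your backend.
    func callModelViaBackend(prompt: String, sessionToken: String) async throws -> Data {
        var request = URLRequest(url: URL(string: "https://api.example.com/v1/generate")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(sessionToken)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONSerialization.data(withJSONObject: ["prompt": prompt])
        // The backend attaches its own API key server-side before calling the
        // actual model provider, so no secret ever ships inside the app binary.
        let (data, _) = try await URLSession.shared.data(for: request)
        return data
    }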
The thing is that, on macOS at least, Codex does have the ability to use an actual sandbox, which I believe prevents certain write operations and network access.
I don't know about Claude Code, but in GitHub Copilot, as far as I can tell, the subagents are always the same model as the main one you are using. They also need to be started manually by the main agent in many cases, whereas maybe the parent comment was referring to calling them more deterministically?
Copilot is garbage; even the MSFT employees I know all use Claude Code. The only useful thing is that you can route Claude Code to use the models in the Copilot subscription, which their corp gets as part of their M365 deal.
My theory, based on what I used to see with non-thinking models, is that as soon as you start detailing something too much (i.e. not just "speak in the style of X" but more like "speak in the style of X with [a list of adjectives detailing the style of X]"), they would lose creativity, not fit the style very well anymore, etc.
I don't know how things have evolved with newer training techniques, but I suspected that making them overthink the task, by spelling out in too much detail what they have to do, could lower quality for creative tasks in some models.
A hybrid approach could maybe work: have a more or less standard game engine for coherence, and use this kind of generative AI mostly as a short-term rendering and physics-simulation engine.
I've thought about this same idea but it probably gets very complicated.
Let's say you simulate a long museum hallway with some vases in it. Who holds what? The basic game engine has the geometry, but once the player pushes a vase and moves it, the AI needs to inform the engine that it did; then, to draw the next frame, it has to read from the engine first, update the position in the video feed, and feed that back to the engine again.
What happens if the state diverges? Who wins? If the AI wins, then... why have the engine at all?
It is possible, but then who controls physics? The engine, or the AI? The AI could have a different understanding of the details of the vase. What happens if the vase has water inside? Who simulates that? What happens if the AI decides to break the vase? Who simulates that?
I don't doubt that some sort of scratchpad to keep track of in-game state would be useful, but I suspect the researchers are expecting the AI to keep track of everything in its own "head", because that's the most flexible solution.
Then maybe the engine should be less about truly simulating the 3D world and more about preserving consistency: providing memory and saving context rather than simulating much beyond higher-level concerns (at which point we might wonder whether it couldn't directly be part of the model somehow). But writing these lines I realize there would probably still be many edge cases exactly like what you are describing...
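Just to illustrate what I mean by "providing memory", a minimal sketch (all names made up, obviously not a real system): a small ledger holds authoritative object state, the generative model reads it as conditioning before each frame and writes back whatever it changed.

    // The model stays the source of truth for *changes* (it decides the vase was
    // knocked over), while the ledger is the source of truth for *memory* (the
    // next frame is conditioned on what is stored here, so off-screen objects
    // stay where they were left).
    struct ObjectState {
        var position: SIMD3<Float>
        var isBroken: Bool
    }

    final class ConsistencyLedger {
        private var objects: [String: ObjectState] = [:]

        /// Read the last known state of an object before rendering a frame.
        func read(_ id: String) -> ObjectState? { objects[id] }

        /// Write back whatever the generative model changed in the last frame.
        func commit(_ id: String, _ state: ObjectState) { objects[id] = state }

        /// Serialize the whole ledger into conditioning context for the model.
        func conditioningContext() -> [String: ObjectState] { objects }
    }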