1. The agent can create its own tools and save them to memory
2. You create a SQL (and web app?) workbench per agent run
3. Grok's performance fell off a cliff in the last month. Was this consistent over multiple runs?
4. Agents have a hard time backtracking. Would unwinding both the system state and the agent's context make backtracking easier? (That would be harder to implement, though.)
5. Since each new month only uses the final state from the previous month, the agent has no way to understand why an error occurred in the previous month
Cool experiment! Was it difficult building the observable SQL workbench? And how many humans-in-the-loop did you have?
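
On point 4, here is a rough sketch of what I mean by unwinding system state and agent context together. I don't know how your workbench is built, so this assumes a SQLite database and a plain list of messages as the agent context. `Checkpointer`, the `ledger` table, and the checkpoint names are all made up for illustration.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Checkpointer:
    """Pairs a SQLite savepoint with a snapshot of the agent's messages.

    Hypothetical helper: assumes the system state lives in one SQLite
    connection and the agent context is a list of strings.
    """
    conn: sqlite3.Connection
    _stack: list = field(default_factory=list)

    def checkpoint(self, name: str, messages: list) -> None:
        # Savepoint names cannot be bound as SQL parameters, so validate them.
        if not name.isidentifier():
            raise ValueError(f"invalid checkpoint name: {name!r}")
        self.conn.execute(f"SAVEPOINT {name}")
        self._stack.append((name, list(messages)))  # copy, not a reference

    def rollback(self, name: str) -> list:
        names = [n for n, _ in self._stack]
        if name not in names:
            raise KeyError(name)
        # Use the most recent checkpoint with this name, as SQLite does.
        idx = len(names) - 1 - names[::-1].index(name)
        self.conn.execute(f"ROLLBACK TO SAVEPOINT {name}")
        # Later checkpoints are discarded; the target one stays usable.
        del self._stack[idx + 1:]
        return list(self._stack[idx][1])


# isolation_level=None leaves transaction control to the explicit
# SAVEPOINT / ROLLBACK TO statements above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ledger (month TEXT, amount INTEGER)")

cp = Checkpointer(conn)
messages = ["system: reconcile the ledger"]
cp.checkpoint("before_march", messages)

# The agent makes a mistake in both the database and its own context.
conn.execute("INSERT INTO ledger VALUES ('2024-03', 999)")
messages.append("agent: posted a wrong entry")

# Backtrack: both the table and the context return to the checkpoint.
messages = cp.rollback("before_march")
row_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

After the rollback, `row_count` is 0 and `messages` holds only the original system message. The hard part in a real setup is probably everything outside the database (files, external API calls), which a savepoint cannot undo.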