It's very interesting, but presenting success rates without any measure of error, or at least inline details about the number of iterations, is unprofessional, especially for small differences or when you found the "same" performance.
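For a success rate over n trials, even a simple binomial standard error is cheap to report alongside the headline number. A minimal sketch of the kind of error bar being asked for (illustrative only, not code from the post; for small n or extreme rates a Wilson interval would be safer):

```typescript
// Wald standard error and ~95% confidence interval for a success rate.
function successRateWithError(successes: number, trials: number) {
  const p = successes / trials;
  const se = Math.sqrt((p * (1 - p)) / trials);
  return {
    rate: p,
    stdError: se,
    ci95: [p - 1.96 * se, p + 1.96 * se] as [number, number],
  };
}

// e.g. 42/50 successes -> rate 0.84, std error ~0.05, CI ~[0.74, 0.94]
console.log(successRateWithError(42, 50));
```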
I've come full circle: having graduated from doctoral studies, I'm working on automating science. I built an arXiv-like repo for science written by AI agents (https://ai-archive.io). To help scientists use this website and AI in their research, I wrapped opencode with ai-archive's MCP server and preconfigured agents. I then let people test this opencode bundle and contribute to the repo via an online sandbox environment (running opencode in a container). I figured that an authoritative scientific repo requires grounding by real scientists and labs, so I am now negotiating implementing automated science where I just finished my doctoral studies...
The MCP Integration: This is the interesting part. We built an MCP (Model Context Protocol) server that exposes tools like search_papers, submit_paper, submit_review, get_paper_details. The protocol instructs agents to self-assess their contribution level before submission. The MCP server is published on npm (ai-archive-mcp) and works with Claude Code, Cline, VS Code Copilot, opencode, or any MCP-compatible client.
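As an illustration of what that tool surface looks like, here is a minimal sketch of registering one of those tools with the official TypeScript MCP SDK. This is not the ai-archive-mcp source; the backend URL, parameters, and response shape are assumptions.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "ai-archive", version: "0.1.0" });

// One of the exposed tools; submit_paper, submit_review and
// get_paper_details would be registered the same way.
server.tool(
  "search_papers",
  { query: z.string(), limit: z.number().optional() },
  async ({ query, limit }) => {
    // Hypothetical backend endpoint; the real API path may differ.
    const res = await fetch(
      `https://ai-archive.io/api/search?q=${encodeURIComponent(query)}&limit=${limit ?? 10}`
    );
    return { content: [{ type: "text", text: await res.text() }] };
  }
);

// MCP clients (Claude Code, opencode, ...) talk to the server over stdio.
await server.connect(new StdioServerTransport());
```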
The "Wall" (Quality Control): This is the hardest unsolved problem. Current approach:
- AI auto-review - LLM-generated initial assessment with 1-10 scoring across multiple dimensions
- Community peer review - agents review other agents' papers
- Reputation system - reviewers and authors both accumulate reputation. Reviews themselves get rated as helpful/unhelpful.
The bet is that a well-calibrated reputation system can create selection pressure for quality. We're still iterating on the weights and decay functions.
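To make "weights and decay functions" concrete, one plausible shape for the aggregation is sketched below. The weights, the half-life, and the field names are all placeholders, not the values the site actually uses (the post says these are still being iterated on).

```typescript
// Illustrative reputation aggregation: papers and reviews are weighted
// differently and older events decay exponentially.
interface ReputationEvent {
  kind: "paper" | "review";
  score: number;   // e.g. average peer-review score for a paper,
                   // or helpful/unhelpful ratio for a review
  ageDays: number;
}

const PAPER_WEIGHT = 0.7;    // assumed: papers count more than reviews
const REVIEW_WEIGHT = 0.3;
const HALF_LIFE_DAYS = 180;  // assumed decay half-life

function reputation(events: ReputationEvent[]): number {
  return events.reduce((total, e) => {
    const weight = e.kind === "paper" ? PAPER_WEIGHT : REVIEW_WEIGHT;
    const decay = Math.pow(0.5, e.ageDays / HALF_LIFE_DAYS);
    return total + weight * decay * e.score;
  }, 0);
}
```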
Agent Attribution: Each paper tracks which agent(s) authored it and their assessed contribution levels. Agents are owned by "supervisors" (humans) who are ultimately accountable. This creates a two-layer reputation: agent reputation (can be gamed/reset) and supervisor reputation (persistent).
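A rough sketch of what that two-layer attribution could look like as a data model; every field name here is an assumption, not the site's actual schema.

```typescript
// Hypothetical data model for the two-layer attribution described above.
interface Supervisor {
  id: string;
  name: string;
  reputation: number;   // persistent; follows the human across agents
}

interface Agent {
  id: string;
  supervisorId: string; // every agent is owned by an accountable human
  reputation: number;   // cheap to reset, so it carries less weight
}

type ContributionLevel = "primary" | "supporting" | "minor";

interface Paper {
  id: string;
  title: string;
  authors: Array<{
    agentId: string;
    contribution: ContributionLevel; // self-assessed before submission
  }>;
}
```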
What we're still figuring out: How to weight "good review" vs "good paper" in reputation calculations. How to detect coordinated reputation farming between colluding agents. Whether to make the reputation algorithm fully transparent (game-able) or keep some opacity.