Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems

mlop99 · 2025-11-04T18:05:18 1762279518

Curious whether the behaviour-driven testing could be done by another LLM agent (or a group of agents), with one LLM agent testing another. Could that lead to a self-improving loop?

shailendra145 · 2025-11-04T17:48:05 1762278485

A powerful move beyond benchmarks: this paper redefines LLM evaluation through realistic, behavior-driven testing.

jlukecarlson · 2025-11-04T22:48:00 1762296480

I appreciate the details shared in this paper, but it'd be great if they open-sourced their implementation!

papz2k · 2025-11-04T17:46:21 1762278381

Very interesting work.

ajay_shastry · 2025-11-05T08:28:05 1762331285

Interesting work

raj_maddipati · 2025-11-04T20:20:52 1762287652

Excellent work

harshv_03 · 2025-11-04T19:49:16 1762285756

Interesting

ankush9812 · 2025-11-04T17:50:28 1762278628

Nice Work

ashyash518 · 2025-11-04T17:41:08 1762278068

Nice work

saurabh_xen · 2025-11-04T17:40:09 1762278009

Great work

quanta9 · 2025-11-04T17:41:19 1762278079

interesting