This is an absolutely incredible technical deep-dive. The section on
replacing etcd with mem_etcd resonates with challenges we've been tackling
at a much smaller scale building an AI agent system.
A few thoughts:
*On watch streams and caching*: Your observation about the B-Tree vs
hashmap cache tradeoff is fascinating. We hit similar contention issues
with our agent's context manager - switched from a simple dict to a more
complex indexed structure for faster "list all relevant context" queries,
but update performance suffered. The lesson is universal: trading O(1) writes
for O(log n) ordered reads is the wrong deal for a high-write workload.
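For concreteness, here's a stripped-down sketch of the tradeoff we ran into (class and key names are made up for illustration, and it uses sortedcontainers rather than our actual index):

```python
# Illustrative sketch of the dict -> ordered-index tradeoff.
# Requires: pip install sortedcontainers
from sortedcontainers import SortedDict

class ContextStore:
    """Keys are namespaced like 'task/123/step/7'; values are context blobs."""

    def __init__(self):
        # SortedDict keeps keys ordered: range scans ("list all context under
        # a prefix") become cheap, but every write pays O(log n) instead of
        # the plain dict's O(1) -- exactly the tradeoff that bit us.
        self._items = SortedDict()

    def put(self, key: str, value: dict) -> None:
        self._items[key] = value          # O(log n) insert

    def list_prefix(self, prefix: str) -> list:
        # Range scan over the prefix; '\xff' is a sentinel that sorts after
        # any printable continuation of the prefix.
        return [(k, self._items[k])
                for k in self._items.irange(prefix, prefix + "\xff")]

store = ContextStore()
store.put("task/1/step/1", {"note": "planned"})
store.put("task/1/step/2", {"note": "executed"})
store.put("task/2/step/1", {"note": "other task"})
print(store.list_prefix("task/1/"))       # only the task/1 entries
```

The range scan is what we wanted; the O(log n) on every put is what we paid for it.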
*On optimistic concurrency for scheduling*: The scatter-gather scheduler
design is elegant. We use a similar pattern for our dual-agent system
(TARS planner + CASE executor) where both agents operate semi-independently
but need coordination. Your point about "presuming no conflicts, but
handling them when they occur" is exactly what we learned - pessimistic
locking kills throughput far worse than occasional retries.
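In case it's useful, the shape of that pattern on our side is roughly the toy sketch below (hypothetical names, not our real coordination code):

```python
# Minimal sketch of optimistic concurrency between two semi-independent
# agents: read a versioned snapshot, compute, and apply only if nothing
# changed in the meantime; retry on conflict.
import threading

class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()     # only guards the CAS itself
        self._value, self._version = {}, 0

    def read(self):
        with self._lock:
            return dict(self._value), self._version

    def compare_and_swap(self, expected_version, new_value):
        """Apply the write only if nobody else wrote since we read."""
        with self._lock:
            if self._version != expected_version:
                return False              # conflict: caller retries
            self._value, self._version = new_value, self._version + 1
            return True

def update_with_retry(store, mutate, max_retries=5):
    # Presume no conflict; re-read and retry only when one actually happens.
    for _ in range(max_retries):
        snapshot, version = store.read()
        if store.compare_and_swap(version, mutate(snapshot)):
            return True
    return False

store = VersionedStore()
update_with_retry(store, lambda s: {**s, "plan": "refactor models.py"})
```

The point is that the lock only guards the version check itself; a conflict costs a re-read and a retry instead of serializing every update behind a long-held lock.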
*The spicy take on durability*: "Most clusters don't need etcd's
reliability" is provocative but I suspect correct for many use cases.
For our Django development agent, we keep execution history in SQLite with
WAL mode (no fsync), betting that host crashes are rare enough that rebuilding
from Git beats paying for an fsync on every write. Similar philosophy.
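Concretely, the setup is just a couple of pragmas (simplified; the table is illustrative):

```python
# Execution-history store with durability deliberately relaxed.
import sqlite3

conn = sqlite3.connect("agent_history.db")
conn.execute("PRAGMA journal_mode=WAL;")   # readers don't block the writer
conn.execute("PRAGMA synchronous=OFF;")    # skip fsync: an OS/host crash can
                                           # lose recent writes, which we accept
                                           # because state is rebuildable from Git
conn.execute("""
    CREATE TABLE IF NOT EXISTS execution_history (
        id INTEGER PRIMARY KEY,
        command TEXT NOT NULL,
        result TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("INSERT INTO execution_history (command, result) VALUES (?, ?)",
             ("python manage.py migrate", "ok"))
conn.commit()
```

With synchronous=OFF the guarantee is only "survives an application crash, not necessarily a host crash", which is exactly the bet we're making.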
The mem_etcd implementation in Rust is particularly interesting - curious
if you considered using FoundationDB's storage engine or something similar
vs rolling your own? The per-prefix file approach is clever for reducing
write amplification.
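(For what it's worth, my mental model of the per-prefix idea is something like the toy sketch below: one append-only log per top-level key prefix, so a hot prefix only ever touches its own file. This is entirely my own guess at the layout, not taken from the post.)

```python
# Toy sketch of a per-prefix append-only layout (my interpretation only).
import json, os

class PerPrefixLog:
    def __init__(self, root="kv_data"):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        prefix = key.split("/", 1)[0]            # e.g. "pods", "services"
        return os.path.join(self.root, f"{prefix}.log")

    def append(self, key, value):
        # An append touches only the file for this key's prefix, which is
        # where the write-amplification win would come from as I understand it.
        with open(self._path(key), "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")

log = PerPrefixLog()
log.append("pods/default/web-1", {"phase": "Running"})
log.append("services/default/web", {"clusterIP": "10.0.0.5"})
```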
Fantastic work - this kind of empirical systems research is exactly what
the community needs more of. The "what are the REAL limits" approach vs
"conventional wisdom says X" is refreshing.