Interesting. The separate compute and storage tiers make this another system going in that direction, which I think is becoming almost the standard at this point, especially for "cloud-native" systems designed to run on k8s. From what I can tell (the post isn't very explicit on this point), they are avoiding distributed consensus at the storage layer and instead relying on a single-writer/multiple-reader model, with the single writer enforced by the assignment of tablets in the compute tier, and with each tablet responsible for writing to multiple storage nodes for durability. (But I might be wrong.)

Assuming that's right, this approach is, I think, underutilized, and it's pretty similar to how Apache Pulsar works (my day job), but I'm not sure how many distributed RDBMSs have tried it; it will be cool to see how it evolves! It isn't clear how they ensure that a tablet is assigned to only a single compute node, but I think that's an easier problem than distributed consensus at the storage tier.
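
To make the single-writer idea concrete, here's a minimal sketch (Go; all names invented, this is not YDB's actual API) of the usual fencing trick: whoever is assigned the tablet gets a monotonically increasing generation number, and storage nodes reject writes from any older generation:

    // Hypothetical sketch: a storage node accepts writes only from the
    // highest tablet generation it has seen, fencing out a stale leader.
    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    var ErrStaleWriter = errors.New("write rejected: stale generation")

    type StorageNode struct {
        mu      sync.Mutex
        maxGen  uint64
        entries []string
    }

    func (s *StorageNode) Write(gen uint64, data string) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        if gen < s.maxGen {
            return ErrStaleWriter // a newer owner has already written
        }
        s.maxGen = gen
        s.entries = append(s.entries, data)
        return nil
    }

    func main() {
        node := &StorageNode{}
        _ = node.Write(2, "from the new owner")           // tablet reassigned, generation 2
        fmt.Println(node.Write(1, "from the old owner"))  // write rejected: stale generation
    }

If the tablet gets reassigned, the new owner comes up with a higher generation, so any in-flight writes from the old owner fail instead of racing with it.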



Each tablet gathers a quorum of answers from the members of a so-called BlobStorage group. A BlobStorage group is a set of so-called VDisks (virtual disks); all the VDisks in a group run on different nodes (even in different fail domains, like racks or AZs). A VDisk stores its data on a physical device, i.e. a PDisk.
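
As a toy illustration of the quorum part (Go; a simple majority is assumed for brevity, whereas real BlobStorage groups use configurable erasure/replication schemes that this glosses over):

    // Toy illustration: fan a write out to all VDisks in a group and
    // return once a majority has acknowledged it.
    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    func writeQuorum(vdisks int, blob string) {
        acks := make(chan struct{}, vdisks)
        for i := 0; i < vdisks; i++ {
            go func() {
                // Simulate a VDisk persisting the blob to its PDisk;
                // in real life some replies are slow or never arrive.
                time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
                acks <- struct{}{}
            }()
        }
        for got := 0; got < vdisks/2+1; got++ {
            <-acks // wait for a majority of acknowledgements
        }
    }

    func main() {
        writeQuorum(5, "tablet log entry")
        fmt.Println("write acknowledged by a majority of VDisks")
    }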


From my past experience, Datomic uses the same approach, i.e. multiple reader nodes and a single transactor node. However, it's much more locked into AWS, as it uses DynamoDB and S3 for the backing storage (maybe others as well?).


Assigning leaders is trivial with something like ZooKeeper. But in this case it appears that the leader metadata is stored in a table of the database itself, which raises questions about operability if those tablets are unavailable.
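
For what it's worth, a leader-in-a-table design usually reduces to a conditional update (compare-and-swap) on a metadata row; a rough sketch of the shape of it (Go; the row layout is invented, not YDB's actual schema):

    // Hypothetical sketch: claiming tablet leadership with a compare-and-swap
    // on a metadata row, so two racing nodes cannot both win.
    package main

    import (
        "fmt"
        "sync"
    )

    type leaderRow struct {
        owner string
        epoch uint64
    }

    type metaTable struct {
        mu   sync.Mutex
        rows map[string]leaderRow // tablet id -> current leader row
    }

    // claim succeeds only if the stored epoch is still the one the caller read.
    func (t *metaTable) claim(tablet, node string, seenEpoch uint64) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        if t.rows[tablet].epoch != seenEpoch {
            return false
        }
        t.rows[tablet] = leaderRow{owner: node, epoch: seenEpoch + 1}
        return true
    }

    func main() {
        t := &metaTable{rows: map[string]leaderRow{}}
        fmt.Println(t.claim("tablet-7", "node-a", 0)) // true
        fmt.Println(t.claim("tablet-7", "node-b", 0)) // false: epoch already advanced
    }

Of course, that just pushes the question to the availability of the metadata table itself, which is exactly the operability concern above.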


YDB doesn't use ZooKeeper. The system is built out of tablets, and every tablet implements a distributed consensus algorithm. There are different types of tablets in the system; for example, SchemeShard is a tablet that stores metadata, such as table schemas, while DataShard stores the data of a table partition.


Which consensus algorithm? I don't see Raft in the codebase.


We have a consensus protocol over a shared log. You can find some details in a PowerPoint presentation on this page: https://2019.hydraconf.com/2019/talks/muxomfgqembsb3st7i3ci/
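
For readers unfamiliar with the idea: once entries have a single total order in a durable shared log, every replica that replays the log deterministically reaches the same state, which sidesteps per-entry voting. A toy model of just that property (Go; not the actual YDB protocol):

    // Toy model: agreement comes from the total order of a durable shared
    // log; every replica that replays it reaches the same state.
    package main

    import "fmt"

    type sharedLog struct {
        entries []string
    }

    func (l *sharedLog) append(cmd string) {
        l.entries = append(l.entries, cmd)
    }

    // A replica applies log entries strictly in order, tracking its position.
    type replica struct {
        applied int
        state   []string
    }

    func (r *replica) catchUp(l *sharedLog) {
        for r.applied < len(l.entries) {
            r.state = append(r.state, l.entries[r.applied])
            r.applied++
        }
    }

    func main() {
        l := &sharedLog{}
        l.append("set x=1")
        l.append("set x=2")
        a, b := &replica{}, &replica{}
        a.catchUp(l)
        b.catchUp(l)
        fmt.Println(a.state) // [set x=1 set x=2]
        fmt.Println(b.state) // identical
    }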



