This is a very similar to one of my projects "Got". https://github.com/gotvc/got...

kamranjon · on Jan 19, 2023

"The algorithms it uses are superior to rsync and git in a few ways." - Would you mind explaining this?

brendoncarroll · on Jan 19, 2023

Sure, Git stores data in a trie. Each file is one blob identified by hash, and directories (called trees in Git) are blobs where each line is a directory entry with a name and the hash of a file or another tree. This means that modifying an object /down/a/long/path/like/this.txt has to create copies of all the trees on the way up. The technical term for this is "write amplification", and in Git it is affected by path length among other things.

Got stores data in a probabilistic tree (GotKV[0]). The number of nodes before you get to data will scale logarithmically with the size of the entire filesystem, not the depth of a specific object.

Then there is the issue of large files. A file in Git is always 1 blob. Syncing a large blob is not easy because if you are interrupted and have to restart, you have lost all your progress. You can't verify the hash of a blob until you have the whole thing. Got has a maximum blob size, so you'll only be buffering <2MB at a time before you can verify that the blob is correct. If a transfer is interrupted, the most you'll have to repeat is one blobs worth, plus any tree nodes above that blob.

Compared to rsync, Got uses variable size chunks and a faster content defined chunking algorithm, recently featured here on HN[1]. I haven't thought about if variable vs fixed chunks is better for file transfer, but for version control, the higher chance of convergence is important. It means you have better deduplication.

[0] https://github.com/gotvc/got/tree/master/pkg/gotkv

[1] https://news.ycombinator.com/item?id=34303497

Zamicol · on Jan 19, 2023

I've not heard the term "probabilistic tree" and I've having difficulty pulling up references. I suspect it's implemented by subpackage ptree[0]. Could you explain what makes probabilistic trees different from hash tables or other similar data structures?

[0] https://github.com/gotvc/got/tree/master/pkg/gotkv/ptree

brendoncarroll · on Jan 19, 2023

Yep, your link is indeed to the probabilistic tree used in GotKV.

Here "probabilistic" just refers to a way of balancing a tree. Rather than having a set of rules to keep the tree balanced, like with a btree or red-black tree, balancing decisions are made pseudorandomly. The result is that the tree is very likely to be balanced, and is unlikely to be unbalanced.

In the case of GotKV's tree: the entries are stored together in a stream, and for each entry a hash is computed. If that hash is lower than a certain value then the entry is considered a split point, and a tree node is created. So now we have a stream of entries, divided probabilistically into sections. Each section is a tree node. Now take references to those nodes and turn them into entries, and repeat the process, so you have fewer nodes. That continues until you have one node, which is the root. This technique is very similar to content defined chunking, and some probabilistic trees are implemented using content defined chunking on their record format, rather than a pseudorandom value calculated per entry, as in GotKV.

For those unfamiliar with probabilistic data structures, I highly recommend trying to understand skip-lists first. At least why they are balanced.

https://en.wikipedia.org/wiki/Skip_list

As an aside, one of the neat things about GotKV is that the keys are delta-encoded. Adding or removing a prefix from every key in the tree is a constant time operation. This might be obvious to some of the database folks out there, but it's a fun mind-blower if you haven't encountered the technique before.

Zamicol · on Jan 19, 2023

So it is a synonym for a hash tree? https://en.wikipedia.org/wiki/Hash_tree_(persistent_data_str...

vnorilo · on Jan 19, 2023

I was thinking of treaps;

https://en.m.wikipedia.org/wiki/Treap

hinkley · on Jan 19, 2023

Some, but not all, treaps have a node weight that is updated in a probabilistic fashion. The act of balancing the tree is still deterministic, but the weights of each node are randomized.

I keep trying to find a use for treaps, but haven't had a project that needed it. In particular, the value of a balanced tree is in consistent cost of lookups for arbitrary elements. But if you are mixing entries that are accessed often with those that are not, having an 8:1 access time ratio between the two would be a feature not a bug.

vnorilo · on Jan 20, 2023

I used a persistent treap for a lock free priority queue (swap in a new root at insertion). It felt nice but to be honest, didn't do a comprehensive comparison to alternate implementation strategies.

edit: looks like at least one other has has the same idea https://github.com/lthibault/treap

toyg · on Jan 19, 2023

This is the awesome content I still come to HN for. Thanks for the post and good luck with Got.

dperalta · on Jan 19, 2023

Thank you for such a detailed response!

beagle3 · on Jan 19, 2023

FYI: "got" is also the name of the OpenBSD re-implementation of git (supposedly, an acronym, "game of trees"): https://gameoftrees.org/

red-iron-pine · on Jan 19, 2023

in the game of trees you either live, or branch

zdgeier · on Jan 19, 2023

Ah awesome! I'll check this out

zdgeier · on Jan 19, 2023

Took a look and this is really cool. Will definitely keep this is mind. One major difference and area I'm focused on in this project is hosting. I would argue that there are currently much better VCS options for projects than Git (like yours and Fossil) but the reason these haven't taken off is that Github and Gitlab offer unmatched hosting and collaboration tools for these projects. Would be curious to know your thoughts/approach to this!

brendoncarroll · on Jan 19, 2023

My approach to hosting with Got has been to make it easy and secure for users to host from any machine.

INET256 solves that problem nicely. If you have access to an INET256 network, then all you have to do is swap addresses and two Got instances can communicate.

https://github.com/inet256/inet256

Also, end-to-end encryption is table stakes. Any data that leaves the user needs to be encrypted in transit, and if it hangs around away from the user, at rest.