Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I was just talking about this problem yesterday (only to say that it is a problem. I don’t have any solutions).

I was thinking in the context of a pijul/categorical theory of patches model rather than a CRDT that resolves conflicts deterministically (in the pijul model all merges are successful and well-behaved in that the order of mergin doesn’t matter but your data structure may contain representations of merge conflicts).

Fundamentally, I think the problem is that for a good AST model, you want to be able to handle transformations of the following forms:

  e <-> (e)
  if(e) {f; g} <- if(e) f -> if(e) {h; f}
  a (b c) d <-> a b c d
  (a b) (c d) <-> (a b c d)
The second line is meant to ask: how would you merge people independently making the middle-to-left and middle-to-right changes with your CRDT. I would like to point out particularly the changes of nesting depth and the merging and splitting of subsequences.

Let me briefly describe the pijul/CToP model. You start with a simple model of files: each file is a totally ordered set of lines, and a model of patches: a patch is an injection taking the lines of one file to equal lines in another file (i.e. it isn’t changing the contents of the line) and furthermore it is increasing which means the relative order is maintained. New lines are points not in the image of the function and deleted lines are represented by pseudo-lines in the new file with a “deleted” tag. This then forms a category with possible files being the objects and patches the morphisms. You then do some magic to figure out how to add all push outs to the category. Push outs correspond to merges (more formally, the push out of the span B <-f- A -g-> C is an object P with morphisms p: B->P, q: C->P such that the compositions fp and gq are equal and the universal property that for any other object Q and morphisms rs satisfying the same properties there is a unique morphism u: P->Q such that fpu = fr = gs = gqu, which in our case means that the merge is fair in the sense that it hasn’t accidentally resolved a conflict that was ambiguous). The result is a model where files are partially ordered sets of lines and patches are just (strictly) increasing functions. If two lines do not have an order relative to each other, that is the representation of a merge conflict.

I don’t have a good model for how that sort of thinking could extend to ASTs. Possibly an OT-in-CRDT model could cope with Emacs paredit style transformations of the tree better than this. I’m also unconvinced that solving tables would solve ASTs or vice versa. The constraints on tables feel different to me (no nesting but global constraints on the length of each row or column. Maybe allowing irregular tables where some cells are merged makes it harder). And a regular table feels like it could better fit into a set-with-extra-structure model (like total orders for each row and column or something somehow like that).

Anyway, I’ll be curious to see if someone ever comes up with a really good model to this and, if so, how heuristic-heavy it needs to be (either in the merging or the choice of diff representation) to get good results.



I've thought a little bit about those when working on Pijul, there are multiple issues, mostly at the interface between the technical and the human:

- There is no good diff for trees. Some formulations of the problem are NP-complete, meaning that patches won't be minimal, and hence users will get unexpected conflicts.

- Most VCS users have learned to be afraid of their favourite tool, and treat it like something extremely fragile that can break at any time for unknown reasons. Words like "porcelain" show that feeling is even present in the tool's authors themselves. For that reason, the more layers and heuristics one adds on top of basic tools, the scarier it becomes (especially for experienced users).

Writing a VCS is hard and takes way more time than most people expect. Few applications share the stateful nature of VCSs, one needs to feel the pain of designing and implementing databases and/or filesystems in order to understand these systems, where any bug can mean the state becomes corrupt, data is lost, etc. which is the very thing the system you're writing is meant to prevent. So, that kind of effort will only make sense when there are enough users to justify it and support the development, in terms of both man-hours and money.


You didn’t mention the lack of a good theory for tree diffs/patches here. Is that because you think there is a good theory or because you think not having a theory doesn’t matter? Or just that it wasn’t worth mentioning?

It strikes me that another big difference between dealing with conflicts in DVCSes and in the kind of collaborative editing CRDTs people seem to be mostly thinking about in the article and the comments section here, is that merges happen on the scale of seconds in the latter case and potentially much longer in the former, and the cost of a bad merge probably increases with its age.

I’m very sympathetic to your point about people not seeing the complexity in VCSs. I wonder if it is that they are frequently used tools that seem to be simple, though I don’t see that behaviour with operating systems. Perhaps they are considered too sacred or something. I think most programmers aren’t stressing enough file systems enough to find that they are sometimes buggy and hard to deal with accurately and efficiently, and if pressed might mumble something about calling fsync. But maybe most people don’t think it is easy and only those people who do write their opinions on the matter down.


> You didn’t mention the lack of a good theory for tree diffs/patches here. Is that because you think there is a good theory or because you think not having a theory doesn’t matter? Or just that it wasn’t worth mentioning?

I'm convinced there is a good theory. Moreover, since line graphs (i.e. totally ordered bytes in a file) are just a particular case of trees, that theory could even be implemented in the same way Pijul is now, where the "easy" case just means blobs linked together.

But I'm also convinced that there are so many edge cases that I wouldn't want to even start considering that before Pijul sees some real-world usage at a decent scale. If everybody misses the point and ends up preferring Git, then why even bother?

> though I don’t see that behaviour with operating systems. Perhaps they are considered too sacred or something.

I guess everybody sees how they would start to write a VCS: just read files, maybe write diff, imagine some datastructures to store that, fix the bugs and problems, that's it. Fewer people can see how to write operating systems, and even people who follow hobby tutorials immediately realise the complexity of handling all the hardware.

> I think most programmers aren’t stressing enough file systems enough to find that they are sometimes buggy and hard to deal with accurately and efficiently, and if pressed might mumble something about calling fsync.

That's right, but on the other hand if you look at the "massively parallel, structured filesystems", called databases, the space is tiny:

- There's a multi-billion-dollars company (Oracle) built at a time where nobody wanted to write fast databases.

- An academic project (Postgres) which grew extraordinarily slowly, and somehow managed to survive as a research platform, funded by public funds.

- Two single-author projects led by extraordinarily motivated people, one of which isn't meant to support concurrency (SQLite), and MySQL/MariaDB, built because there was no affordable, non-experimental database solution.

- Recent years have seen some progress funded by giant companies like Google for their cloud infrastructure, but we're not yet seeing many CRDTs in production.

The same goes for theory-based programming languages: both Haskell and OCaml have barely survived for 20 years before being used at massive scales. Rust has had a slightly easier time, but not much easier, despite being supported by Mozilla.

Yet, most engineers would consider databases "basic" technology, because they use them a lot, and (quite fortunately) don't need to implement them to build on top of them.


Emacs has this in paredit. A new tree edit mode is a fun exploration in non lisp sources.


Paredit give you multiple ways to achieve one thing. Maybe it is fine for real-time collaborative editing but for a VC-type setup it seems to me that the challenge would be in determining the correct list of operations for a given diff so that the results are unsurprising. Consider how often something simple like standard diffs are wrong (where a new function adds the end to one function and the beginning to the following function as the functions all have a similar ending)


Have you looked at tree edit? I thought it could do this, to an extent.

That all said, I am pretty bullish against this idea, at large. For many reasons, not least that I can't remember that many times it makes this mistake. I noticed it when it happens, but it doesn't actually happen that often.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: