
No, I don’t agree.

The problem with utf8 byte offsets is that they create a data validation problem. In diamond types I’m using document positions / offsets in my wire format. With utf8 byte offsets, you can receive changes from remote peers which name invalid insertion positions (i.e. an insert inside a character, or deleting half of a codepoint). Validating remote changes like this is a nightmare, because you need to reconstruct the whole document state to be able to tell whether the edit is valid. Using Unicode codepoints makes invalid state unrepresentable, so the validation problem goes away. (You might still need to check that an insert isn’t past the end of the document, but that’s a much easier check.)
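
A minimal Rust sketch of the difference (the `RemoteInsert` type here is hypothetical, not diamond types' actual wire format): with codepoint offsets the only check is a length comparison, while with byte offsets you can't even tell whether the position is legal without the document text in hand.

    // Hypothetical remote operation, for illustration only.
    struct RemoteInsert {
        pos: usize,      // insertion position named by a remote peer
        content: String, // text to insert
    }

    // Codepoint (scalar value) offsets: a simple bounds check suffices.
    fn validate_codepoint_insert(doc_len_chars: usize, op: &RemoteInsert) -> bool {
        op.pos <= doc_len_chars
    }

    // UTF-8 byte offsets: we need the reconstructed document, because the
    // offset could land in the middle of a multi-byte character.
    fn validate_byte_insert(doc: &str, op: &RemoteInsert) -> bool {
        // is_char_boundary also returns false for positions past the end.
        doc.is_char_boundary(op.pos)
    }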

Almost all application programming languages use utf16 anyway (javascript, c#, swift, Java), so you still need to convert positions. Even in rust it’s common to see line/col positions from text editors.
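
Since a codepoint-based wire format still has to talk to UTF-16 hosts, a conversion lives at the boundary. A rough Rust sketch of one direction (the function name is mine, not from any particular library):

    // Convert a UTF-16 code-unit offset (as reported by e.g. a JavaScript
    // editor) into a scalar-value (char) index. Returns None if the offset
    // lands inside a surrogate pair or past the end of the document.
    fn utf16_offset_to_char_index(doc: &str, utf16_offset: usize) -> Option<usize> {
        let mut units = 0;
        let mut char_index = 0;
        for c in doc.chars() {
            if units == utf16_offset {
                return Some(char_index);
            }
            units += c.len_utf16();
            if units > utf16_offset {
                return None; // offset split a surrogate pair
            }
            char_index += 1;
        }
        // An offset exactly at the end of the document is valid too.
        if units == utf16_offset { Some(char_index) } else { None }
    }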

Using utf8 byte offsets just doesn’t really give you any benefits in exchange for making validation much harder.



The data validation concern seems fair enough.

> Almost all application programming languages use utf16 anyway (javascript, c#, swift, Java)

Swift 5 switched to UTF-8: https://www.swift.org/blog/utf8-string/. I’m hopeful that other UTF-16 environments might eventually manage to switch to UTF-8 internally despite retaining some UTF-16 code unit semantics for compatibility; two projects have already demonstrated you can very practically do this sort of thing: Servo from fairly early on (WTF-8 despite the web’s UTF-16 code unit semantics), and PyPy since 7.1 (UTF-8 despite code point semantics, not sure what they do about surrogate code points). I know the web has largely backed away from UTF-16 and uses code point semantics (well, scalar values plus loose surrogates) on almost all new stuff, with good UTF-8 support too.


> Using Unicode codepoints makes invalid state unrepresentable.

Is that true? Maybe it's not "invalid", but you might very well slice through the middle of a grapheme cluster.


I contemplated querying that myself, but decided that for CRDT editing purposes, it’s probably never practical to think about grapheme clusters. Given text in a language with syllabic grapheme clusters (e.g. Indic languages), if you start with “ba” (one syllable/grapheme cluster) and edit it to “boca” (two syllables/grapheme clusters), you could say “replaced ‘ba’ with ‘boca’”, but I’d be surprised if any CRDT did it that way if it could instead handle it as “inserted ‘oc’”, even though “oc” mightn’t make sense by itself, linguistically. But Unicode doesn’t do much in the way of defining what makes sense or not, and I don’t think there’s any coherent “bad grapheme clustering” detection algorithm. (Aside: on reflection, my Indic languages example is messier still, since -a probably represents the inherent vowel, so in “ba” → “boca” those “a”s are probably actually represented by the absence of a vowel sign code point—and if you wanted to suppress the inherent vowel, you’d need to add a virama sign. Fun stuff.)

But then again, I know that some CRDTs struggle with interleaving, and maybe grapheme-awareness could help things out in some way or other. I dunno.


Yeah I agree. I think it’s inevitable that collaboratively edited documents sometimes end up with grapheme clusters that are considered invalid by some peers, simply because different peers might be using different versions of unicode. If my phone supports the polar bear emoji and yours doesn't, you'll see weird stuff instead of a polar bear. There's no getting around that.

And yes, using unicode codepoints, buggy clients might insert extra unicode characters in the middle of a grapheme cluster. But ... Eh. Fine. I'm not super bothered by that from a data validation perspective.

Why don't I have the same attitude toward invalid UTF8? I mean, the CRDT could synchronize arbitrary arrays of bytes that by agreement contain valid UTF8, and treat invalid bytes as user error in the same way? Two reasons. First, because some languages (e.g. Rust) strictly enforce that all strings must contain valid UTF8. So you can't even turn a document into a String if it contains invalid UTF8. We'd need a try_ codepath, which makes the API worse. Secondly, languages like javascript which store strings using UTF16 don't have an equivalent encoding for invalid UTF8 bytes at all. Javascript would have to store the document internally in a byte array or something, and decode it to a string at the frontend. And that's complex and inefficient. That all sounds much worse to me than just treating the document as a sequence of arbitrary unicode codepoints, which guarantees correctness without any of that mess.
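
A quick sketch of the Rust half of that argument: a document modelled as "bytes that should be UTF8" forces a fallible codepath onto every consumer, whereas a document modelled as scalar values (Rust's char) can't represent invalid UTF8 at all, so converting it to a String can't fail.

    // Bytes that merely promise to be UTF8: callers need a try_ style API.
    fn doc_from_bytes(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
        String::from_utf8(bytes)
    }

    // A sequence of Unicode scalar values: invalid UTF8 is unrepresentable,
    // so the conversion is infallible.
    fn doc_from_chars(chars: &[char]) -> String {
        chars.iter().collect()
    }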


> grapheme clusters that are considered invalid by some peers

I’m not sure if there’s any spec that defines any sequence of Unicode scalar values as “invalid” (though there’s certainly a lot that’s obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness, but just doing something with what it has; so if you inject something into the middle of what it decided was a cluster, it’ll just split it differently.
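
A small demonstration of that "split it differently" behaviour, using the unicode-segmentation crate: "e" plus a combining acute accent segments as one cluster, and injecting a scalar value between them doesn't produce an error, just a different clustering.

    // [dependencies]
    // unicode-segmentation = "1"
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let original = "e\u{0301}"; // "é" as e + combining acute accent
        let spliced = "ex\u{0301}"; // an 'x' injected mid-cluster

        // One cluster: ["é"]
        println!("{:?}", original.graphemes(true).collect::<Vec<_>>());
        // Two clusters: ["e", "x\u{301}"]; no error, just a new split.
        println!("{:?}", spliced.graphemes(true).collect::<Vec<_>>());
    }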

> treating the document as a sequence of arbitrary unicode codepoints

Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.


> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.

Yes, my mistake. I do mean scalar values. I am constantly confused about the terminology for unicode. (Unicode code point? Scalar values? (Character?) Surrogate pair? Is there a term for half of a surrogate pair?)
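
For what it's worth, Rust's char type is defined as exactly "Unicode scalar value", which is every code point except the surrogate range U+D800..=U+DFFF; an unpaired half of a surrogate pair is usually just called a lone (or unpaired) surrogate. A tiny sketch:

    fn main() {
        // Scalar values round-trip through char just fine.
        assert!(char::from_u32(0x0041).is_some());  // 'A'
        assert!(char::from_u32(0x1F43B).is_some()); // U+1F43B BEAR FACE
        // A surrogate is a code point but not a scalar value, so it can
        // never be a char at all.
        assert!(char::from_u32(0xD800).is_none());
    }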



