
I contemplated querying that myself, but decided that for CRDT editing purposes, it’s probably never practical to think about grapheme clusters. Given text in a language with syllabic grapheme clusters (e.g. Indic languages), if you start with “ba” (one syllable/grapheme cluster) and edit it to “boca” (two syllables/grapheme clusters), you could say “replaced ‘ba’ with ‘boca’”, but I’d be surprised if any CRDT did it that way if it could instead handle it as “inserted ‘oc’”, even though “oc” mightn’t make sense by itself, linguistically. But Unicode doesn’t do much in the way of defining what makes sense or not, and I don’t think there’s any coherent “bad grapheme clustering” detection algorithm. (Aside: on reflection, my Indic languages example is messier still, since -a probably represents the inherent vowel, so in “ba” → “boca” those “a”s are probably actually represented by the absence of a vowel sign code point—and if you wanted to suppress the inherent vowel, you’d need to add a virama sign. Fun stuff.)
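To sketch what I mean by codepoint granularity in Rust, with plain Latin letters standing in for the real Devanagari clusters (a toy illustration, not any actual CRDT's API):

    fn main() {
        // Scalar-value granularity: "ba" -> "boca" is recorded as
        // "insert ['o', 'c'] at index 1", with no opinion on whether
        // "oc" makes sense as a linguistic unit on its own.
        let mut doc: Vec<char> = "ba".chars().collect();
        for (i, c) in "oc".chars().enumerate() {
            doc.insert(1 + i, c);
        }
        assert_eq!(doc.iter().collect::<String>(), "boca");
    }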

But then again, I know that some CRDTs struggle with interleaving, and maybe grapheme-awareness could help things out in some way or other. I dunno.



Yeah, I agree. I think it's inevitable that collaboratively edited documents sometimes end up with grapheme clusters that are considered invalid by some peers, simply because different peers might be running different versions of Unicode. If my phone supports the polar bear emoji and yours doesn't, you'll see weird stuff instead of a polar bear. There's no getting around that.
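For what it's worth, the polar bear is exactly this kind of thing: a ZWJ sequence. A quick Rust sketch of its scalar values (the sequence itself is standard; how it renders is up to each platform):

    fn main() {
        // The polar bear emoji is a ZWJ sequence: four scalar values
        // that newer platforms render as one glyph, and older ones as
        // a bear glyph followed by a snowflake (or worse).
        let polar_bear = "\u{1F43B}\u{200D}\u{2744}\u{FE0F}";
        for c in polar_bear.chars() {
            println!("U+{:04X}", c as u32);
        }
        // Prints: U+1F43B (bear face), U+200D (zero width joiner),
        // U+2744 (snowflake), U+FE0F (variation selector-16)
    }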

And yes, using Unicode codepoints, buggy clients might insert extra characters in the middle of a grapheme cluster. But... eh. Fine. I'm not super bothered by that from a data validation perspective.

Why don't I have the same attitude toward invalid UTF-8? I mean, the CRDT could synchronize arbitrary byte arrays that by convention contain valid UTF-8, and treat any violation as user error in the same way. Two reasons. First, some languages (e.g. Rust) strictly enforce that all strings contain valid UTF-8, so you can't even turn a document into a String if it holds invalid bytes. We'd need a try_ codepath, which makes the API worse. Second, languages like JavaScript, which store strings as UTF-16, have no way to represent arbitrary invalid UTF-8 bytes at all. JavaScript would have to store the document internally in a byte array or something and decode it to a string at the frontend, which is complex and inefficient. That all sounds much worse to me than just treating the document as a sequence of arbitrary Unicode codepoints, which guarantees correctness without any of that mess.
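To illustrate the Rust half of that (a toy sketch, not the real CRDT API):

    fn main() {
        // Rust refuses to build a String from invalid UTF-8; you only
        // get a Result, hence the try_ codepath mentioned above.
        let bytes = vec![0x68, 0x69, 0xFF]; // "hi" plus a stray 0xFF byte
        assert!(String::from_utf8(bytes).is_err());

        // A sequence of scalar values, by contrast, always re-encodes
        // cleanly, so a codepoint-based document never hits that path.
        let doc = vec!['h', 'i', '🐻'];
        let s: String = doc.into_iter().collect();
        assert_eq!(s, "hi🐻");
    }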


> grapheme clusters that are considered invalid by some peers

I’m not sure any spec defines a sequence of Unicode scalar values as “invalid” (though there’s certainly a lot that’s obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness; it just does something with whatever it’s given, so if you inject something into the middle of what it decided was a cluster, it’ll just split it differently.
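To make that concrete, here’s a sketch in Rust using the unicode-segmentation crate (the exact boundaries depend on which Unicode version its data tables track):

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        // "e" + COMBINING ACUTE ACCENT is one extended grapheme cluster.
        let before = "e\u{301}";
        assert_eq!(before.graphemes(true).count(), 1);

        // Injecting a scalar value between them doesn't produce anything
        // "invalid": the segmenter just draws the boundaries differently,
        // and the combining mark attaches to the new base character.
        let after = "ex\u{301}";
        let clusters: Vec<&str> = after.graphemes(true).collect();
        assert_eq!(clusters, ["e", "x\u{301}"]);
    }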

> treating the document as a sequence of arbitrary unicode codepoints

Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
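(In Rust terms, char already bakes in that validation; a quick sketch:)

    fn main() {
        // Every Rust char is a Unicode scalar value by construction:
        // the surrogate range U+D800..=U+DFFF is rejected outright.
        assert!(char::from_u32(0xD800).is_none()); // lone surrogate: no
        assert!(char::from_u32(0x1F43B).is_some()); // bear face: fine
    }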


> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.

Yes, my mistake. I do mean scalar values. I’m constantly confused by Unicode terminology. (Unicode code point? Scalar value? (Character?) Surrogate pair? Is there a term for half of a surrogate pair?)



