
> Code point length is the most useful for people who are actually writing string algorithms based upon Unicode.

What algorithms would you be writing against code points?



Everything that you can do to a Unicode string, except concatenation, is defined in terms of code points: normalization, case transformations, collation, regexes, layout and rendering, and encoding.

For example, let’s say you want to define a “natural sort” order that sorts e.g. “A2” < “A10”. To do that, you divide the string at the boundaries between runs of code points belonging to each numeral type you support (e.g. Western numerals, Arabic numerals, Chinese numerals).
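
A rough sketch of that kind of boundary splitting in Rust (my own illustration, only handling ASCII digits; a real implementation would recognise the other numeral ranges too):

    use std::cmp::Ordering;

    // Split a string into runs of digits and non-digits, then compare run by
    // run, treating digit runs as numbers so that "A2" sorts before "A10".
    fn runs(s: &str) -> Vec<String> {
        let mut out: Vec<String> = Vec::new();
        let mut prev_is_digit: Option<bool> = None;
        for c in s.chars() {
            let is_digit = c.is_ascii_digit();
            if prev_is_digit != Some(is_digit) {
                out.push(String::new()); // new run at every digit/non-digit boundary
            }
            out.last_mut().unwrap().push(c);
            prev_is_digit = Some(is_digit);
        }
        out
    }

    fn natural_cmp(a: &str, b: &str) -> Ordering {
        let (ra, rb) = (runs(a), runs(b));
        for (x, y) in ra.iter().zip(rb.iter()) {
            let ord = match (x.parse::<u64>(), y.parse::<u64>()) {
                (Ok(nx), Ok(ny)) => nx.cmp(&ny), // numeric runs compare by value
                _ => x.cmp(y),                   // other runs compare lexicographically
            };
            if ord != Ordering::Equal {
                return ord;
            }
        }
        ra.len().cmp(&rb.len())
    }

    // natural_cmp("A2", "A10") == Ordering::Less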


Wouldn't layout/rendering etc. be done in terms of grapheme clusters?


Codepoints are best for collaborative text editing / CRDTs (diamond types, automerge, etc). We generally model the document as a big list of Unicode codepoints.

We could use grapheme clusters, but the grapheme cluster boundary points change as unicode evolves, and not all systems update at the same time. Separating strings based on grapheme cluster boundaries also requires a big lookup table to be embedded in every app. Unicode codepoints are obvious, stable, and easy to work with. And they're encoding-agnostic, so there's no weird UCS2-to-UTF8-bytes conversion needed in javascript, C#, etc.
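
To make that concrete, here's a toy sketch (not diamond types' actual data structures) of a document as a sequence of scalar values, with an insert addressed by codepoint index; the same index means the same thing whether a peer stores its strings as UTF-8 or UTF-16:

    // Toy illustration only: a document as a sequence of Unicode scalar values.
    struct Doc {
        content: Vec<char>, // one entry per scalar value
    }

    impl Doc {
        // `pos` is a codepoint index: any pos <= content.len() is a well-formed
        // insertion point; there is no "middle of a character" to land inside.
        fn insert(&mut self, pos: usize, text: &str) {
            assert!(pos <= self.content.len());
            for (i, c) in text.chars().enumerate() {
                self.content.insert(pos + i, c);
            }
        }

        fn as_string(&self) -> String {
            self.content.iter().collect()
        }
    }

    // let mut d = Doc { content: "héllo".chars().collect() };
    // d.insert(1, "xyz"); // "index 1" is the same position on every peer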


Using code points (or scalar values, I hope) just means that it’s inefficient for everyone, because now everyone has to convert indexes (well, except Python, but it has other problems), instead of only half the people.

Going UTF-8 is fairly clearly superior: it will be the wire format, even if it’s not the language’s string format, so now environments that use UTF-8 strings never need any conversions (apart from decoding escape sequences, most likely).

Much as I hate UTF-16, I would even be inclined to argue that UTF-16 was a better choice than code points, as it would reduce the amount of extra work UTF-16 environments have to do without changing how much UTF-8 environments have to do at all. But it also has the disadvantage that validation is vanishingly rare in UTF-16, so you’re sure to end up with lone-surrogate trouble at some point, whereas UTF-8 tooling has a much stronger culture of validation, so you’re much less likely to encounter it directly and can much more comfortably just declare “valid Unicode only”.

Yes, code points are a purer concept to use. I don’t care: it’s less efficient than choosing UTF-8, which adds negative-to-negligible complexity. Please, just abandon code point indexing and embrace UTF-8.
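
For reference, the conversion I’m complaining about looks like this in a UTF-8 environment (a sketch; it’s a linear scan per lookup):

    // Turn a codepoint index into a UTF-8 byte offset. This O(n) scan is the
    // extra work a UTF-8 environment pays when the protocol addresses
    // codepoints instead of bytes.
    fn codepoint_index_to_byte_offset(s: &str, cp_index: usize) -> Option<usize> {
        s.char_indices()
            .map(|(byte_offset, _)| byte_offset)
            .chain(std::iter::once(s.len())) // allow an index one past the last scalar value
            .nth(cp_index)
    }

    // codepoint_index_to_byte_offset("héllo", 2) == Some(3), because 'é' is two bytes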


No, I don’t agree.

The problem with utf8 byte offsets is that it creates a data validation problem. In diamond types I’m using document positions / offsets in my wire format. With utf8 byte offsets, you can receive changes from remote peers which name invalid insertion positions (i.e. an insert inside a character, or deleting half of a codepoint). Validating remote changes received like this is a nightmare, because you need to reconstruct the whole document state to be able to tell if the edit is valid. Using Unicode codepoints makes invalid state unrepresentable. So the validation problem goes away. (You might still need to check that an insert isn’t past the end of the document, but that’s a much easier check.)
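
Roughly, the asymmetry is this (a simplified sketch, not the real diamond types code):

    // Byte offsets: you need the reconstructed document itself, because a
    // position is only legal if it lands on a character boundary.
    fn byte_offset_is_valid(doc: &str, byte_pos: usize) -> bool {
        doc.is_char_boundary(byte_pos) // false inside a multi-byte sequence or past the end
    }

    // Codepoint indexes: no document text needed, only its length.
    fn codepoint_index_is_valid(doc_len_in_codepoints: usize, cp_pos: usize) -> bool {
        cp_pos <= doc_len_in_codepoints
    }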

Almost all application programming languages use UTF-16 anyway (JavaScript, C#, Swift, Java), so you still need to convert positions. Even in Rust it’s common to see line/col positions from text editors.

Using utf8 byte offsets just doesn’t really give you any benefits in exchange for making validation much harder.


The data validation concern seems fair enough.

> Almost all application programming languages use UTF-16 anyway (JavaScript, C#, Swift, Java)

Swift 5 switched to UTF-8: https://www.swift.org/blog/utf8-string/. I’m hopeful that other UTF-16 environments might eventually manage to switch to UTF-8 internally despite retaining some UTF-16 code unit semantics for compatibility; two projects have already demonstrated you can very practically do this sort of thing: Servo from fairly early on (WTF-8 despite the web’s UTF-16 code unit semantics), and PyPy since 7.1 (UTF-8 despite code point semantics, not sure what they do about surrogate code points). I know the web has largely backed away from UTF-16 and uses code point semantics (well, scalar values plus loose surrogates) on almost all new stuff, with good UTF-8 support too.


> Using Unicode codepoints makes invalid state unrepresentable.

Is that true? Maybe it's not "invalid", but you might very well slice through the middle of a grapheme cluster.


I contemplated querying that myself, but decided that for CRDT editing purposes, it’s probably never practical to think about grapheme clusters. Given text in a language with syllabic grapheme clusters (e.g. Indic languages), if you start with “ba” (one syllable/grapheme cluster) and edit it to “boca” (two syllables/grapheme clusters), you could say “replaced ‘ba’ with ‘boca’”, but I’d be surprised if any CRDT did it that way if it could instead handle it as “inserted ‘oc’”, even though “oc” mightn’t make sense by itself, linguistically. But Unicode doesn’t do much in the way of defining what makes sense or not, and I don’t think there’s any coherent “bad grapheme clustering” detection algorithm. (Aside: on reflection, my Indic languages example is messier still, since -a probably represents the inherent vowel, so in “ba” → “boca” those “a”s are probably actually represented by the absence of a vowel sign code point—and if you wanted to suppress the inherent vowel, you’d need to add a virama sign. Fun stuff.)

But then again, I know that some CRDTs struggle with interleaving, and maybe grapheme-awareness could help things out in some way or other. I dunno.


Yeah, I agree. I think it's inevitable that collaboratively edited documents sometimes end up with grapheme clusters that are considered invalid by some peers, simply because different peers might be using different versions of unicode. If my phone supports the polar bear emoji and yours doesn't, you'll see weird stuff instead of a polar bear. There's no getting around that.

And yes, using unicode codepoints, buggy clients might insert extra unicode characters in the middle of a grapheme cluster. But ... Eh. Fine. I'm not super bothered by that from a data validation perspective.

Why don't I have the same attitude toward invalid UTF8? I mean, the CRDT could synchronize arbitrary arrays of bytes that by agreement contain valid UTF8, and treat it as user error in the same way if that happens? Two reasons.

First, because some languages (e.g. Rust) strictly enforce that all strings must contain valid UTF8. So you can't even make a document into a String if it has invalid UTF8. We'd need a try_ codepath, which makes the API worse. Second, languages like JavaScript which store strings using UTF16 don't have an equivalent encoding for invalid UTF8 bytes at all. JavaScript would have to store the document internally in a byte array or something, and decode it to a string at the frontend. And that's complex and inefficient.

That all sounds much worse to me than just treating the document as a sequence of arbitrary unicode codepoints, which guarantees correctness and avoids all of that mess.
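
The Rust half of that, as a sketch:

    // Arbitrary bytes need a fallible conversion, while a sequence of chars
    // (scalar values) always collects into a valid String.
    fn bytes_to_string(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
        String::from_utf8(bytes) // the try_ codepath: fails on invalid UTF-8
    }

    fn chars_to_string(codepoints: &[char]) -> String {
        codepoints.iter().collect() // infallible: every char is a valid scalar value
    }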


> grapheme clusters that are considered invalid by some peers

I’m not sure if there’s any spec that defines any sequence of Unicode scalar values as “invalid” (though there’s certainly a lot that’s obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness, but just doing something with what it has; so if you inject something into the middle of what it decided was a cluster, it’ll just split it differently.
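
For instance, with the unicode-segmentation crate (one common way to get UAX #29 clusters in Rust; an illustration, not anyone's production code):

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let before = "e\u{0301}";  // 'e' + combining acute accent: one cluster, "é"
        let after  = "ex\u{0301}"; // an 'x' injected in the middle: two clusters, "e" and "x́"
        println!("{:?}", before.graphemes(true).collect::<Vec<_>>()); // ["é"]
        println!("{:?}", after.graphemes(true).collect::<Vec<_>>());  // ["e", "x́"]
    }

Segmentation doesn’t reject the second string; it just draws the cluster boundaries differently.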

> treating the document as a sequence of arbitrary unicode codepoints

Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
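
In Rust terms (a quick illustration): a code point is any value in 0..=0x10FFFF, a scalar value is a code point that isn’t a surrogate, and the char type only holds scalar values, so surrogates can’t even be constructed:

    fn main() {
        assert!(char::from_u32(0x0041).is_some());  // 'A': a scalar value
        assert!(char::from_u32(0xD800).is_none());  // a lone surrogate: not representable as char
        assert!(char::from_u32(0x1F600).is_some()); // 😀 needs a surrogate *pair* in UTF-16,
                                                    // but is a single scalar value
    }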


> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.

Yes, my mistake. I do mean scalar values. I am constantly confused about the terminology for unicode. (Unicode code point? Scalar values? (Character?) Surrogate pair? Is there a term for half of a surrogate pair?)


Isn't that the level emoji sequences work at?


I suspect primarily substring, where if you index by bytes you'll mangle the string, but if you index by codepoints everything works out.


You'll mangle the string if you index and search by code points too, when the string contains the emoji this article is about, or for that matter an "e" followed by a combining acute accent.

The string will be fine if you only move to, split and concatenate at indices outside those grapheme clusters. But that is also true when indexing by bytes or UTF-16 code units.

So in some senses, indexing by bytes is just as good as indexing by code points, but faster. Either way, to avoid mangling strings you need to restrict the indices of whatever type to meaningful character boundaries.
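
Concretely, a quick sketch of the same point in Rust:

    fn main() {
        let s = "e\u{0301}!"; // 'e' + combining acute + '!': two grapheme clusters
        let first_codepoint: String = s.chars().take(1).collect();
        println!("{first_codepoint}"); // "e" with its accent stripped: the cluster got mangled
    }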

If you have decided to avoid string indices inside grapheme clusters, there comes the awkward question of what you should do when editing text in an environment that renders font ligatures, like "->" displayed as → (rightward arrow). From one perspective, that's just a font. From another, the user sees a single character, yet there are valid positions (such as from cursor movement and character search) that land midway through the character, and editing at those positions changes the character. Neither is clearly best for all situations.



