TLDR: node sucks

akie · on Nov 26, 2012

TLDR: The V8 engine (supposedly) can't encode Unicode code points that need more than 16 bits, because it uses the UCS-2 encoding.
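
For anyone who wants to see what "more than 16 bits" means in practice, here is a rough sketch in Python (not V8 code, and the function names are made up for illustration). It assumes the standard UTF-16 rule: a code point above U+FFFF is split into a high and a low surrogate, each a 16-bit code unit. UCS-2 has no such rule, which is why it stops at U+FFFF.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units Python's own UTF-16 encoder produces."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# U+1F4A9 is one code point but two UTF-16 code units.
pair = to_surrogate_pair(0x1F4A9)
units = utf16_code_units("\U0001F4A9")
print([hex(u) for u in pair], [hex(u) for u in units])
```

Both routes should give the same two code units, and a plain "A" should give one.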

throwaway54-762 · on Nov 26, 2012

TLDR: v8 "sucks" (and doesn't support Unicode code points outside of the lowest ~64k characters).

Edit: v8 in general is pretty cool, but not supporting Unicode outside UCS-2 is pretty bad.

marshray · on Nov 27, 2012

Most apps seem to "support" surrogate pairs by simply not being aware of them at all.

Good on the V8 developers for recognizing these conditions that their code didn't fully handle and refusing to muddle on through with broken processing.

tptacek · on Nov 26, 2012

It's v8's fault, and v8 does not suck.

prodigal_erik · on Nov 26, 2012

Unicode 2.0 added surrogate pairs in 1996. Unfortunately, the first versions of both Java and JavaScript predated this and got strings horribly wrong, and now any conforming implementation of either is required to suck. The Right Thing would be for almost everyone to work with only combining character sequences, except for a rare few who need to know how to dissect one into its codepoints and reassemble them correctly (just as people don't normally need to extract high or low bits from an ASCII character).

jrabone · on Nov 26, 2012

No. Combining characters and the NF(K)C/NF(K)D normalisation rules are a different problem entirely. Consider the "heavy metal umlaut" (i.e. Spın̈al Tap), where no lossless conversion is possible: there is only "n" followed by U+0308.
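
A quick way to check this, sketched with Python's standard `unicodedata` module (the variable names are just for illustration): NFC composes "u" + U+0308 into the single code point U+00FC, but leaves "n" + U+0308 as two code points because there is nothing to compose it into.

```python
import unicodedata

# "u" followed by COMBINING DIAERESIS has a precomposed form (U+00FC).
u_umlaut = unicodedata.normalize("NFC", "u\u0308")

# "n" followed by COMBINING DIAERESIS has no precomposed form,
# so NFC has to leave it as two code points.
n_umlaut = unicodedata.normalize("NFC", "n\u0308")

print(len(u_umlaut), len(n_umlaut))
```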

prodigal_erik · on Nov 27, 2012

They're facets of the same problem. I shouldn't routinely be dealing with either surrogates or combining marks; unless I have a specific reason, it's only an opportunity to make a mistake that hardly anyone knows how to troubleshoot. "n̈" should be an indivisible string of length one until I need to ask how it would actually be encoded in UTF-16 or whatever.
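
Roughly what I mean, as a sketch only: the helper below (a hypothetical name, not a library function) glues each combining mark onto the character before it. That is an approximation I'm assuming for illustration; real grapheme cluster segmentation follows more rules than this (Hangul, emoji sequences and so on).

```python
import unicodedata
from typing import List


def approximate_clusters(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Approximation only: it checks the canonical combining class and
    ignores the rest of the grapheme cluster rules.
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


word = "Sp\u0131n\u0308al"          # dotless i, then n + U+0308
print(len(word), len(approximate_clusters(word)))
```

Here the string has seven code points but six clusters, with "n" and U+0308 kept together as one unit.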

jrabone · on Nov 27, 2012

But that's the point - there is no such character. Given that the Unicode Consortium has added code points for every other bloody thing under the sun, I'm amazed that there isn't one for n-diaeresis, but there you are.

Add a small number of people who for artistic reasons decide that they want to make life hard (Rinôçérôse I'm looking at you) and you just have to accept that the length of your string might not equal the number of codepoints contained therein...

sneak · on Nov 26, 2012

Damn you for being right. :)

rymith · on Nov 26, 2012

Not even a little bit accurate.