Hacker News



> The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.

The native JS semantics are UCS-2. Saying that it's UTF-16 is misleading and confuses charset, encoding and browser APIs.

Ladybird is probably implementing support properly but it's annoying that they keep spreading the confusion in their article.


It's not cleanly one or the other, really. It's UCS-2-y by `str.length` or `str[i]`, but UTF-16-y by `str.codePointAt(i)` or by iteration (`[...str]` or `for (x of str)`).

Generally, though, JS strings are just sequences of 16-bit values, intrinsically neither UCS-2 nor UTF-16. But, practically speaking, UTF-16 is the description that matters for everything other than writing `str.length`/`str[i]`.
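A quick sketch of the split, using an astral-plane character (U+1F600, which UTF-16 encodes as a surrogate pair):

```javascript
// U+1F600 GRINNING FACE sits outside the BMP, so UTF-16 stores it
// as the surrogate pair 0xD83D 0xDE00.
const s = "😀";

// UCS-2-flavoured APIs see two 16-bit code units:
console.log(s.length);          // 2
console.log(s.charCodeAt(0));   // 55357 (0xD83D, high surrogate)

// UTF-16-aware APIs see one code point:
console.log(s.codePointAt(0));  // 128512 (0x1F600)
console.log([...s].length);     // 1 (iteration walks code points)
```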


• Regular indexing (also charAt and charCodeAt) is by UTF-16 code unit and produces UTF-16 code units.

• codePointAt is indexed by UTF-16 code unit, but produces Unicode code points (normally scalar values, but surrogates where ill-formed).

• String iteration doesn’t need indexing, and thus is Unicody, not UTF-16y.

• Approximately everything that JavaScript interacts with is actually UTF-8 now: URIs have long been UTF-8 (hence encodeURI/decodeURI/encodeURIComponent being UTF-8y).

• Where appropriate, new work favours UTF-8 semantics.
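The bullet points above are easy to see in a REPL (a small sketch; the index-1 case shows the "surrogates where ill-formed" behaviour of codePointAt):

```javascript
const emoji = "😀"; // U+1F600 = surrogate pair 0xD83D 0xDE00

// codePointAt is indexed by UTF-16 code unit; index 1 lands mid-pair,
// so it hands back the lone trailing surrogate as the "code point":
console.log(emoji.codePointAt(0).toString(16)); // "1f600"
console.log(emoji.codePointAt(1).toString(16)); // "de00"

// Iteration doesn't need an index: one step per code point.
for (const ch of emoji) console.log(ch); // one iteration, "😀"

// The URI functions percent-encode the UTF-8 bytes, not UTF-16 units:
console.log(encodeURIComponent("€")); // "%E2%82%AC" (UTF-8 of U+20AC)
```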

—⁂—

Overall, I’d say it’s most reasonable to frame it this way:

① JavaScript models strings as potentially-ill-formed UTF-16. (I prefer the word “models” to the word “represents” here, because the latter suggests a specific storage, which is not actually necessary.)

② Old parts of JavaScript depend on indexing, and use potentially-ill-formed UTF-16 code unit semantics.

③ New parts of JavaScript avoid indexing, and use Unicode semantics.
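One way to see ① through ③ concretely (a sketch, using a deliberately unpaired surrogate):

```javascript
// JS will happily hold a lone surrogate: the string model is
// potentially-ill-formed UTF-16, so this is a legal string.
const broken = "a\uD800b"; // unpaired high surrogate in the middle

// Old, index-based APIs don't care:
console.log(broken.length);        // 3
console.log(broken.charCodeAt(1)); // 55296 (0xD800)

// Newer, Unicode-minded paths notice. encodeURIComponent must emit
// UTF-8, which cannot represent a lone surrogate, so it throws:
try {
  encodeURIComponent(broken);
} catch (e) {
  console.log(e.name); // "URIError"
}
```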


And most mainstream GUI toolkits are, as well. It can be said that UTF-16 is the de-facto standard in-memory representation of Unicode strings, even though some languages (e.g. Rust) prefer UTF-8.


> And most mainstream GUI toolkits are, as well.

No. Windows uses UTF-16 internally. Most GUI toolkits do not.

> It can be said that UTF-16 is the de-facto standard in-memory representation of unicode strings, even though some runtimes (Rust) prefer UTF-8.

No, that wouldn't be true at all.

Your technical knowledge seems to be limited to your Windows experience, and even that is dated.

Microsoft has recommended UTF-8 over UTF-16 since 2019 [1].

1: https://learn.microsoft.com/en-us/windows/apps/design/global...


> Most GUI toolkits do not.

Why are you guys talking as if there were dozens of GUI toolkits in mainstream use? It's basically web stuff, Qt, and then everything else. The web would be UTF-16 as discussed above, Qt is UTF-16, and even if we count the admittedly large (if behind-the-scenes) Java/.NET market, that's also all UTF-16. wxWidgets, sitting on the fence, can do both UTF-8 and UTF-16, depending on the platform.

Which players am I missing? GTK and ImGui? I don't think they're too big a slice of this pie, certainly not big enough to invalidate the claim.


Anything that uses the C stdlib at some point.


Apple also uses UTF-16 internally (NSString/CFString expose UTF-16 code units), afaik



