Hacker News

> it cannot leave the input data as an undecoded bag of bytes

But all it's doing here is taking a hex string (which is entirely ASCII) and converting it into the corresponding bytes. Since ASCII translates unambiguously to bytes, it doesn't really matter whether `str[0]` operates on a byte stream, a codepoint stream, or a grapheme stream, because in UTF-8 they're all the same thing as long as we stay within the ASCII range.
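To make that concrete, here's a minimal sketch in Python (`hex_to_bytes` is a hypothetical helper; the stdlib's `bytes.fromhex` does the same job). Every character the loop touches is a single ASCII character, so it behaves identically whether indexing yields bytes, code points, or graphemes:

```python
def hex_to_bytes(s: str) -> bytes:
    """Decode a hex string pair-by-pair; safe because hex digits are ASCII."""
    out = bytearray()
    for i in range(0, len(s), 2):
        # s[i] and s[i + 1] are single ASCII characters regardless of
        # whether the string is viewed as bytes, codepoints, or graphemes
        out.append(int(s[i], 16) * 16 + int(s[i + 1], 16))
    return bytes(out)

assert hex_to_bytes("48656c6c6f") == b"Hello"
assert hex_to_bytes("48656c6c6f") == bytes.fromhex("48656c6c6f")
```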

Where things get hairy is stuff like `str.reverse()` over arbitrary strings that may or may not be ASCII. This repo[0] discusses some of the challenges of conflating characters with either bytes or codepoints. The problem is that programming languages often approach strings from the wrong angle: you can't just tack multi-byte codepoint handling on top of ASCII handling. Doing so costs you O(1) random access, and it still doesn't model the linguistic domain properly, because humans think of characters not in terms of bytes or codepoints but in terms of grapheme clusters. Clustering correctness falls deep in the realm of linguistics, and is therefore arguably better handled by a library than by a programming language.
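A quick Python illustration of the failure mode: reversing per-codepoint detaches combining marks from their base characters. (Only the breakage is shown here; a grapheme-correct reversal would need a segmentation library, e.g. one implementing UAX #29 clusters, which the stdlib doesn't provide.)

```python
# "noël" spelled with a combining diaeresis: 4 graphemes, 5 codepoints
s = "noe\u0308l"

backwards = s[::-1]  # naive per-codepoint reversal

# The combining mark U+0308 now follows "l" instead of "e", so the
# accent visually jumps to the wrong letter: "l̈eon" instead of "lëon"
assert backwards == "l\u0308eon"
assert backwards != "le\u0308on"  # the grapheme-correct reversal
```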

[0] https://github.com/mathiasbynens/esrever



I agree entirely with your second paragraph, but regarding this:

> hex string (which is entirely ASCII)

My point is that JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc.; so long as the text consists of Unicode code points in some Unicode transformation format, the JSON is valid.

As I said in the parent comment: if you are working within UTF-8 exclusively, and can assume valid UTF-8, then great! But this isn't necessarily true, and in some cases, you will still need to care about the encoding.

(Either way, this starts straying slightly from the more general discussion at hand: regardless of the encoding of the string, you will still need an ergonomic way of interacting with the contents of the data in order to meaningfully parse the contents — even past the hurdle of decoding from arbitrary bytes, you still need to manipulate the data reasonably. In some cases, this means working with a buffer of bytes; in others, it makes sense to manipulate the data as a string... In which case, you may run into some of the string manipulation ergonomic considerations being discussed around these comments.)
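(As one data point on ergonomics: Python's stdlib already papers over part of this hurdle. `json.loads` accepts either a decoded `str` or a raw `bytes` payload in UTF-8, UTF-16, or UTF-32, doing the encoding detection internally:

```python
import json

payload = {"key": "värde"}  # non-ASCII content on purpose

for enc in ("utf-8", "utf-16-le", "utf-32-be"):
    raw = json.dumps(payload, ensure_ascii=False).encode(enc)
    # json.loads detects UTF-8/-16/-32 from the byte pattern itself,
    # so the caller never has to decode the buffer explicitly
    assert json.loads(raw) == payload
```

Once you step outside the encodings the parser knows about, though, the decode step is back in your hands.)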


> JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc

Sure, it can also be gzipped, encrypted, etc., but that goes back to the point that there's nothing inherently special about JSON as it relates to encoding to a byte stream. All there is to it is that somewhere in a program there's an encode/decode contract for extracting meaning from the byte stream, and at the protocol level one most likely looks only at sequences of bytes (because, performance-wise, it doesn't make sense to measure payload size in codepoints or graphemes there).
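A small sketch of why protocols count bytes rather than characters: the two diverge as soon as the payload leaves the ASCII range, and a Content-Length-style header has to describe the encoded wire form.

```python
body = '{"name": "Zoë"}'
encoded = body.encode("utf-8")

# "ë" is one codepoint (and one grapheme) but two UTF-8 bytes,
# so the character count and the on-the-wire byte count differ
assert len(body) == 15
assert len(encoded) == 16
```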



