Hacker News

> it cannot leave the input data as an undecoded bag of bytes

But all it's doing here is taking a hex string (which is entirely ASCII) and converting it into the corresponding bytes. Since ASCII translates unambiguously to bytes, it doesn't really matter whether `str[0]` operates on a byte stream, a codepoint stream, or a grapheme stream, because in UTF-8 they're all the same thing as long as we stay within the ASCII range.
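To make that concrete, here's a minimal sketch in Python (`hex_to_bytes` is a hypothetical helper; the stdlib's `bytes.fromhex` does the same job). Every character the loop touches is a single ASCII character, so it behaves identically whether indexing yields bytes, code points, or graphemes:

```python
def hex_to_bytes(s: str) -> bytes:
    """Decode a hex string pair-by-pair; safe because hex digits are ASCII."""
    out = bytearray()
    for i in range(0, len(s), 2):
        # s[i] and s[i + 1] are single ASCII characters regardless of
        # whether the string is viewed as bytes, codepoints, or graphemes
        out.append(int(s[i], 16) * 16 + int(s[i + 1], 16))
    return bytes(out)

assert hex_to_bytes("48656c6c6f") == b"Hello"
assert hex_to_bytes("48656c6c6f") == bytes.fromhex("48656c6c6f")
```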

Where things get hairy is stuff like `str.reverse()` over arbitrary strings that may or may not be ASCII. This repo[0] discusses some of the challenges of conflating characters with either bytes or codepoints. The problem is that programming languages often approach strings from the wrong angle: you can't just tack multi-byte codepoint handling on top of ASCII handling. Doing so costs you O(1) random access, and it still doesn't model the linguistic domain properly, because humans think of characters not in terms of bytes or codepoints but in terms of grapheme clusters. Clustering correctness falls deep in the realm of linguistics, and is therefore arguably better handled by a library than by a programming language.
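A quick Python illustration of the failure mode: reversing per-codepoint detaches combining marks from their base characters. (Only the breakage is shown here; a grapheme-correct reversal would need a segmentation library, e.g. one implementing UAX #29 clusters, which the stdlib doesn't provide.)

```python
# "noël" spelled with a combining diaeresis: 4 graphemes, 5 codepoints
s = "noe\u0308l"

backwards = s[::-1]  # naive per-codepoint reversal

# The combining mark U+0308 now follows "l" instead of "e", so the
# accent visually jumps to the wrong letter: "l̈eon" instead of "lëon"
assert backwards == "l\u0308eon"
assert backwards != "le\u0308on"  # the grapheme-correct reversal
```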

[0] https://github.com/mathiasbynens/esrever



I agree entirely with your second paragraph, but regarding this:

> hex string (which is entirely ASCII)

My point is that JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc.; so long as the text consists of Unicode code points in some Unicode transformation format, the JSON is valid.

As I said in the parent comment: if you are working within UTF-8 exclusively, and can assume valid UTF-8, then great! But this isn't necessarily true, and in some cases, you will still need to care about the encoding.

(Either way, this starts straying slightly from the more general discussion at hand: regardless of the encoding of the string, you will still need an ergonomic way of interacting with the contents of the data in order to meaningfully parse the contents — even past the hurdle of decoding from arbitrary bytes, you still need to manipulate the data reasonably. In some cases, this means working with a buffer of bytes; in others, it makes sense to manipulate the data as a string... In which case, you may run into some of the string manipulation ergonomic considerations being discussed around these comments.)
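(As one data point on ergonomics: Python's stdlib already papers over part of this hurdle. `json.loads` accepts either a decoded `str` or a raw `bytes` payload in UTF-8, UTF-16, or UTF-32, doing the encoding detection internally:

```python
import json

payload = {"key": "värde"}  # non-ASCII content on purpose

for enc in ("utf-8", "utf-16-le", "utf-32-be"):
    raw = json.dumps(payload, ensure_ascii=False).encode(enc)
    # json.loads detects UTF-8/-16/-32 from the byte pattern itself,
    # so the caller never has to decode the buffer explicitly
    assert json.loads(raw) == payload
```

Once you step outside the encodings the parser knows about, though, the decode step is back in your hands.)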


> JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc

Sure, it can also be gzipped, encrypted, etc., but that goes back to the point that there's nothing inherently special about JSON as it relates to encoding to a byte stream. All there is to it is that somewhere in a program there's an encode/decode contract for extracting meaning from the byte stream, and at the protocol level one most likely looks only at sequences of bytes (because, performance-wise, it doesn't make sense to measure payload size in codepoints or graphemes there).
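A small sketch of why protocols count bytes rather than characters: the two diverge as soon as the payload leaves the ASCII range, and a Content-Length-style header has to describe the encoded wire form.

```python
body = '{"name": "Zoë"}'
encoded = body.encode("utf-8")

# "ë" is one codepoint (and one grapheme) but two UTF-8 bytes,
# so the character count and the on-the-wire byte count differ
assert len(body) == 15
assert len(encoded) == 16
```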



