Well, let's try 😀,😀, which is `d83d de00 002c d83d de00` in UTF-16BE. If you extract the first column by naively cutting the blob bytewise before the comma, you end up with `d83d de00 00`, with an extra trailing NUL byte, which is a problem. With UTF-16LE you'd instead prepend a NUL to the second column, which is even worse.
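A quick sketch of that failure mode (my own illustration, not from the post), splitting the UTF-16 byte stream at the first 0x2C byte:

```python
# Naive bytewise split of a UTF-16BE CSV row at the first 0x2C byte.
row = "\U0001F600,\U0001F600".encode("utf-16-be")
print(row.hex())               # d83dde00002cd83dde00

cut = row.index(b",")          # finds the 0x2C inside U+002C's "00 2c" pair
first = row[:cut]              # odd byte count: the comma's leading 00 is stuck on
print(first.hex())             # d83dde0000

try:
    first.decode("utf-16-be")
except UnicodeDecodeError:
    print("first field is no longer valid UTF-16BE")

# UTF-16LE fails the other way: the comma's trailing 00 byte is
# glued onto the front of the *next* field.
row_le = "\U0001F600,\U0001F600".encode("utf-16-le")
second_le = row_le[row_le.index(b",") + 1:]
print(second_le.hex())         # 003dd800de -- leading NUL on the second field
```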
That doesn't make sense. If you're working with UTF-16, why would you slice bytewise? That's like slicing a zip file bytewise and wondering why it got corrupted.
The whole point of the argument for string support at the library level (rather than assuming some equivalence between a string and its underlying byte buffer at the language level) is that fixed-width bytes fundamentally cannot model human-language characters unambiguously, because what a "character" is depends on the encoding/decoding contract of the program manipulating the byte buffer.
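To make that concrete, here's a minimal sketch (my own, not from the post) of the same byte buffer yielding different "characters" under different encoding contracts:

```python
# The same two bytes, two encoding contracts, two different answers
# to "what characters are in here?".
buf = b"\x2c\x00"
print(repr(buf.decode("utf-16-le")))  # ','       -- one character: COMMA
print(repr(buf.decode("latin-1")))    # ',\x00'   -- two characters: COMMA, NUL
```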
Assuming equivalence between 0x2C and `,` stems from the ancient history of ASCII and English, and from the use of C `char` as a mechanism to squeeze performance out of string operations by not properly supporting the full gamut of valid human-language characters.
For a low-level language that might be used to implement protocols, it totally makes sense that foo.len is the length in bytes, because you're pretty much never going to want the number of grapheme clusters at the protocol level. But it doesn't make sense for a language-level .len to be a codepoint count either, because that assumes an encoding, which is fundamentally a business-logic concern.
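The "which length?" ambiguity is easy to demonstrate (a sketch of my own; Python's built-in len happens to count codepoints):

```python
# One "string", three different lengths depending on the unit you count.
s = "e\u0301"                       # 'e' + combining acute: one grapheme cluster
print(len(s))                       # 2 -- codepoints
print(len(s.encode("utf-8")))       # 3 -- bytes in UTF-8
print(len(s.encode("utf-16-le")))   # 4 -- bytes in UTF-16
# The grapheme-cluster count (1 here) requires a Unicode segmentation
# library; it has no fixed relationship to either byte count.
```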
The entire matter in question is whether CSV is encoding-independent and operates on bytes (we're addressing AndyKelley's comment). The answer, clearly demonstrated here, is no: CSV operates on characters, not bytes. You need to decode the Unicode first and let CSV operate on the decoded text, so that it splits on U+002C rather than on the byte 0x2C in the undecoded stream, which destroys the data.
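The correct order of operations can be sketched like this (my own illustration, assuming the same 😀,😀 row from upthread):

```python
# Decode the byte stream first, then let the CSV layer split on the
# character U+002C rather than on a raw 0x2C byte.
blob = "\U0001F600,\U0001F600".encode("utf-16-be")
text = blob.decode("utf-16-be")   # encoding handled once, at the boundary
fields = text.split(",")          # splitting on U+002C in decoded text
print(fields)                     # both fields survive intact
```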