Well, let's try 😀,😀, which is `d83d de00 002c d83d de00` in UTF-16BE. If you extract the first column by naively cutting the blob bytewise before the comma, you end up with `d83d de00 00`, with an extra trailing NUL byte, which is a problem. With UTF-16LE you'd instead prepend a NUL to the second column, which is even worse.
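A quick sketch of that failure mode (my own illustration, not from the post), splitting the UTF-16 byte stream at the first 0x2C byte:

```python
# Naive bytewise split of a UTF-16BE CSV row at the first 0x2C byte.
row = "\U0001F600,\U0001F600".encode("utf-16-be")
print(row.hex())               # d83dde00002cd83dde00

cut = row.index(b",")          # finds the 0x2C inside U+002C's "00 2c" pair
first = row[:cut]              # odd byte count: the comma's leading 00 is stuck on
print(first.hex())             # d83dde0000

try:
    first.decode("utf-16-be")
except UnicodeDecodeError:
    print("first field is no longer valid UTF-16BE")

# UTF-16LE fails the other way: the comma's trailing 00 byte is
# glued onto the front of the *next* field.
row_le = "\U0001F600,\U0001F600".encode("utf-16-le")
second_le = row_le[row_le.index(b",") + 1:]
print(second_le.hex())         # 003dd800de -- leading NUL on the second field
```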
That doesn't make sense. If you're working with UTF-16, why would you slice bytewise? That's like slicing a zip file bytewise and wondering why it got corrupted.
The whole point of the argument for string support at the library level (rather than assuming some equivalence between a string and its underlying byte buffer at the language level) is that fixed-width bytes fundamentally cannot model human-language characters unambiguously, because what a "character" is depends on the encoding/decoding contract of the program manipulating the byte buffer.
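To make that concrete, here's a minimal sketch (my own, not from the post) of the same byte buffer yielding different "characters" under different encoding contracts:

```python
# The same two bytes, two encoding contracts, two different answers
# to "what characters are in here?".
buf = b"\x2c\x00"
print(repr(buf.decode("utf-16-le")))  # ','       -- one character: COMMA
print(repr(buf.decode("latin-1")))    # ',\x00'   -- two characters: COMMA, NUL
```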
Assuming equivalence between 0x2C and `,` stems from the ancient history of ASCII and English, and from the use of C `char` as a mechanism to squeeze performance out of string operations by not properly supporting the full gamut of valid human-language characters.
For a low-level language that might be used to implement protocols, it totally makes sense that foo.len is the length in bytes, because you're pretty much never going to want the number of grapheme clusters at the protocol level. But it doesn't make sense for a language-level .len to be a codepoint count either, because that assumes an encoding, which is fundamentally a business-logic concern.
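The "which length?" ambiguity is easy to demonstrate (a sketch of my own; Python's built-in len happens to count codepoints):

```python
# One "string", three different lengths depending on the unit you count.
s = "e\u0301"                       # 'e' + combining acute: one grapheme cluster
print(len(s))                       # 2 -- codepoints
print(len(s.encode("utf-8")))       # 3 -- bytes in UTF-8
print(len(s.encode("utf-16-le")))   # 4 -- bytes in UTF-16
# The grapheme-cluster count (1 here) requires a Unicode segmentation
# library; it has no fixed relationship to either byte count.
```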
The entire matter in question is whether CSV is encoding-independent and operates on bytes (we're addressing AndyKelley's comment). The answer, clearly demonstrated here, is no: CSV operates on characters, not bytes. You need to decode the Unicode first and let CSV operate on the decoded text, so that it splits on U+002C rather than on the byte 0x2C in the undecoded stream, which destroys the data.
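The correct order of operations can be sketched like this (my own illustration, assuming the same 😀,😀 row from upthread):

```python
# Decode the byte stream first, then let the CSV layer split on the
# character U+002C rather than on a raw 0x2C byte.
blob = "\U0001F600,\U0001F600".encode("utf-16-be")
text = blob.decode("utf-16-be")   # encoding handled once, at the boundary
fields = text.split(",")          # splitting on U+002C in decoded text
print(fields)                     # both fields survive intact
```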