> If I split() a string, does each piece get its own BOM?
No. Conceptually, each piece is just a sequence of code points. The BOM only comes into play when you encode the text into an external byte format. And frankly, I would much rather use UTF-8, explicitly declare that the encoding is UTF-8, and not have to worry about a BOM at all.
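To make that concrete, here's a quick sketch in Python 3 (the sample string and variable names are just mine for illustration). Splitting works on code points, and a BOM only appears when you encode with a codec that writes one:

```python
text = "naïve café"
pieces = text.split()  # plain str objects: code points only, no BOM anywhere

# The BOM shows up at encoding time, and only for codecs that emit one.
utf8_bytes = pieces[0].encode("utf-8")          # no BOM
utf8_sig_bytes = pieces[0].encode("utf-8-sig")  # EF BB BF prepended
utf16_bytes = pieces[0].encode("utf-16")        # BOM prepended, native byte order

print(pieces)
print(utf8_bytes)
print(utf8_sig_bytes[:3])
print(utf16_bytes[:2])
```

Note that plain "utf-8" never adds a BOM, which is one more reason I prefer it.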
> If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?
No. In valid UTF-8, every byte of a multibyte character has the high bit set (0x80 or above), so 0x20 never appears inside one. A space can only be represented as the 0x20 byte, and a 0x20 byte can only be a space. If you've got malformed input, then that's a whole other can of worms.
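You can check this yourself. The following is a small sketch, assuming Python 3 and a sample string I made up; it confirms that the multibyte sequences contain only high-bit bytes and that splitting the raw UTF-8 bytes on 0x20 gives the same pieces as splitting the decoded text:

```python
text = "日本語 テキスト"  # contains exactly one ASCII space (U+0020)
encoded = text.encode("utf-8")

# Collect the bytes of every non-ASCII character and check the high bit.
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
all_high = all(b & 0x80 for b in non_ascii_bytes)

# Splitting the raw bytes on 0x20 cannot cut a character in half.
byte_pieces = [p.decode("utf-8") for p in encoded.split(b" ")]
text_pieces = text.split(" ")

print(all_high)
print(byte_pieces == text_pieces)
```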
> Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C API's that expect zero-terminated strings?
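In UTF-8, the answer is no: the only zero byte is U+0000 itself, so zero-terminated C strings work as long as the text has no embedded NUL. In other encodings (e.g. UTF-16), zero bytes show up all over the place (every ASCII character carries one), so you should not expect to be able to treat the data like ASCII at all.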
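A quick way to see the difference, sketched in Python 3 with an arbitrary test string of my own choosing:

```python
text = "Aé€😀"  # 1-, 2-, 3- and 4-byte characters in UTF-8

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

has_zero_utf8 = 0 in utf8    # no zero byte anywhere in the UTF-8 form
has_zero_utf16 = 0 in utf16  # 'A' alone becomes 41 00 in UTF-16LE

# The only way to get a zero byte in UTF-8 is the NUL character itself.
nul_utf8 = "\x00".encode("utf-8")

print(has_zero_utf8, has_zero_utf16, nul_utf8)
```

So a UTF-16 buffer handed to strlen() and friends will typically get cut off after the first ASCII character.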
> If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?
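When reading external text, you can detect the byte order from the BOM -- byte order, after all, is why you have a byte order mark. If there's no BOM, the byte order has to be declared some other way (e.g. by labelling the data UTF-16LE or UTF-16BE). When converting from your internal format to UTF-16, you pick whatever is most convenient.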
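Here's roughly what BOM detection looks like, as a sketch in Python 3. The `sniff_utf16` helper is hypothetical (my own name, not a library function); the `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants are from the standard library. It only looks at UTF-16 and ignores other encodings:

```python
import codecs

def sniff_utf16(data: bytes) -> str:
    """Hypothetical helper: pick a codec name from a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown"  # no BOM: the byte order must come from elsewhere

le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

print(sniff_utf16(le), sniff_utf16(be))

# The generic "utf-16" codec does the same sniffing and drops the BOM.
decoded_le = le.decode("utf-16")
decoded_be = be.decode("utf-16")
print(decoded_le, decoded_be)
```

In practice you rarely need the helper, since decoding with "utf-16" handles the BOM for you.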
> Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?
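sizeof(char) is 1 by definition in C, on every platform, so there is nothing to check there. What can vary is the number of bits in a char (CHAR_BIT in <limits.h>), and I don't know any popular non-embedded platform on which that isn't 8. That said, it can't hurt to get it Right.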
> What automated tools exist to remove BOM's or change accented characters into regular ones, if other automated tools don't accept Unicode?
In Python, there's a library called "unidecode" which does a pretty good job of punching Unicode text until it turns into ASCII. As for BOMs, decoding with the "utf-8-sig" codec strips a leading UTF-8 BOM if one is present.
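If you'd rather not install anything, here's a standard-library sketch of the same idea. To be clear, this is not unidecode: it's a cruder technique (NFKD decomposition, then dropping combining marks) that handles accented Latin letters but leaves anything without a decomposition untouched. The `strip_accents` name and the sample bytes are mine:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# UTF-8 input with a leading BOM (EF BB BF), as some editors write it.
raw = b"\xef\xbb\xbfcaf\xc3\xa9 na\xc3\xafve"

text = raw.decode("utf-8-sig")    # "utf-8-sig" drops the BOM if present
ascii_text = strip_accents(text)

print(ascii_text)
```

For scripts beyond accented Latin (Cyrillic, CJK and so on), this does nothing useful, and that's where a transliteration library earns its keep.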