> I think on the face of it I do like the Go approach of "everything is a byte string in utf-8" a lot, but I haven't really worked with it so there's probably some horrible pain there somewhere, too.
The problem with "everything is a byte string in utf-8" is simply that it's false. Some byte strings are in UTF-16, some are in Big5, and some aren't text at all. I assume the intention is that all non-UTF-8 input gets converted as soon as possible and all non-UTF-8 output as late as possible; this is essentially the Python 3 idea, except that Python 3 also has a type system that tells you when you messed it up. I've seen Python 2 projects that used this approach, but I prefer to have an exception thrown as soon as I make a mistake (instead of choking on a Chinese HTML file three months later, or throwing up mojibake).
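To make the "exception as soon as I make a mistake" part concrete, here is a rough Python 3 sketch of the decode-early, encode-late pattern. The helper names `decode_input` and `encode_output` and the sample Big5 bytes are made up for illustration; only `bytes.decode`, `str.encode`, and the built-in exceptions are standard.

```python
def decode_input(raw: bytes, encoding: str) -> str:
    """Hypothetical boundary helper: turn incoming bytes into text right away."""
    return raw.decode(encoding)


def encode_output(text: str) -> bytes:
    """Hypothetical boundary helper: turn text into UTF-8 bytes only on the way out."""
    return text.encode("utf-8")


# Pretend this arrived from a file or socket: two Chinese characters in Big5.
raw = "\u4e2d\u6587".encode("big5")

# Decode at the edge, with the encoding the input actually uses.
text = decode_input(raw, "big5")

# Mistake 1: mixing undecoded bytes with text. Python 3 refuses immediately.
mix_error = None
try:
    "<p>" + raw
except TypeError as exc:
    mix_error = exc

# Mistake 2: assuming the bytes are UTF-8. For this input the decoder raises
# instead of silently producing mojibake.
decode_error = None
try:
    decode_input(raw, "utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

# Encode at the edge, as late as possible.
out = encode_output("<p>" + text + "</p>")
```

Both mistakes surface at the line where they are made, not three months later. Note that the second check only fires because these particular Big5 bytes happen to be invalid UTF-8; some legacy-encoded input may decode without error and still be wrong.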