
In this article, Joel says he "decided to do everything internally in UCS-2 (two byte) Unicode". He fell into the same trap that MySQL did and the software he describes would also fail on Emoji.

It is still a very informative piece and Joel was way ahead of the curve by supporting and evangelizing Unicode at all in 2003. But it is not the best article to point the OP at, as it does not mention the BMP or discuss proper handling of characters beyond the BMP.
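To illustrate the point about emoji (a hypothetical sketch, not from Joel's article): emoji live beyond the BMP, above U+FFFF, so they do not fit in a single fixed-width 16-bit UCS-2 unit. In UTF-16 they become a surrogate pair of two 16-bit code units, which is exactly the case a UCS-2 assumption mishandles.

```python
# Sketch: why a UCS-2 (fixed 16-bit) assumption breaks on emoji.
emoji = "\U0001F600"  # 😀, U+1F600

# The code point is above the 0xFFFF ceiling of UCS-2:
print(hex(ord(emoji)))  # 0x1f600

# In UTF-16 it takes a surrogate pair: two 16-bit code units, not one.
utf16 = emoji.encode("utf-16-be")
print(len(utf16) // 2)  # 2
```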



I never got that. He wrote a great article about the issue and then ended with using UCS-2. I always wondered if there was something I missed that made him choose UCS-2 over UTF-8, since UTF-8 can represent every Unicode code point.
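A quick sketch of that property: UTF-8 is variable-width, spending between one and four bytes per character depending on the code point, and it covers the entire Unicode range.

```python
# Sketch: UTF-8 byte lengths grow with the code point,
# covering everything from ASCII to emoji.
for ch in ["A", "é", "€", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
# A is 1 byte, é is 2, € is 3, 😀 is 4.
```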


I imagine it's linked to Win32 and .NET using UCS-2 internally: that way there's no need to convert strings before and after API calls.


Actually, Joel's article is itself outdated in places. For example, he says code points 128 and above are stored using 2, 3, and in fact up to 6 bytes; since RFC 3629 (2003), UTF-8 is limited to at most 4 bytes per character, which is enough to reach the Unicode ceiling of U+10FFFF.

See:

What is the maximum number of bytes for a UTF-8 encoded character?

http://stackoverflow.com/questions/9533258/what-is-the-maxim...
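As a small check of the 4-byte limit discussed there (a sketch, not from the linked answer): the highest valid code point, U+10FFFF, encodes to exactly four bytes in UTF-8.

```python
# Sketch: the top of the Unicode range still fits in 4 UTF-8 bytes.
highest = chr(0x10FFFF)
print(len(highest.encode("utf-8")))  # 4
```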



