
It is probably worth including a character outside the Basic Multilingual Plane (i.e. U+10000 or above, like http://unicodelookup.com/#0x22222/1) when testing UTF-8 web support. I was recently working on a Japanese teaching web application that needed such characters and sadly learned that MySQL versions before 5.5 do not support UTF-8 characters outside the BMP (anything that needs more than 4 UTF-8 octets), and that support in text-to-image drawing libraries was also sketchy.


4 octets of UTF-8 suffice to cover all Unicode characters. Unicode is essentially 21-bit (U+0 to U+10FFFF), not 32-bit. The BMP is 16 bits, U+0 to U+FFFF. 3 octets suffice for it.

It's useful to know that MySQL support outside the BMP doesn't work, but I would guess it's a generic problem affecting all Unicode support, not restricted to UTF-8.

(Yes, UTF-8 was defined to go up to 6 octets and cover 31 bits. As used with Unicode, only up to 4 are supposed to be used...)
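To make the octet counts above concrete, here is a minimal Python sketch. The helper name `utf8_length` and the sample code points are my own choices for illustration; it relies only on Python's built-in UTF-8 encoder.

```python
def utf8_length(code_point):
    """Return how many octets UTF-8 needs for a single Unicode code point."""
    return len(chr(code_point).encode("utf-8"))


# Sample code points chosen for illustration (not from any particular test suite).
samples = [
    ("U+0041, ASCII", 0x41),
    ("U+00E9, two-octet range", 0xE9),
    ("U+FFFF, last code point of the BMP", 0xFFFF),
    ("U+10000, first code point outside the BMP", 0x10000),
    ("U+22222, the example linked above", 0x22222),
    ("U+10FFFF, highest Unicode code point", 0x10FFFF),
]

for label, cp in samples:
    print(f"{label}: {utf8_length(cp)} octet(s)")

# The highest code point fits in 21 bits, which is why 4 octets suffice.
print("bits needed for U+10FFFF:", (0x10FFFF).bit_length())
```

Everything in the BMP comes out at 3 octets or fewer, and everything above it at exactly 4.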


Yes, you are right about 3 vs. 4 octets for characters outside the BMP; it is the 4-octet UTF-8 sequences that MySQL before 5.5 doesn't work with. With MySQL 5.5, the basic LAMP stack at least can now handle non-BMP characters.
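If you are stuck on a store that only handles 3-octet UTF-8, one option is to check input for non-BMP characters before writing it. A rough Python sketch follows; the function names are hypothetical, and it only detects such characters. It does not model what any particular MySQL version does with them.

```python
def non_bmp_chars(text):
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_in_three_octets(text):
    """True if every character in text encodes to at most 3 UTF-8 octets."""
    return not non_bmp_chars(text)


# Hypothetical test string: three BMP kanji plus U+22222, which is outside the BMP.
sample = "日本語" + chr(0x22222)

print(fits_in_three_octets("日本語"))
print(fits_in_three_octets(sample))
print([f"U+{ord(ch):X}" for ch in non_bmp_chars(sample)])
```

A string like `sample` is also a handy fixture for the kind of end-to-end UTF-8 test suggested at the top of the thread.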


(My guess is wrong. They really did have a hardcoded 3 octet pseudo-characterset. Ugh.)



