It is probably worth including a character outside the basic multilingual plane ...

wnoise · on Aug 20, 2011

4 octets of UTF-8 suffice to cover all Unicode characters. Unicode is essentially 21-bit (U+0 to U+10FFFF), not 32-bit. The BMP is 16 bits, U+0 to U+FFFF. 3 octets suffice for it.

It's useful to know that MySQL support outside the BMP doesn't work, but I would guess it's a generic problem affecting all Unicode support, not restricted to UTF-8.

(Yes, UTF-8 was defined to go up to 6 octets and cover 31 bits. As used with Unicode, only up to 4 are supposed to be used...)
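To make the octet counts above concrete, here is a minimal Python sketch that encodes a few sample code points and reports how many UTF-8 octets each needs. The sample labels and the `utf8_len` helper are illustrative names, not anything from the thread.

```python
# Count the UTF-8 octets needed for code points at the range boundaries.
samples = {
    "U+0041 (ASCII)": "\u0041",
    "U+00E9 (two-octet range)": "\u00e9",
    "U+FFFF (last BMP code point)": "\uffff",
    "U+10000 (first non-BMP code point)": "\U00010000",
    "U+10FFFF (last Unicode code point)": "\U0010ffff",
}


def utf8_len(ch: str) -> int:
    """Return the number of octets in the UTF-8 encoding of ch."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} octet(s)")

# The highest code point, U+10FFFF, fits in 21 bits.
print("bits needed for U+10FFFF:", (0x10FFFF).bit_length())
```

Everything up to U+FFFF fits in 3 octets or fewer; anything from U+10000 upward needs 4.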

zaphoyd · on Aug 20, 2011

Yes, you are right about 3 vs. 4 octets for characters outside the BMP; it is 4-octet UTF-8 that MySQL before 5.5 doesn't work with. With MySQL 5.5, the full basic LAMP stack, at least, will now handle non-BMP characters.
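For anyone stuck with a store limited to 3-octet UTF-8, a rough way to spot problem input before it reaches the database is to check for code points above U+FFFF. This is only a sketch: `non_bmp_chars` and `fits_in_three_octets` are hypothetical helper names, and nothing here talks to MySQL itself.

```python
def non_bmp_chars(text: str) -> list:
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_in_three_octets(text: str) -> bool:
    """True if every character of text encodes to at most 3 UTF-8 octets."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


if __name__ == "__main__":
    plain = "h\u00e9llo"         # BMP only
    emoji = "hi \U0001F600"      # contains a non-BMP character
    for s in (plain, emoji):
        print(repr(s), fits_in_three_octets(s), non_bmp_chars(s))
```

The two checks are equivalent, since a character needs 4 octets exactly when its code point is above U+FFFF; both are shown only to tie the octet count back to the BMP boundary.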

wnoise · on Aug 21, 2011

(My guess was wrong. They really did have a hardcoded 3-octet pseudo-character-set. Ugh.)