These three points have driven me raving mad while working with MySQL:
- The default 'latin1' character set is in fact cp1252, not ISO-8859-1, meaning it contains the extra characters in the Windows codepage. 'latin2', however, is ISO-8859-2.
- The 'utf8' character set is limited to Unicode characters that encode to 1-3 bytes in UTF-8. 'utf8mb4' was added in MySQL 5.5.3 and supports up to 4-byte encoded characters. UTF-8 has been defined to encode characters to up to 4 bytes since 2003.
- Neither the 'utf8' nor the 'utf8mb4' character set has any case-sensitive collation other than 'utf8_bin' and 'utf8mb4_bin', which sort characters by their numeric codepoint.
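For anyone who wants to poke at these three points without a MySQL server handy, here is a rough Python sketch. It only uses Python's own codecs and sorting, so it illustrates the encodings involved, not MySQL's actual behavior; `utf8_len` and the sample words are made up for the illustration.

```python
# Point 1: byte 0x80 is the euro sign in cp1252 but a C1 control in ISO-8859-1.
raw = bytes([0x80])
cp1252_char = raw.decode("cp1252")
latin1_char = raw.decode("iso-8859-1")

# Point 2: how many bytes a single character needs in UTF-8.
# Anything that needs 4 bytes does not fit in MySQL's 3-byte 'utf8'.
def utf8_len(ch: str) -> int:
    return len(ch.encode("utf-8"))

# Point 3: ordering by numeric codepoint, which is what a _bin collation
# amounts to. Every uppercase ASCII letter sorts before every lowercase one.
words = ["apple", "Banana", "banana", "Apple"]
by_codepoint = sorted(words)  # Python compares str by codepoint

if __name__ == "__main__":
    print(repr(cp1252_char), repr(latin1_char))
    for ch in ["a", "é", "€", "😀"]:
        print(ch, utf8_len(ch))
    print(by_codepoint)
```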
utf8 being effectively an alias of utf8mb3 has cost us so much work it's not even funny.
> utf8 being effectively an alias of utf8mb3 has cost us so much work it's not even funny.
An extra warning about that mess: mysqldump in many configurations will silently convert utf8mb4 down to utf8mb3. So when you're testing your backups or migrations, do an extra check to make sure that emoji and rarer characters didn't get eaten!
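A quick way to do that extra check, sketched in Python: compare the 4-byte characters in a known sample against what comes back after the dump and restore. The function names are hypothetical, and how you pull the strings out of the database is up to you; this only does the comparison.

```python
def supplementary_chars(text: str) -> set:
    """Characters above U+FFFF, i.e. the ones that need 4 bytes in UTF-8."""
    return {ch for ch in text if ord(ch) > 0xFFFF}

def lost_chars(original: str, restored: str) -> set:
    """4-byte characters present in the original but missing after restore."""
    return supplementary_chars(original) - supplementary_chars(restored)

if __name__ == "__main__":
    before = "hi 😀 𝄞"
    after = "hi ? ?"  # made-up example of a damaged round trip
    print(lost_chars(before, after))
```

An empty set means every emoji and rarer character in the sample survived the round trip.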
I am currently trying to fix a program that was made by a person who didn't know those details of MySQL...
Most weirdly, the fact that the default collation is SWEDISH.
It is a complete freak show. The users kinda got used to it, butchering our language (Portuguese) to use only characters valid in English, hoping MySQL won't barf spectacularly on them.
> Most weirdly, the fact that the default collation is SWEDISH. It is a complete freak show,
Unless you're Swedish, I imagine. Then it's quite handy.
I believe the author of MySQL was Swedish, so to me it all makes sense. It also provides a learning opportunity for people who believe the entire planet operates on ASCII.
> - The default 'latin1' character set is in fact cp1252, not ISO-8859-1, meaning it contains the extra characters in the Windows codepage.
Actually, it's generally saner to assume that people mean Windows-1252 when they say ISO-8859-1. Charset labeling is frequently incorrect, and C1 characters are so infrequently used that seeing one pop up probably means you actually wanted Windows-1252 instead.
Well, the thing is that MySQL's latin1/cp1252 isn't actually code page 1252; it's a tad different. It defines extra characters that CP1252 doesn't know about, so it's not entirely compatible with CP1252 (Windows-1252).
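To make the difference concrete, here is a small Python sketch. As far as I understand the MySQL documentation, the gap is five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 leaves undefined and that MySQL's latin1 maps to the control characters with the same numbers. Treat that mapping as my assumption; `decode_mysql_latin1` is a hypothetical helper, not anything MySQL ships.

```python
# Byte values that Python's strict cp1252 codec refuses to decode.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def strict_cp1252_fails(data: bytes) -> bool:
    """True if Python's cp1252 codec cannot decode the bytes."""
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False

def decode_mysql_latin1(data: bytes) -> str:
    """Decode as cp1252, but map the five undefined bytes to the same-numbered
    codepoints (assumed to match what MySQL's latin1 does)."""
    return "".join(
        chr(b) if b in UNDEFINED_IN_CP1252 else bytes([b]).decode("cp1252")
        for b in data
    )

if __name__ == "__main__":
    sample = b"\x80\x81"
    print(strict_cp1252_fails(sample))
    print(repr(decode_mysql_latin1(sample)))
```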