RE: "whereas Ruby has had decent Unicode support for a while"
Really? I'd love some examples of where Ruby shines at Unicode handling when dealing with web content.
I know a lot of work was done in Ruby 1.9+ to bring decent Unicode encoding support to the language, but I still see a good number of complaints/articles about issues with it.
The encoding issues I've run into with Python 2 have generally come from whatever framework I'm using to ingest the content taking a website's declared encoding at face value, when it either wasn't defined at all or was defined incorrectly.
In the wild, wild world of the web, unless you're doing intelligent data inspection, you're just going to run into that sort of thing.
By the way, here's another problem with taking the "encoding" parameter at face value: you're opening yourself up to DoS or data corruption bugs in the case where someone tells you to use a dangerous encoding.
There have been multiple bugs found in Python's UTF-7 decoder recently, and generally they were found by people who were scraping the Web with Python. These bugs, such as [1], could cause you to write strings that corrupt your data or crash the Python interpreter. And until the latest version -- and this is possibly still the case in all versions of Python 2 -- someone could give you a gzip bomb that decompresses to petabytes of data, and tell you it's in the "gzip" encoding [2].
I'm sure there are more bugs like this out there, and that Ruby has similar lurking bugs as well, given how recently they changed their Unicode system.
Basically, you shouldn't let someone else's Web page tell you what code to run, unless it's code you're planning to run. I recommend making a short list of encodings you trust, including ASCII, UTF-8, UTF-16, ISO-8859-x, Windows-125x, and MacRoman, and maybe a few others if you're working with CJK text, and just rejecting all others.
(The x's can be filled in with digits. Don't accept UTF-7, because it's clearly horrible. And I don't have any particular reason to be suspicious of UTF-32, but I've never seen anyone seriously use it.)
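As a rough illustration of that allowlist idea, here's a minimal sketch in Python 3. The names `TRUSTED`, `is_trusted_encoding`, and `decode_untrusted` are made up for this example, and the exact list is just my reading of the families named above; it leans on `codecs.lookup` to normalize aliases like "latin-1" or "Windows-1252" before comparing.

```python
import codecs


def _canonical(name):
    """Return Python's canonical codec name, or None if the codec is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# Assumed allowlist: ASCII, UTF-8, UTF-16, ISO-8859-x, Windows-125x, MacRoman.
# Candidates that this Python build doesn't know (e.g. ISO-8859-12) are skipped.
_CANDIDATES = (
    ["ascii", "utf-8", "utf-16", "utf-16-le", "utf-16-be", "mac-roman"]
    + ["iso-8859-%d" % n for n in range(1, 17)]
    + ["windows-125%d" % n for n in range(0, 9)]
)
TRUSTED = {c for c in map(_canonical, _CANDIDATES) if c is not None}


def is_trusted_encoding(declared):
    """True only if the declared label resolves to an allowlisted codec."""
    return _canonical(declared) in TRUSTED


def decode_untrusted(data, declared):
    """Decode bytes with the declared encoding, rejecting anything not allowlisted.

    Raises ValueError for untrusted or unknown labels. A UnicodeDecodeError can
    still propagate if the bytes don't match a trusted encoding.
    """
    if not is_trusted_encoding(declared):
        raise ValueError("refusing to decode with encoding %r" % (declared,))
    return data.decode(_canonical(declared))
```

With this, `decode_untrusted(body, "utf8")` works as usual, while a page claiming "utf-7" or "zlib" gets a `ValueError` instead of reaching the decoder at all. What to do after a rejection (guess, fall back to UTF-8, drop the page) is a separate decision this sketch doesn't make.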
In Python, that's why projects like this exist: https://github.com/LuminosoInsight/python-ftfy
They let you correct Unicode content that was decoded with the wrong encoding.
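To make the failure mode concrete, here's a small stdlib-only sketch of what "decoded with the wrong encoding" looks like and how a single layer of it can be reversed by hand. The helper name `undo_wrong_decode` is hypothetical, and it assumes you already know which two encodings were mixed up; as I understand it, the point of ftfy is that it works that out heuristically for you, which this sketch does not attempt.

```python
def undo_wrong_decode(text, wrong="cp1252", right="utf-8"):
    """Reverse one layer of mis-decoding.

    Re-encode the garbled text with the codec that was wrongly used to decode
    it, then decode the resulting bytes with the codec that should have been
    used. Raises UnicodeError if the text wasn't garbled in that way.
    """
    return text.encode(wrong).decode(right)


original = "caf\u00e9"                               # "café"
garbled = original.encode("utf-8").decode("cp1252")  # UTF-8 bytes read as Windows-1252
repaired = undo_wrong_decode(garbled)

print(garbled)   # cafÃ©
print(repaired)  # café
```

The manual round trip only works when the guess about the two encodings is right, which is exactly the part that's hard to get right on real web content.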