RE: "whereas Ruby has had decent Unicode support for a while" Really? I'd love f...

rspeer · on March 11, 2014

Author of ftfy here - thanks for the shoutout.

By the way, here's another problem with taking the "encoding" parameter at face value: you're opening yourself up to DoS or data corruption bugs in the case where someone tells you to use a dangerous encoding.

There have been multiple bugs found in Python's UTF-7 decoder recently, and generally they were found by people who were scraping the Web with Python. These bugs, such as [1], could cause you to write strings that corrupt your data or crash the Python interpreter. And until the latest version -- and this is possibly still the case in all versions of Python 2 -- someone could give you a gzip bomb that decompresses to petabytes of data, and tell you it's in the "gzip" encoding [2].

I'm sure there are more bugs like this out there, and that Ruby has similar lurking bugs as well, given how recently they changed their Unicode system.

Basically, you shouldn't let someone else's Web page tell you what code to run, unless it's code you were already planning to run. I recommend making a short list of encodings you trust (ASCII, UTF-8, UTF-16, ISO-8859-x, Windows-125x, and MacRoman, plus maybe a few others if you're working with CJK text) and rejecting all others.

(The x's can be filled in with digits. Don't accept UTF-7, because it's clearly horrible. And I don't have any particular reason to be suspicious of UTF-32, but I've never seen anyone seriously use it.)
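
For what it's worth, here is a rough sketch of that allowlist in Python 3. The function names `trusted_codec_name` and `decode_untrusted` are made up for illustration, and the regex assumes the canonical names that `codecs.lookup()` reports in CPython (such as "cp1252" for Windows-1252 and "iso8859-1" for Latin-1). Adjust the pattern to whatever set you actually trust.

```python
import codecs
import re

# Allowlist of canonical codec names, as reported by codecs.lookup(...).name.
# Assumption: CPython's canonical spellings ("iso8859-1", "cp1252", "mac-roman").
_TRUSTED = re.compile(
    r"^(ascii|utf-8|utf-16(-le|-be)?|iso8859-\d{1,2}|cp125\d|mac-roman)$"
)


def trusted_codec_name(label):
    """Return the canonical codec name if `label` is allowlisted, else None."""
    try:
        # lookup() resolves aliases, so "latin-1" and "ISO-8859-1" agree.
        name = codecs.lookup(label.strip()).name
    except (LookupError, ValueError):
        # Unknown or malformed label: treat it the same as an untrusted one.
        return None
    return name if _TRUSTED.match(name) else None


def decode_untrusted(data, declared_encoding):
    """Decode bytes only if the declared encoding is on the allowlist."""
    name = trusted_codec_name(declared_encoding)
    if name is None:
        raise ValueError("refusing untrusted encoding: %r" % (declared_encoding,))
    # Replace undecodable bytes instead of raising on slightly broken pages.
    return data.decode(name, errors="replace")
```

Checking the canonical name rather than the raw label means aliases can't sneak past the list, and anything that isn't explicitly allowed (UTF-7, compression codecs, names Python has never heard of) gets rejected the same way.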

[1] http://bugs.python.org/issue19279

[2] http://bugs.python.org/issue20404

hatchoo · on March 11, 2014

Thanks for sharing ftfy. I have had several issues that this little library appears to address perfectly.