You seem to be missing the point of this exercise: HFS+ is a Unicode-aware files...

lmm · on July 11, 2012

>HFS+ is a Unicode-aware filesystem, so the idea that "you want your filenames encoded as ISO-8859-1" is fundamentally invalid and unimplementable

Sure. But it's still the idea that linux would apply, and how linux APIs (that expect a filename to be a stream of bytes) work. I really think this is a linux/locale problem rather than a python problem.

saurik · on July 11, 2012

I will yet again repeat: this excursion into HFS+ semantics on Linux was prompted by thristian's insistence that Python's behavior would handle HFS+'s Unicode behavior when mounted on Linux in the same correct way it does on Mac OS X. This is, in fact, false, which nullifies the argument that this is a filesystem-specific issue.

You seem to be refusing to track this conversation's multiple thoughts: there is the underlying argument "Python 3 is making unreasonable assumptions" with a specific argument "these assumptions are reasonable on OS X" followed by an aside "incidentally, this behavior actually is not related to operating systems but is related to filesystems: as proof I cite HFS+ mounted on Linux" with an error pointed out in the aside "no: in fact HFS+ on Linux has the same behavior as any other filesystem on Linux".

I then separately respond to the point about "these semantics work on OS X" (ceding it, in fact, albeit explicitly remaining skeptical on Windows), saying that the tradeoff of "works worse on Linux" seems like the wrong direction to lean (which is an opinion, of course). I get to assume "works worse on Linux" because my earlier arguments that this is the case were not actually challenged: that on Linux the concept of encodings does not apply to filenames and causes problems like locale-specific sys.path.

However, to make that claim, I need to defend against a new point that is brought up: that thristian believes that an epicycle added to the algorithm (the PUA "save the problem for later" mechanism) is sufficient to mitigate the Linux problems. I claim that it is not, and I bring up a few reasons why (de novo filenames, interop with non-Python systems, existing usages for PUA): reasons which, incidentally, were also discussed as open problems on the Python mailing list.

Finally, I also explained that the PUA solution isn't even being used anymore, but was actually replaced by UTF-8b. As this resolves one of my complaints (existing usages for PUA), I first have to admit that (although I maintain that invalid surrogate pairs are not invalid on Windows, which leads to a similar problem) and then, for clarity, mention that my other arguments are not affected by UTF-8b.
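
For readers unfamiliar with the mechanism, here is a minimal sketch of the escaping scheme, assuming "UTF-8b" refers to what Python 3 exposes as the `surrogateescape` error handler (PEP 383). The byte string is a made-up example. It shows that undecodable bytes survive a round trip, but also that the resulting string cannot be encoded strictly, which is where the interop concern comes from.

```python
# A made-up filename as raw bytes: 0xE9 is "é" in ISO-8859-1,
# but the sequence is not valid UTF-8.
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
print(repr(name))

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode (what a consumer unaware of the scheme would do) fails.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```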

lmm · on July 12, 2012

So, in the interests of being perfectly clear: I am challenging your claim that python 3's approach works worse on Linux; I assert that its semantics under linux are correct (i.e. what a well behaved program running under linux-the-system should do). Conceptually, a program should tell the operating system to save a filename under a given name (unicode string); it is then the operating system's responsibility to translate that to and from bytes on disk.
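
In Python 3 terms, that translation step can be sketched with `os.fsencode` and `os.fsdecode`, which convert between a text name and bytes using the encoding reported by `sys.getfilesystemencoding()`. The filename below is hypothetical, and the encoding printed depends on the platform and locale the code runs under.

```python
import os
import sys

name = "report.txt"  # hypothetical filename, given as text

# The encoding Python will use when turning names into bytes for the OS.
fs_encoding = sys.getfilesystemencoding()

# Text -> bytes, and back again.
as_bytes = os.fsencode(name)
round_tripped = os.fsdecode(as_bytes)

print(fs_encoding, as_bytes, round_tripped)
```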

What you have observed, and demonstrated with your example, is the behaviour of linux running with LANG=fr_FR.ISO-8859-1, which is to represent filenames that contain characters not representable in ISO-8859-1 as ?s. Any well-behaved linux program will exhibit the same behaviour, because it is not program behaviour but OS behaviour. Programs that ignore LANG and do their own filename encoding will appear better under your test, but such programs are misbehaving; by declaring LANG=fr_FR.ISO-8859-1 the user has made an explicit declaration that they wish for their filenames to be encoded as ISO-8859-1, and should expect as much.
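
As an illustration only, the substitution described above looks like this when done with Python's `replace` error handler. The filename is a made-up example, and the sketch makes no claim about which layer (program, libc, or OS) performs the substitution in the original test.

```python
# Made-up filename: "naïve_" followed by two CJK characters (U+65E5, U+672C).
name = "na\u00efve_\u65e5\u672c.txt"

# U+00EF exists in ISO-8859-1 and becomes byte 0xEF;
# the two CJK characters do not, so each becomes "?".
encoded = name.encode("iso-8859-1", errors="replace")
print(encoded)

# The substitution is lossy: decoding does not give back the original name.
decoded = encoded.decode("iso-8859-1")
```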

That filenames of files stored on HFS+ under linux still have linux semantics despite the filesystem's semantics being different is an interesting accident of history but really neither here nor there. The idea that you want your filenames encoded as ISO-8859-1 may indeed be fundamentally invalid and unimplementable on HFS+, but it remains the semantics of setting LANG=fr_FR.ISO-8859-1 on linux, and as such it should be expected that linux would attempt to follow this behaviour as closely as possible.

Really the whole excursion into filesystems is irrelevant. Python 3 behaves correctly on operating systems which provide unicode filenames, i.e. "OSX" and "Linux with a UTF8 locale", and as well as could be expected on operating systems where filenames are only permitted to be strings in a particular encoding, i.e. "Linux with a non-UTF8 locale".