
> there are plenty of other cases where they will change (e.g. precomposed characters that don't have a precomposed upper case).

Just curious, are there still such cases if you only consider NFKD strings/characters?



Yes; the Turkish "I"s under discussion here are the most immediate case, but there are other cases where you have two almost-aliases in one case that aren't present in another case even ignoring composition. E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.
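For anyone who wants to poke at the ohm example, here is a minimal sketch using Python's built-in, locale-independent `str.lower()` and `str.upper()`. The constant and variable names are just illustrative; other languages and libraries may apply different (e.g. locale-aware) mappings.

```python
# Code points written as escapes so the visually identical glyphs stay distinguishable.
OHM_SIGN = "\u2126"        # OHM SIGN
SMALL_OMEGA = "\u03c9"     # GREEK SMALL LETTER OMEGA
CAPITAL_OMEGA = "\u03a9"   # GREEK CAPITAL LETTER OMEGA

# Lowercasing the ohm sign lands on the ordinary small omega...
lowered = OHM_SIGN.lower()

# ...and uppercasing that gives the Greek capital letter, not the ohm sign.
round_tripped = lowered.upper()

print(lowered == SMALL_OMEGA)
print(round_tripped == CAPITAL_OMEGA)
print(round_tripped == OHM_SIGN)
```

So the lower-then-upper round trip does not return the original string, even though both ends render the same.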


> E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.

Except that the NFKD form (which is what I was specifically asking about) of 'OHM SIGN' is 'GREEK CAPITAL LETTER OMEGA'.
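A quick way to check this is Python's standard `unicodedata` module. The sketch below also looks at the dotless "ı" from the Turkish discussion, which as far as I can tell has no decomposition and so still fails a case round trip after NFKD. Variable names are illustrative only.

```python
import unicodedata

OHM_SIGN = "\u2126"
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# NFKD replaces the ohm sign with the ordinary Greek capital omega.
nfkd_ohm = unicodedata.normalize("NFKD", OHM_SIGN)
nfkd_ohm_name = unicodedata.name(nfkd_ohm)

# The dotless i is left alone by NFKD, so normalization does not help here.
nfkd_dotless = unicodedata.normalize("NFKD", DOTLESS_I)

# Default (non-Turkic) casing: upper-casing gives plain "I",
# and lowering that gives dotted "i", not the original character.
dotless_round_trip = nfkd_dotless.upper().lower()

print(nfkd_ohm_name)
print(nfkd_dotless == DOTLESS_I)
print(dotless_round_trip == DOTLESS_I)
```

So the ohm case does disappear under NFKD, but the Turkish "ı" case does not, at least with the default mappings Python uses.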


Ah sorry, I was sure I'd read NFD. Will look more.


Is that actually the right thing to do, or is it another mistake?


As you said and linked elsewhere in the thread, the Unicode Consortium takes the viewpoint that there should be one codepoint for each glyph, even if that glyph has multiple semantic meanings in different languages (e.g. "U"). So by that standard they should probably be the same codepoint, but in that case it's hard to argue that Roman capital I and Turkish capital dotless I should be different codepoints.

Alternatively, you could argue that the ohm symbol shouldn't lowercase to omega, which, maybe. I think the right view is simply that lower- and upper-casing aren't always well defined, are culturally and contextually dependent, and are probably something you should only ever be doing for display, not for semantic purposes. (If you want to do case-insensitive comparisons of strings, Unicode comes with algorithms for that which do a better job than upper- or lower-casing the strings before comparing.)
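As a rough sketch of what that looks like in practice, assuming Python: `str.casefold()` applies Unicode's default case folding, and wrapping it in NFD normalization approximates the standard's "canonical caseless matching". The helper names `caseless_key` and `caseless_equal` are made up for this example, and this is not a locale-aware comparison.

```python
import unicodedata

def caseless_key(s: str) -> str:
    """Approximate canonical caseless matching: NFD, case fold, NFD again."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using the key above."""
    return caseless_key(a) == caseless_key(b)

# Ohm sign, capital omega, and small omega all compare equal.
print(caseless_equal("\u2126", "\u03a9"))
print(caseless_equal("\u2126", "\u03c9"))

# Default folding is not Turkic-aware: dotless "ı" does not match "I".
print(caseless_equal("\u0131", "I"))
```

Note that the last comparison is false under default folding; matching Turkish text the way a Turkish reader expects needs language-specific tailoring, which is the "culturally and contextually dependent" part.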



