> there are plenty of other cases where they will change (e.g. precomposed chara...

lmm · on Jan 16, 2015

Yes; the Turkish "I"s under discussion here are the most immediate case, but there are other cases, even ignoring composition, where two near-aliases exist in one letter case and have no counterpart in the other. E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full Greek alphabet) but no corresponding lowercase codepoint.
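For anyone who wants to check this, here is a minimal Python sketch of the round trip. The variable names are illustrative only, and it assumes an interpreter whose built-in `str.lower()` and `str.upper()` follow the default (locale-independent) Unicode case mappings, as CPython 3 does.

```python
# Round trip: OHM SIGN -> lowercase -> uppercase.
# Escapes are used because the two capital glyphs look identical on screen.
OHM_SIGN = "\u2126"             # OHM SIGN

lowered = OHM_SIGN.lower()      # GREEK SMALL LETTER OMEGA, U+03C9
round_tripped = lowered.upper() # GREEK CAPITAL LETTER OMEGA, U+03A9

for label, ch in [("start", OHM_SIGN), ("lower", lowered), ("upper", round_tripped)]:
    print(f"{label}: U+{ord(ch):04X}")

# The round trip does not return the original codepoint.
print(round_tripped == OHM_SIGN)  # False
```

The characters render the same, but the string you get back is not the string you started with.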

zokier · on Jan 17, 2015

> E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full Greek alphabet) but no corresponding lowercase codepoint.

Except that the NFKD form (which I was specifically asking about) of 'OHM SIGN' is 'GREEK CAPITAL LETTER OMEGA'.
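This is easy to confirm with Python's standard `unicodedata` module. The snippet below is only a quick check, and the variable names are mine. It also normalizes to NFD for comparison, since the decomposition of OHM SIGN in the Unicode data is a canonical one.

```python
import unicodedata

OHM_SIGN = "\u2126"  # OHM SIGN

nfkd = unicodedata.normalize("NFKD", OHM_SIGN)
nfd = unicodedata.normalize("NFD", OHM_SIGN)

print(unicodedata.name(OHM_SIGN))  # OHM SIGN
print(unicodedata.name(nfkd))      # GREEK CAPITAL LETTER OMEGA
print(unicodedata.name(nfd))       # GREEK CAPITAL LETTER OMEGA

# No "<compat>"-style tag in the decomposition field means it is canonical.
print(unicodedata.decomposition(OHM_SIGN))  # 03A9
```

So any of the normalization forms folds the ohm sign into the ordinary capital omega before case mapping is applied.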

lmm · on Jan 17, 2015

Ah sorry, I was sure I'd read NFD. Will look more.

theoh · on Jan 17, 2015

Is that actually the right thing to do, or is it another mistake?

lmm · on Jan 17, 2015

As you said and linked elsewhere in the thread, the Unicode Consortium takes the viewpoint that there should be one codepoint for each glyph, even if that glyph has multiple semantic meanings in different languages (e.g. "U"). So by that standard they should probably be the same codepoint, but in that case it's hard to argue that Roman capital I and Turkish capital dotless I should be different codepoints.

Alternatively you could argue that the ohm symbol shouldn't lowercase to omega, which, maybe. I think the right view is simply that lower- and upper-casing aren't always well defined, are culturally and contextually dependent, and are probably something you should only ever be doing for display, not for semantic purposes. (If you want to do case-insensitive comparisons of strings, Unicode comes with algorithms for that which do a better job than upper- or lower-casing the strings before comparing.)
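As a rough illustration of that last point, here is a sketch of caseless matching in Python built on `str.casefold()` plus NFD normalization, following the pattern shown in the Python `unicodedata` documentation. The function name `caseless_equal` is my own, and this is the default, locale-independent matching only: it applies no Turkish tailoring and is not a full collation algorithm.

```python
import unicodedata


def _nfd(s: str) -> str:
    """Canonically decompose a string."""
    return unicodedata.normalize("NFD", s)


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using case folding with NFD normalization
    applied before and after folding (hypothetical helper name)."""
    return _nfd(_nfd(a).casefold()) == _nfd(_nfd(b).casefold())


# Naive lowercasing misses the German sharp s; case folding does not.
print("Straße".lower() == "STRASSE".lower())  # False
print(caseless_equal("Straße", "STRASSE"))    # True

# OHM SIGN matches GREEK SMALL LETTER OMEGA.
print(caseless_equal("\u2126", "\u03c9"))     # True

# Without locale tailoring, "I" does not match the Turkish dotless "ı" (U+0131).
print(caseless_equal("I", "\u0131"))          # False
```

The last line is the Turkish case again: the default folding can't know which language the text is in, so a locale-aware comparison still needs extra tailoring on top.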