Why? If you can count characters (code points) then it's natural that you can split or substring by characters.
Try this in JavaScript:
'안녕하세요'.substr(2,2)
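For reference, a rough Python 3 equivalent of that slice. This is precomposed (NFC) Korean, where each syllable happens to be a single code point, so code-point slicing lines up with what the user sees:

```python
# Precomposed Korean: each syllable is one code point in NFC,
# so slicing by code points happens to match perceived characters here.
s = '안녕하세요'
print(len(s))    # 5 code points
print(s[2:4])    # '하세', same as JavaScript's substr(2, 2) on this string
```

The catch, as the rest of the thread shows, is that this only works because the string is precomposed.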
Internally, a fixed-length encoding is much faster than a variable-length encoding.
> Unicode does not work that way.
It DOES.
> Splitting on characters is garbage.
You messed up Unicode in Python on so many levels. Those characters you see in the Python console are not actually Unicode. They are just bytes on sys.stdout that happen to be correctly decoded and properly displayed. You should always use the u'' prefix for any kind of characters. '안녕하세요' is WRONG and may lead to unspecified behavior: it depends on your source file encoding, the interpreter encoding, and the sys default encoding; if you display it in a console it depends on the console encoding, and if it's a GUI or HTML widget it depends on the widget or Content-Type encoding.
> I'm not even leaving the BMP and it's broken!
Your unicode-fu is broken. It looks like your example produced identical Korean strings, which the ICU module in Chrome might have auto-normalized for you.
> You can't split decomposed Korean on character boundaries.
Only in a broken Unicode implementation, like the V8 JS engine in the Chrome browser.
> I happen to be using Python 3. It is internally using UCS-4.
I'm sorry but you're wrong. I suggest you inform yourself better of the subject you're talking about before you call people "ignorant morons" next time.
dietrichepp is talking about Normalized Form D, which is a valid form of Unicode and cannot be counted using codepoints like you're doing.
This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again, he's talking about Normalization Form D.
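For anyone following along, here is a sketch of what Normalization Form D does to that Korean string in Python 3: the code-point count changes, and naive code-point slicing can land mid-syllable.

```python
import unicodedata

s = '안녕하세요'
nfd = unicodedata.normalize('NFD', s)   # decompose each syllable into jamo
print(len(s), len(nfd))                  # 5 code points vs 12 code points
# Slicing the decomposed string by code points can split a syllable
# into bare jamo -- the "character boundary" problem described above.
print(nfd[2:4])
# It's still the same text; NFC recomposes it.
print(unicodedata.normalize('NFC', nfd) == s)  # True
```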
I am suggesting that people read about unicode before designing supposedly cross-platform applications or programming languages. It's not that hard, just different than ASCII.
Since you understand Unicode so well, can you explain dietrichepp's theory that Unicode doesn't need counting or offsets?
Unicode doesn't have "characters." If you talk about characters, all you've succeeded in doing is confusing yourself. Leave characters back in ASCII-land where they belong.
Counting code points is stupid. If you like counting code points, go sit in the corner. You don't understand unicode.
You can count graphemes, but it's not going to be easy. And most of the time, I don't see why you would need to do that.
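As a very rough illustration of grapheme counting, here is a crude approximation that merges combining marks into their base character. It handles cases like A + U+0308, but it is nowhere near full UAX #29 segmentation (no conjoining-jamo clustering, no emoji ZWJ sequences), so treat it as a sketch, not a real implementation:

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Count code points that are not combining marks.

    A crude stand-in for real grapheme segmentation (UAX #29);
    use ICU or a dedicated library for anything serious.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(approx_grapheme_count('A\u0308'))    # 1: base letter + combining diaeresis
print(approx_grapheme_count('\u00c4'))     # 1: precomposed form
print(len('A\u0308'), len('\u00c4'))       # 2 vs 1 code points
```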
And why is UCS-4 (not variable-length) chosen in many Unicode implementations? Why is wchar_t always 32-bit on POSIX?
wchar_t is a horrible abomination that begs for death. Nobody should use it. Use UTF-8 instead. I think Python used to use UCS-4, but it doesn't any more. It's a horrible representation because all your strings bloat up by 4x.
Consider the following sequence of code points:
U+0041 U+0308 [edit: corrected sequence]
That equals this European letter: Ä
Two code points, one letter. MAGIC! You can also get the same-looking letter with a single code point using U+00C4 (unicode likes redundancy).
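That redundancy is easy to check in Python 3 with the stdlib unicodedata module:

```python
import unicodedata

decomposed = '\u0041\u0308'   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
precomposed = '\u00c4'        # LATIN CAPITAL LETTER A WITH DIAERESIS

print(decomposed, precomposed)            # both display as Ä
print(len(decomposed), len(precomposed))  # 2 vs 1 code points
print(decomposed == precomposed)          # False: different code points
# Normalizing to NFC makes the two representations compare equal.
print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True
```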
Not all languages have letters. Not all languages that have letters represent each one with a single code point. Please think twice before calling people "morons."
I am responding to your earlier post, which announced that UCS-2 is better than UTF-8 internally because it counts Unicode characters faster than UTF-8. Hopefully now you understand that just taking the number of UCS-2 bytes and dividing by 2 does not give you the number of letters.
Just in case you don't, let's walk through it again.
UCS-2 big-endian representation of Ä:
0x00 0x41 0x03 0x08
Another UCS-2 big-endian representation of Ä:
0x00 0xc4
If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS-2 does not "count unicode characters faster than UTF-8." You still have to look at every byte to see how many letters you have, same as in UTF-8.
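The same arithmetic in Python 3, decoding those big-endian byte sequences (using UTF-16-BE as the stand-in for the UCS-2 byte layout above):

```python
import unicodedata

# Both byte sequences encode the single letter Ä, but bytes / 2
# gives different "character counts": 2 vs 1 code points.
four_bytes = b'\x00\x41\x03\x08'.decode('utf-16-be')  # A + combining diaeresis
two_bytes = b'\x00\xc4'.decode('utf-16-be')           # precomposed Ä

print(len(four_bytes), len(two_bytes))                        # 2 1
print(unicodedata.normalize('NFC', four_bytes) == two_bytes)  # True: same letter
```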
Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.
Yes? `System.Globalization` or ICU can count graphemes; what's your point?
Those libraries are equivalent to: normalize(UTF-16 `0x00 0x41 0x03 0x08`) == length 1
Back to my top comment: I stated that UCS-2 counts faster than UTF-8 internally, because every BMP code point is just two bytes. What's wrong here? If variable-length is so good, why is py3k using UCS-4 internally? (Which means every character is exactly 32 bits. There, I said character again.)
> Back to my top comment: I stated that UCS-2 counts faster than UTF-8 internally
The part cmccabe tries to explain, and which you repeatedly fail to understand, is that UCS-2 counts Unicode code points faster than UTF-8, which is completely useless because "characters" (what the end user sees as a single sub-unit of text) often span multiple code points, so counting code points is essentially a recipe for broken code and nothing else.
> If variable-length is so good why py3k is using UCS-4 internally?
It's not. Before 3.3 it used either UCS-2 or UCS-4 internally, as did Python 2; since Python 3.3 it switches the internal representation on the fly.
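Since CPython 3.3 (PEP 393), the per-character storage really does depend on the widest code point in the string, which is easy to observe with sys.getsizeof (exact byte counts vary by version and platform, so only the relative growth matters):

```python
import sys

ascii_s = 'a' * 100             # Latin-1 range: 1 byte per character
bmp_s = '\uc548' * 100          # BMP code point (안): 2 bytes per character
astral_s = '\U0001f600' * 100   # outside the BMP: 4 bytes per character

print(sys.getsizeof(ascii_s))
print(sys.getsizeof(bmp_s))
print(sys.getsizeof(astral_s))
# The sizes grow roughly 1x / 2x / 4x per character, matching PEP 393.
```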
> Which means every character is exactly 32 bits. There, I said character again.
> I happen to be using Python 3. It is internally using UCS-4.
For the love of the BDFL, read these:
http://www.python.org/dev/peps/pep-0414/
http://docs.python.org/3/whatsnew/3.3.html