> Why do you want to count Unicode characters? Because, there are other coun...

SnowLprd · on Nov 27, 2012

Could you not have communicated your examples without the hostility?

dietrichepp · on Nov 27, 2012

I think you misunderstand. I wasn't asking why Unicode characters should be counted instead of bytes or ASCII characters, I was asking why you would even want to count characters at all.

> I fucking hate you ascii-centric ignorant morons

Nice.

> You ignorant, arrogant fuck.

This is why I quit posting under an alias, so I wouldn't be tempted to say such things.

> display welcome message character by character fro left to right

UTF-16/UCS-4/UCS-2 doesn't solve anything here. Counting characters doesn't help. For example, imagine if you try to print Korean character-by-character. You might get some garbage like this:

    ᄋ
    아
    안
    안ᄂ
    안녀
    안녕
    안녕ᄒ
    안녕하
    안녕하ᄉ
    안녕하세
    안녕하세ᄋ
    안녕하세요

Fixed width encodings do not solve this problem, and UTF-8 does not make this problem more difficult. I am honestly curious why you would need to count characters -- at all -- except for posting to Twitter.

Splitting on characters is garbage. (This example was done in Python 3, so everything is properly encoded, and there is no need to use the 'u' prefix. The 'u' prefix is a nop in Python 3. It is only there for Python 2.x compatibility.)

    >>> x
    '안녕하세요'
    >>> x[2:4]
    'ᆫᄂ'

I tried in the Google Chrome console, too:

    > '안녕하세요'.substr(2,2)
    "하세"
    > '안녕하세요'.substr(2,2)
    "ᆫᄂ"

I'm not even leaving the BMP and it's broken! You seem to be blaming encoding issues but I don't have any issues with encoding. It doesn't matter if Chrome uses UCS-2 or Python uses UCS-4 or UCS-2, what's happening here is entirely expected, and it has everything to do with Jamo and nothing to do with encodings.

    >>> a = '안녕하세요'
    >>> b = '안녕하세요'
    # They only look the same
    >>> len(a)
    5
    >>> len(b)
    12
    >>> def p(x):
    ...     return ' '.join(
                'U+{:04X}'.format(ord(c)) for c in x)
    
    >>> print(' '.join('U+{:04X}'.format(ord(c))
              for c in b))
    >>> print(p(a))
    U+C548 U+B155 U+D558 U+C138 U+C694
    >>> print(p(b))
    U+110B U+1161 U+11AB U+1102 U+1167 U+11BC U+1112 U+1161 U+1109 U+1166 U+110B U+116D

See? Expected, broken behavior you get when splitting on character boundaries.

If you think you can split on character boundaries, you are living in an ASCII world. Unicode does not work that way. Don't think that normalization will solve anything either. (Okay, normalization solves some problems. But it is not a panacea. Some languages have grapheme clusters that cannot be precomposed.)

Fixed-width may be faster for splitting on character boundaries, but splitting on character boundaries only works in the ASCII world.

est · on Nov 27, 2012

> Counting characters doesn't help.

Why? If you can count characters (code points) then it's natural that you can split or substring by characters.

Try this in javascript:

    '안녕하세요'.substr(2,2)

Internally Fixed length encoding is much faster than variable-length encoding.

> Unicode does not work that way.

It DOES.

> Splitting on characters is garbage.

You messed up Unicode in Python in so many levels. Those characters you seen in Python console is, actually not Unicode. These are just bytes in sys stdout that happens be to correctly decoded and properly displayed. You should always use the u'' for any kind of characters. '안녕하세요' is WRONG and may lead to unspecified behaviors, it depends on your source code file encoding, intepreter encoding and sys default encoding, if you display them in console it depends on the console encoding, if it's GUI or HTML widget it depends on the GUI widget or content-type encoding.

> I'm not even leaving the BMP and it's broken!

Your unicode-fu is broken. Looks like your example provided identical Korean strings, which might be ICU module in Chrome auto normalized for you.

> You can't split decomposed Korean on character boundaries.

In a broken unicode implementation, like Chrome browser v8 js engine.

> I happen to be using Python 3. It is internally using UCS-4.

For the love of BDFL read this

http://www.python.org/dev/peps/pep-0414/

http://docs.python.org/3/whatsnew/3.3.html

codeka · on Nov 27, 2012

I'm sorry but you're wrong. I suggest you inform yourself better of the subject you're talking about before you call people "ignorant morons" next time.

dietrichepp is talking about Normalized Form D, which is a valid form of Unicode and cannot be counted using codepoints like you're doing.

Maybe you can try:

'𠀋'.substr(0,1)

est · on Nov 27, 2012

yeah sure why not.

    >>> u'𡘓'[0:1]
    u'\U00021613'

    >>> u'Hi, Mr𡘓'[-1]
    u'\U00021613

    >>> u'𠀋'[0:1]
    u'\U0002000b'

Javascript won't work because UCS2 in js engine, duh.

Actually Javascript is messed up with Unicode string and binary strings, that's why Nodejs invented Buffer

http://nodejs.org/api/buffer.html

codeka · on Nov 27, 2012

You've moved the goalposts:

  u'\U00021613'

This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again, he's talking about Normalization Form D.

est · on Nov 27, 2012

Okay, if it's an explicit combining character what's wrong with explicit character part counting?

You know normalized form is the norm, right?

cmccabe · on Nov 27, 2012

There are four different normalized forms in Unicode. Maybe you should enlighten us about which one you're talking about.

Or just stop embarrassing yourself.

est · on Nov 27, 2012

Reading all of your comments, so you are suggesting a Unicode object should not have len() or substring() ?

A standard like that is totally not embarrassing.

cmccabe · on Nov 27, 2012

I am suggesting that people read about unicode before designing supposedly cross-platform applications or programming languages. It's not that hard, just different than ASCII.

est · on Nov 27, 2012

Since you understand Unicode so well, can you explain dietrichepp's theory that Unicode don't need counting or offsets?

http://news.ycombinator.com/item?id=4834931

And why UCS4 (Not variable-length) is chosen in many Unicode implementations? Why wchar_t is always 32bit in posix?

cmccabe · on Nov 28, 2012

Since you understand Unicode so well, can you explain dietrichepp's theory that Unicode don't need counting or offsets?

Unicode doesn't have "characters." If you talk about characters, all you've succeeded in doing is confusing yourself. Leave characters back in ASCII-land where they belong.

Counting code points is stupid. If you like counting code points, go sit in the corner. You don't understand unicode.

You can count graphemes, but it's not going to be easy. And most of the time, I don't see why you would need to do that.

And why UCS4 (Not variable-length) is chosen in many Unicode implementations? Why wchar_t is always 32bit in posix?

wchat_t is a horrible abomination that begs for death. Nobody should use it. Use UTF-8 instead. I think Python used to use UCS4, but they don't any more. It's a horrible representation because all your strings bloat up by 4x.

cmccabe · on Nov 27, 2012

Code points aren't letters.

Consider the following sequence of code points: U+0041 U+0308 [edit: corrected sequence]

That equals this european letter: Ä

Two code points, one letter. MAGIC! You can also get the same-looking letter with a single code point using U+00C4 (unicode likes redundancy).

Not all languages have letters. Not all languages that have letters represent each one with a single code point. Please think twice before calling people "morons."

est · on Nov 27, 2012

> Two code points, one letter.

Yes I under stand there are million ways to display the same shape using various unicode. But how does that make code point counting impossible?

AND if you explictly using COMBINING DIAERESIS instead of single U+00C4, counting diaeresis separately is wrong somehow?

Why don't we make a law stating that both ae and æ is single letter?

cmccabe · on Nov 27, 2012

I am responding to your earlier post which announced that UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Hopefully now you understand that just taking the number of UCS2 bytes and dividing by 2 does not give you the number of letters.

Just in case you don't, let's walk through it again.

UCS-16 big-endian represenation of Ä:

0x00 0x41 0x03 0x08

Another UCS-16 big-endian representation of Ä:

0x00 0xc4

If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS2 does not "count unicode characters faster than UTF8." You still have to look at every byte to see how many letters you have, same as in UTF-8.

Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.

est · on Nov 27, 2012

Name one Unicode implementation which shows utf16 `0x00 0x41 0x03 0x08` as length 1.

U+4100 U+0803 is two code points by defintion. Thus length == 2.

cmccabe · on Nov 27, 2012

http://stackoverflow.com/questions/4579215/cross-platform-it...

est · on Nov 27, 2012

Yes? `System.Globalizatiion` or `ICU` can count grapheme, what's your point?

Those libraries are equivalent to normalize( utf16 `0x00 0x41 0x03 0x08`) == length 1

Back to my top comment, I stated that UCS2 counts faster than UTf8 internally, because every BMP code point is just two bytes, what's wrong here? If variable-length is so good why py3k is using UCS-4 internally? (Wich means every character is exactly 32 bits. There, I said character again.)

masklinn · on Nov 27, 2012

> Back to my top comment, I stated that UCS2 counts faster than UTf8 internally

The part cmccabe tries to explain, and which you repeatedly fail to understand, is that UCS2 counts unicode code points faster than UTF-8, which is completely useless because "characters" (what the end-user sees as a single sub-unit of text) often spans multiple codepoints, so counting codepoints is essentially a recipe for broken code and nothing else.

> If variable-length is so good why py3k is using UCS-4 internally?

It's not. Before 3.3 it used either UCS2 or UCS4 internally, as did Python 2, since Python 3.3 it switches the internal encoding on the fly.

> Wich means every character is exactly 32 bits. There, I said character again.

yeah and you're wrong again.

est · on Nov 28, 2012

> yeah and you're wrong again.

http://en.wikipedia.org/wiki/UTF-32

UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point.

http://www.unicode.org/faq/utf_bom.html#utf32-1

est · on Nov 27, 2012

> See? Expected, broken behavior you get when splitting on character boundaries.

Yeah, like your Jamo trick is complex for a native CJK speaker.

Thought Jamo is hard? Check out Ideographic Description Sequence. We have like millions of 偏旁部首笔画 that you can freestyle combine with.

And the fun is the relative length of glypes, 土 and 士 is different, only because one line is longer that the other. How would you distinguish that?

But you know what your problem is?

It's like arguing with you that you think ส็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็ is only one character.

IMPOSSIBU?!!!???

And because U+202e exists on the lolternet so we deprive your ability to count 99% normal CJK characters???!??!111!

Combination characters is normalized to single character in most cases, and should be countable and indexable separately.

If you type combination characters EXPLICITLY, they will be counted with each combination, naturally, what's wrong with that?

Or else why don't we abandon Unicode, every country deal with their own weird glype composition shit?

fusiongyro · on Nov 27, 2012

By being unnecessarily insulting you're degrading your own position, and your argument is collapsing under the weight of your anger and sarcasm. I see Dietrich and Colin working with definitions of code point, character and glyph that illuminates why counting one way will lead to problems when you slip into thinking you're counting the other way. Then in your much-too-fired-up responses you conflate them again, and muddy us back to square one.

It seems to me you're deriding us for being native speakers of languages with alphabets, and also deriding us for wanting APIs that prevent developers from alphabet-language backgrounds from making the mistakes our assumptions would incline us towards. You're going to have to decide if you're angry because you like the "simplicity" of UTF-16, because we don't speak a CJK language as well as you do (maybe Dietrich or Colin does; I have no idea) or because you're just angry and this is where you've come to blow off steam. If it's the third, I hope you'll try Reddit first next time, since this kind of behavior seems to be a lot more acceptable there than here.

est · on Nov 27, 2012

For fuck's sake I am not defending UTF16's simplicity, I am defending that:

fixed width can count code points (I worded it as "character") faster than variable-length

Then his dietrichepp tries to educate me two code points combined should be treated equaly with another single code point, WTF y u no normalization?

Downvote me as you like, but you can't change the fact that UCS4 is used internally in Unicode systems.

Any reason other than for faster code point counting?

-----------------

dietrichepp also offended me that unicode characters should not count or offset. QTF:

> Why do you want to count Unicode characters? Why do you care if it is fast to do so? Why would you ever need to use character-based string indexing?

Dylan16807 · on Nov 27, 2012

What in the world is your goal in this conversation? In your impotent rage you've only established that it's useless to count code points, completely counter to your original point in favor of UCS-2.