> Python 3’s approach is unambiguously the worst one, though.
Did I miss the part where he explains this take? It's made up of 5 valid unicode code units. For a language where you're not supposed to need to know the byte size semantics, the correct length should be 5. What am I missing?
The close second is 17, the length in bytes, which is another fine way to represent this data, e.g. what a successful write of some sort would look like to a network or a file.
I guess I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data, vs how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them.
It is wrong that "{emoji}".length == 7 -- but it's wrong because there's no such thing as the 'length' of a string out of context.
A string should be viewed as an opaque data type with views into it depending on what you're trying to do. You can have its length in the context of storage/retrieval/transmission (UTF-8 byte count), its length in the context of parsing (code points), its length in the context of editing (grapheme clusters) or length in the context of display (a bounding box in points when used in conjunction with a specific font and paragraph style attributes).
Claiming to provide an out-of-context length is strictly wrong because there's no such thing. This is where people get confused.
The attribute shouldn't be 'length' it should be something like 'countOfCodePoints' or exposed via a `CodePoints` type view.
It's particularly bad because so often (esp. for western programmers) 'countOfCodePoints' == 'countOfBytesInUTF8' == 'countOfGraphemeClusters' == """length""" so it's hella easy to accidentally write buggy software. Especially for people who don't know the above about unicode, which let's face it, most people don't. Not until they have to explain to their designer why they can't limit a label to '10 characters.' ("What do you mean there's no such thing as a character, and what am I trying to do?").
This is basically the tl;dr of the article but it's also my personal opinion.
All of this isn't about 'wrong' so much as 'imprecise and overloaded terminology making it easy to write buggy software through poor abstractions.'
If python explained which length you were getting, then this article wouldn't exist.
There are three notions of length that make sense:
1. UTF-8 byte length
2. Code point count
3. Extended grapheme cluster count
#3 makes sense for users but it doesn’t make sense for programs which often need to work at the code point level.
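For the article's emoji those three notions give three different answers. A minimal sketch in Python, writing the emoji out as escape sequences (U+1F926 U+1F3FC U+200D U+2642 U+FE0F) and using the third-party `regex` module for grapheme segmentation, since the stdlib has none:

    import regex  # third-party; pip install regex

    s = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"

    print(len(s.encode("utf-8")))        # 1. UTF-8 byte length: 17
    print(len(s))                        # 2. code point count: 5
    print(len(regex.findall(r"\X", s)))  # 3. extended grapheme clusters: 1
                                         #    (with a current Unicode database; older rules gave 2)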
I expect programming language string length to obey the law:
len(a ++ b) = len(a) + len(b)
For example, if I concatenate two strings, one containing an “e” and one containing a combining acute accent, then I expect the length to be longer than a string containing a precomposed ‘é’ character. It’s in fact useful if strings that look the same but have different code points have different lengths, because it tells you that they’re not the same (and maybe you forgot to normalize something etc).
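A quick sketch of that expectation in Python, where len() counts code points:

    a = "e"
    b = "\u0301"     # COMBINING ACUTE ACCENT
    c = "\u00e9"     # precomposed 'é'

    assert len(a + b) == len(a) + len(b)  # the law holds
    print(len(a + b), len(c))             # 2 1: same-looking strings, different lengths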
Code point length is the most useful for people who are actually writing string algorithms based upon Unicode.
UTF-8 length is useful for people who are treating strings as opaque byte sequences, but in that case they should be using a bytes/buffer object and not a string object, except in very low-level languages that don’t want to pay an encoding/decoding cost.
Extended grapheme cluster count is useful for people who are constructing certain kinds of user interfaces, where the number of characters is limited for a policy rather than memory or width reason.
i.e. when length limits are imposed by human policy, grapheme cluster count is the way to go. Length limits for memory reasons should rather be in UTF-8 bytes. If you need a limit for visual width reasons then you need to go measure the string in pixels, otherwise I’m going to put a U+FDFD in there and ruin your day.
Everything that you can do to a Unicode string, except concatenation, is defined in terms of code points. Normalization, case transformations, collation, regexes, layout and rendering and encoding.
For example, let’s say you want to define a “natural sort” order that sorts e.g. “A2” < “A10”. To do that you divide the string at boundaries between code points in ranges of each numeral type that you are supporting (e.g. western numerals, Arabic numerals, Chinese numerals).
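A minimal sketch of that idea in Python, handling only decimal-digit numerals (Unicode category Nd); other numeral types such as Chinese numerals would need their own code point ranges:

    import re

    def natural_key(s):
        # Split at boundaries between runs of decimal digits and everything else.
        # re's \d is Unicode-aware, so Arabic-Indic digits are caught too.
        parts = re.split(r"(\d+)", s)
        return [int(p) if p.isdecimal() else p for p in parts]

    print(sorted(["A10", "A2", "B1"], key=natural_key))  # ['A2', 'A10', 'B1']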
Codepoints is best for collaborative text editing / CRDTs (diamond types, automerge, etc). We generally model the document as a big list of unicode codepoints.
We could use grapheme clusters, but the grapheme cluster boundary points change as unicode evolves, and not all systems update at the same time. Separating strings based on grapheme cluster boundaries also requires a big lookup table to be embedded in every app. Unicode codepoints are obvious, stable, and easy to work with. And they're encoding-agnostic, so there's no weird UCS2-to-UTF8-bytes conversion needed in javascript, C#, etc.
Using code points (or scalar values, I hope) just means that it’s inefficient for everyone, because now everyone has to convert indexes (well, except Python, but it has other problems), instead of only half the people.
Going UTF-8 is fairly clearly superior: it will be the wire format, even if it’s not the language’s string format, so now environments that use UTF-8 strings never need any conversions (apart from decoding escape sequences, most likely).
Much as I hate UTF-16, I would even be inclined to argue that UTF-16 was a better choice than code points, as it will reduce the amount of extra work UTF-16 environments have to do, without changing how much UTF-8 environments have to do at all; but it also has the disadvantage that validation is vanishingly rare in UTF-16, so you’re sure to end up with lone surrogate trouble at some point, whereas UTF-8 tooling has a much stronger culture of validation, so you’re much less likely to encounter it directly and can much more comfortably just declare “valid Unicode only”.
Yes, code points is a purer concept to use. I don’t care: it’s less efficient than choosing UTF-8, which adds negative-to-negligible complexity. Please, just abandon code point indexing and embrace the UTF-8.
The problem with utf8 byte offsets is that it creates a data validation problem. In diamond types I’m using document positions / offsets in my wire format. With utf8 byte offsets, you can receive changes from remote peers which name invalid insertion positions (i.e. an insert inside a character, or deleting half of a codepoint). Validating remote changes received like this is a nightmare, because you need to reconstruct the whole document state to be able to tell if the edit is valid. Using Unicode codepoints makes invalid state unrepresentable, so the validation problem goes away. (You might still need to check that an insert isn’t past the end of the document, but that’s a much easier check).
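A tiny sketch of that failure mode, just to illustrate the point (nothing here is diamond types' actual API):

    doc = "naïve"
    utf8 = doc.encode("utf-8")

    # Byte offset 3 lands in the middle of the two-byte encoding of 'ï',
    # so a remote edit expressed as a byte offset can produce invalid UTF-8:
    broken = utf8[:3] + b"X" + utf8[3:]
    try:
        broken.decode("utf-8")
    except UnicodeDecodeError as e:
        print("invalid UTF-8:", e)

    # A code point index can never split a character apart:
    print(doc[:3] + "X" + doc[3:])  # always a valid (if unwanted) string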
Almost all application programming languages use utf16 anyway, (javascript, c#, swift, Java) so you still need to convert positions anyway. Even in rust it’s common to see line/col positions from text editors.
Using utf8 byte offsets just doesn’t really give you any benefits in exchange for making validation much harder.
> Almost all application programming languages use utf16 anyway, (javascript, c#, swift, Java)
Swift 5 switched to UTF-8: https://www.swift.org/blog/utf8-string/. I’m hopeful that other UTF-16 environments might eventually manage to switch to UTF-8 internally despite retaining some UTF-16 code unit semantics for compatibility; two projects have already demonstrated you can very practically do this sort of thing: Servo from fairly early on (WTF-8 despite the web’s UTF-16 code unit semantics), and PyPy since 7.1 (UTF-8 despite code point semantics, not sure what they do about surrogate code points). I know the web has largely backed away from UTF-16 and uses code point semantics (well, scalar values plus loose surrogates) on almost all new stuff, with good UTF-8 support too.
I contemplated querying that myself, but decided that for CRDT editing purposes, it’s probably never practical to think about grapheme clusters. Given text in a language with syllabic grapheme clusters (e.g. Indic languages), if you start with “ba” (one syllable/grapheme cluster) and edit it to “boca” (two syllables/grapheme clusters), you could say “replaced ‘ba’ with ‘boca’”, but I’d be surprised if any CRDT did it that way if it could instead handle it as “inserted ‘oc’”, even though “oc” mightn’t make sense by itself, linguistically. But Unicode doesn’t do much in the way of defining what makes sense or not, and I don’t think there’s any coherent “bad grapheme clustering” detection algorithm. (Aside: on reflection, my Indic languages example is messier still, since -a probably represents the inherent vowel, so in “ba” → “boca” those “a”s are probably actually represented by the absence of a vowel sign code point—and if you wanted to suppress the inherent vowel, you’d need to add a virama sign. Fun stuff.)
But then again, I know that some CRDTs struggle with interleaving, and maybe grapheme-awareness could help things out in some way or other. I dunno.
Yeah I agree. I think it's inevitable that collaboratively edited documents sometimes end up with grapheme clusters that are considered invalid by some peers, simply because different peers might be using different versions of unicode. If my phone supports the polar bear emoji and yours doesn't, you'll see weird stuff instead of a polar bear. There's no getting around that.
And yes, using unicode codepoints, buggy clients might insert extra unicode characters in the middle of a grapheme cluster. But ... Eh. Fine. I'm not super bothered by that from a data validation perspective.
Why don't I have the same attitude toward invalid UTF8? I mean, the CRDT could synchronize arbitrary arrays of bytes that by agreement contain valid UTF8, and treat it as user error in the same way if that happens? Two reasons. First, because some languages (eg rust) strictly enforce that all strings must contain valid UTF8. So you can't even make a document into a String if it has invalid UTF8. We'd need a try_ codepath, which makes the API worse. Secondly, languages like javascript which store strings using UTF16 don't have an equivalent encoding for invalid UTF8 bytes at all. Javascript would have to store the document internally in a byte array or something, and decode it to a string at the frontend. And that's complex and inefficient. That all sounds much worse to me than just treating the document as a sequence of arbitrary unicode codepoints, which guarantees correctness without any of that mess.
> grapheme clusters that are considered invalid by some peers
I’m not sure if there’s any spec that defines any sequence of Unicode scalar values as “invalid” (though there’s certainly a lot that’s obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness, but just doing something with what it has; so if you inject something into the middle of what it decided was a cluster, it’ll just split it differently.
> treating the document as a sequence of arbitrary unicode codepoints
Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
Yes, my mistake. I do mean scalar values. I am constantly confused about the terminology for unicode. (Unicode code point? Scalar values? (Character?) Surrogate pair? Is there a term for half of a surrogate pair?)
You'll mangle the string if you index and search by code points too, when the string contains the emoji this article is about, or for that matter an "e" followed by a combining acute accent.
The string will be fine if you only move to, split and concatenate at indices outside those grapheme clusters. But that is also true when indexing by bytes or UTF-16 code units.
So in some senses, indexing by bytes is just as good as indexing by code points, but faster. Either way to avoid mangling strings you need to restrict the indices of whatever type to meaningful character boundaries.
If you have decided to avoid string indices inside grapheme clusters, there comes the awkward question of what should you do when editing text in an environment rendered with font ligatures like "->" rendered as → (rightward arrow). From one perspective, that's just a font. From another, the user sees a single character yet there are valid positions (such as from cursor movement and character search) that land mid-way through the character, and editing at those positions changes the character. Neither is clearly best for all situations.
UCS-2 is dead. You can't express Unicode in UCS-2. If you have old UCS-2 data you can just treat it as UTF-16, maybe check for encoding irregularities but if it was really UCS-2 correct Unicode it'll be fine.
UTF-16 length is only useful if you are moving UTF-16, perhaps for interop with other software that chose UTF-16. Remember to pass on your condolences and look forward to a day when we don't do that any more.
> UTF-16 length is only useful if you are moving UTF-16 [...] Remember to pass on your condolences and look forward to a day when we don't do that any more.
Java/the JVM says hi!
Arguably, "how many bytes does this string occupy in memory/on disk (e.g. in class files)" is a pretty useful thing to be able to ask.
I agree, "length" is an ambiguous function name. It should probably not exist and instead you have functions with units in the name: .sizeBytes, .widthCharacters, .widthResAdjPixels, and so on. Back when the world was ASCII you could get away with just .length because the numbers would always be the same, but with Unicode and all of the other complications of the modern world it isn't sufficient.
Yeah; I've recently noticed that almost every time I use string.length in javascript, it's wrong and going to break something as soon as emoji appears. In my code, I always want to deal with either the number of codepoints or the number of UTF8 bytes. String.length gives you neither, but unfortunately it looks correct until you test with non-ASCII strings.
Yeah; it's really confusing but javascript - for legacy reasons - treats strings as "arrays of UCS2 items". But javascript also implements an iterator on strings which iterates through strings by unicode codepoints. That's why "of" loops work differently from "in" loops. (for-of in javascript uses Symbol.iterator). That also means you can pull a string apart into an array of unicode codepoints using [...somestring].
length is not ambiguous at all. It's the number of elements in the array. A string in python3 is an array of unicode code points, so the length of a string is the number of unicode code points. If you want the number of bytes, you need to encode the string in a unicode format (utf8, utf16 or utf32) to get a bytes object, which is an array of bytes. Then you can get the length of that.
Remember, one of the big accomplishments (breaking changes) of python 3 is that all strings are Unicode, not byte arrays. If you want to view a string as bytes, you need to convert the string to bytes. But note the number of bytes depends on the encoding you use (utf8, …).
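For example (just illustrating the point with an arbitrary string), the byte count changes with the encoding while the code point count doesn't:

    s = "héllo"
    print(len(s))                      # 5 code points
    print(len(s.encode("utf-8")))      # 6 bytes
    print(len(s.encode("utf-16-le")))  # 10 bytes (no BOM)
    print(len(s.encode("utf-32-le")))  # 20 bytes (no BOM)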
Interestingly, the number of Unicode codepoints is probably the only measure of a string that is unlikely to ever be relevant to anyone in practice except when it happens to coincide with a different measure.
It can't be used to determine length in bytes (important for storage or network transmission), it can't be used to determine number of displayed characters, it can't be used to safely split a string at some position.
The only reason it has caught on is that it is easy to encode into UTF-8 and UTF-16, and that anything more interesting generally requires a language context and even a font.
I hope that future languages will get rid of this single string abstraction, and instead offer two completely separate types:
- symbol strings, which would only be usable for programming purposes and should probably be limited to ASCII
- text strings, which would be intended for human display purposes, with full Unicode support, and have APIs which answer things like "in the specified Culture, what is the length of human-recognizable characters of this string" or "what is the seventh human-recognizable character in this string in the specified culture"
There's no reason to pay the conceptual cost of Unicode for representing field names or enums (and yes, I don't believe supporting Unicode identifiers is a good idea for a programming language; and note that I am not a native English speaker, and while I do use an alphabet, ASCII is missing some of the letters&symbols I use in my native Romanian). And there's no reason to settle for the misleading safety of Unicode code points when trying to process human displayable text.
The length of an array should correspond to the number of elements. Since each element is a code point, it's the most relevant number if you intend to operate on individual elements. That is, the maximum index corresponds to the length of the array.
If you care about the number of bytes, or to operate on individual bytes, then convert to utf-8,16 or 32, and operate on the bytes object. If you wish to operate on grapheme clusters, then you could probably find some 3rd party Python library that allows you to represent and operate on strings in terms of grapheme clusters.
A string is not an array, it is a chunk of text, for the vast majority of uses of strings. Exactly how that chunk of text is represented in memory and what API it should expose is the discussion we're having. My point is that it shouldn't be exposed as an array of codepoints, since array operations (lengths, indexing, taking a range) are not a very useful way of manipulating text; and even if we did expose them as an array, Unicode code points are definitely not a useful data structure for almost any purpose.
There are basically only two things that can be done with a Unicode codepoint: encode it in bytes for storage, or transform it to a glyph in a particular font or culture.
You can't even compare two sequences of Unicode codepoints for equality in many cases, since there are different ways to represent the same text with Unicode. For example the strings "thá" and "thá" are different in terms of codepoints, but most people would expect to find the second when typing in the first. Even worse, there are codepoints which are supposed to represent different characters, depending on the font being used / the locale of the display (the same Unicode codepoints are used to represent related Chinese, Japanese, or Korean characters, even when these characters are not identical between the three cultures).
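That equality point can be shown with the stdlib's unicodedata module (assuming the two "thá" strings are the NFC and NFD forms, which is the usual way this comes up):

    import unicodedata

    nfc = "th\u00e1"    # 'thá' with a precomposed 'á'
    nfd = "tha\u0301"   # 'thá' with 'a' + COMBINING ACUTE ACCENT

    print(nfc == nfd)                                # False: different code points
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization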
Splitting into ASCII-only and Unicode would be more of a regression than a progression. And yes, the “I’m not a native speaker” is a typical pre-emptive reply, as if it matters (neither am I—doesn’t mean anything by itself).
Let me give an example of why I don't think a single unified string API works. When doing (stringA == stringB), what do you expect to get as a result? Do you expect it to tell if you the two strings represent the exact same codepoints, or do you expect it to tell you whether they represent the same Unicode grapheme clusters, as Unicode recommends?
The answer is of course both, depending on context. You certainly don't want a fuzzy match when, say, decoding a protobuf, but you also don't want a codepoint match when looking up user input.
What most modern languages have settled on is having a Unicode codepoint array type, typically called string or text, and an array of bytes type. However, common string operations are often only provided for the text type, and not the bytes type - which becomes very annoying when doing low level work and using bytes for text, and hoping for simple text operations.
Exactly this. People conflate unicode with encoding quite a bit. I think it was plan9 and early Go that used "runes" as a unit, where one or more runes formed a character and an array of runes could be encoded into bytes using a given encoding.
The in memory size of a rune was just an implementation detail, and while it could be important for the programmer that the size of a rune was 2 bytes, this didn't mean the length of an array of 2 runes was 4.
I always liked the rune unit, and while my memory is hazy I think it was just code points.
I think part of the issue is programmers and apis mixing bit units for in memory representation of a conceptual value mapping (unicode), conceptual characters, stored size when encoded and so on ... without firming up those abstractions with interfaces. It gets lossy.
FWIW, CPython uses one of several Unicode string implementation representations, depending on the code points involved:
>>> import sys
>>> s = 'A' * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
1049
>>> s = '\N{SNOWMAN WITHOUT SNOW}' * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
2074
>>> s = '\N{MUSICAL SYMBOL G CLEF}' * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
4076
See https://peps.python.org/pep-0393/ . Mentioned in the linked-to article with "CPython since 3.3 makes the same idea three-level with code point semantics".
>length is not ambiguous at all. Its the number of elements in the array
That's because you defined it first as "the number of elements in the array".
It is ambiguous however because that's not how people understand it when it comes to strings, and there are several counter-intuitive ways they expect it to behave.
Not to mention there might not be any "array". A string (whatever the encoding / representation) is a chunk of memory, not an array. That you can often use a method to traverse it doesn't mean it's in an array.
The python doc says "str" is an immutable sequence of unicode code points. Since it implements __getitem__, it's fair to call it an array (it has a length, and allows indexing). I couldn't find out in the documentation whether the __getitem__ is O(1), which I consider a deficiency -- this should definitely be well documented.
It doesn't really matter how some people think "how people understand" something, the documentation matters. Any string in any language is some ordered sequence of atomic text-like objects, so python's approach isn't unreasonable or unexpected, either.
>Since it implements __getitem__, its fair to call it an array (it has a length, and allows indexing)
Well, weren't we talking about things being "ambiguous"?
In Python we call what you describe a list. An array is something different. And people would expect something like the C (or the Java) data structure. In Python that would match the "array" lib package.
And that's just discussing the meaning of array - before we even get to whether a string is an array, and what this means.
>It doesn't really matter how some people think "how people understand" something, the documentation matters
In what universe? In practical use, clarity and non-ambiguous, least surprise names and semantics matter.
"But we clarify it in page 2000 of the documentation" is not an excuse. Nor is invoking moral or professional failings of those not reading the documentation. A good library design doesn't offload clearing ambiguity to the documentation.
>Any string in any language is some ordered sequence of atomic text-like objects
You'd be surprised. Especially since this isn't 1985 where strings were a bunch of 8-bit ascii characters, or even 1995, when widechar 16-bit arrays were "good enough" for Windows and Java, but we have not just non-ascii strings, but even variable length (e.g. utf-8) internal strings in mainstream languages.
Tell me a language where a string isn't an ordered sequence of elements of some atomic text-like data type. Those may have different types - like utf8 bytes, bytes, unicode code points, grapheme clusters etc. But these are all some sort of representation of text at some level. Which one a programming language uses depends on the language, and should be checked in the documentation. It's not like some obscure “check page 2000” of the doc type small print, implying that you need to read 18 tomes of language doc before you can work with the language. No, but if you want to work with strings in any programming language, you should know what type the elements consist of.
Btw, python may try to overload the meaning of the words array and list, but the word “array” has a generic meaning in this branch of math called computer science (an ordered sequence of elements indexable in O(1)), which is how I used it here.
Though C# also recommends the Rune APIs for more modern/better code point handling. The Rune APIs have a bit more in common with Python 3's unicode treatment than the classic (and sometimes wrong) UTF-16 approach.
Counting grapheme clusters is a hard problem because it depends on the font that is being used. It only exists at render time in the context of a specific client.
If the user can freely change a font it is impossible to send a string of 3 grapheme clusters because you won't know if it actually will show up as 3 to the client or a different number.
What do you mean there is no such thing as a character when grapheme cluster is exactly that? This is also the out-of-context length, and people get confused because instead of this human-context attribute they've been forced to use all the other alternatives, which require more knowledge.
Characters in context are printable or non-printable/formatting marks right? I agree they probably meant grapheme clusters, but grapheme clusters can vary dramatically in width so the point of the conversation was to explain why a bounding box was a better approximation of their goals.
They do vary in width, but with a proportional font that’s true even with ASCII text. What grapheme clusters tells you is how many times you have to press the arrow key/backspace to get to the beginning of the string.
Only if the text editor made some bad assumptions. You're forgetting about non-printable characters, such as the LTR mark. These are not part of grapheme clusters (or are their own grapheme cluster), but the cursor shouldn't probably stop at them.
You know it's been a long time since this conversation but I think, reflecting, it has to do with grapheme clusters not being particularly consistent across operating systems and over time. The article even has an example where the same 5 USVs come out as either 1 or 2 graphemes depending on the Unicode version.
For the UK, you also need to represent the Celtic languages. You'll need at least these letters: â, ê, î, ô, û, ŵ, ŷ, à, è, ì, ò, ù, ẁ (maybe ỳ?), á, é, í, ó, ú, ẃ (maybe ý?), ï (maybe more ¨), ...
When using latin-1/latin-15/iso-8859-1/iso-8859-15/cp1252 that statement is true. With utf-8 it is two bytes (c2 a3), if a software uses utf-16, ucs-2, etc. it may be more.
(c) Visual width will change depending on the system.
But what about (d) grapheme count? If I make a microblogging site which limits post length to 144 graphemes, can my database invariants break when I upgrade my version of Unicode?
> Not until they have to explain to their designer why they can't limit a label to '10 characters.'
Or in a single font. It's impossible to render any mixed combination of simplified Chinese, traditional Chinese and Japanese with a single font (Korean might be also involved, but not sure about that). Even in Unicode, characters might share the same space which don't have anything common in their looks, nor in their meaning. That applies to the shared CJK space as well.
Btw. Japanese has halfwidth and fullwidth characters.
> It's impossible to render any mixed combination of simplified Chinese, traditional Chinese and Japanese with a single font (Korean might be also involved, but not sure about that)
Well, that could be phrased better. Many such mixed combinations would encounter no problems. There is "Han Unification" in Unicode, in which certain graphical forms are declared equivalent and the intent is that they display as Japanese characters if you print them in a Japanese font, but as Chinese characters if you print them in a Chinese font. 直 is a good example of how that looks; try viewing it in different fonts.
But nobody likes unification and explicit fixed forms are constantly being defined so that it's possible to talk about them. Imagine if I wanted to write "in Old English, the word for dog was hund"... except that your font automatically replaced the sequence hund with a special ligature that looks exactly like dog.
So we have separate unicode points for ⻘ (modern, CJK RADICAL BLUE) and ⾭ (old, KANGXI RADICAL BLUE), and for ⿓ (traditional Chinese, KANGXI RADICAL DRAGON), ⻰ (simplified Chinese, CJK RADICAL C-SIMPLIFIED DRAGON), and ⻯ (Japanese, CJK RADICAL J-SIMPLIFIED DRAGON). Interestingly, the dragon characters are all considered different according to the original "Han Unified" specification, where they are CJK UNIFIED IDEOGRAPH 9F8D, CJK UNIFIED IDEOGRAPH 9F99, and CJK UNIFIED IDEOGRAPH 7ADC. In contrast, there is only the one "unified" form of 直, CJK UNIFIED IDEOGRAPH 76F4, but you can refer to its Chinese form explicitly with CJK COMPATIBILITY IDEOGRAPH FAA8 and to its Japanese form with CJK COMPATIBILITY IDEOGRAPH 2F940. (My browser font fails to render either of those.)
It was never possible to rely entirely on the font to handle dealing with simplified vs traditional characters for you, for the obvious reason that their mapping is not one-to-one. In simplified Chinese, 后 means "after"† or "behind" and it also means "empress". In traditional Chinese, "after" and "behind" would be 後. And "empress" would be... 后. This means there can be no way for a traditional Chinese font to determine what it should display if you write 后.
Ultratraditional Korean hanja participate in the same variation of forms that we see between Chinese and Japanese. But it isn't normal to write Korean in hanja outside of very specific contexts. Hangul are radically different and belong to a separate part of unicode entirely.
† "After" in time. "After" in sequence is 下, "below".
>It's particularly bad because so often (esp. for western programmers) 'countOfCodePoints' == 'countOfBytesInUTF8' == 'countOfGraphemeClusters' == """length""" so it's hella easy to accidentally write buggy software.
Then programmers will pick a random view and assume its length equals the number of characters and bytes. Also the grapheme view will introduce an OS-dependent bug.
I wanted to brainfart that the length in the typical assumed usage should be 1, ignoring the emoji's inner Unicode encoding... But your comment was spot on and showed me my own assumption would fall into exactly this view scheme.
5 makes perfect sense to me; the author's complaints seem kinda silly.
An area this makes sense is, what do you expect to get if you do something like:
emoji = " "
print(emoji[:3])
Should this throw an error because there's only one displayed "character"? Should it return only a partial codepoint by returning only the byte data for the first 3 bytes?
Modern strings are complex objects that have evolved a bit past char[] or byte[].
> Strings are just an array of unicode codepoints rather than "characters", so all I'm doing is asking for the first three of those codepoints.
"Ice trays are just a pile of molecules rather than "cubes", so all I'm doing is separating those molecules", he states as he activates the igniter.
> Substring is a broken operation? What's the justification for that idea?
You take a thing and you mangle beyond recognition without regards for its purpose or meaning. That's like considering the jaws of life a normal part of opening a door to take a piss at work.
I think this is where the misunderstanding comes in. Python doesn't treat strings as char[] but as essentially unicode_codepoint[].
Whether this is a good idea on the whole is debatable, there's even a full PEP talking about the security concerns around doing it this way[1].
However, given this is how it works, the behaviour displayed makes complete sense to me and is the best of the bad choices presented by needing multi-byte strings.
> For a language where you're not supposed to need to know the byte size semantics, the correct length should be 5. What am I missing?
In the words of the article: “The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing.”
“Not needing to know the byte size semantics” seems reasonable, but it simply isn’t a useful goal. The things it makes easier or faster (knowing how many code points there are, and O(1) indexing by code point) are things you shouldn’t be doing—and when you have to interact with the rest of the world, you now have a more expensive encoding step that is always needed, rather than just sometimes if you’d chosen UTF-8 or even UTF-16.
... but maybe it simplifies and speeds up the internal processing? I haven't looked at Python 3's C implementation of strings, but that is a guess. Also, IIRC, Python 3 has the ability to keep different internal representations of strings and uses the most compact one. If all characters are 7-bit ASCII, it uses a bytes representation. That's what I remember from Python dev discussions long ago.
But the overall tone of the article still reads as bashing. Caring about the internal representation of strings and bashing UTF-32 feels lame and angry. (Especially if I'm right about the multi-rep nature of Python 3: their choice is good for most text, and they could add a UTF-8 internal rep in the future, although that would probably break enough code that expects the UTF-32 value for len() that it is not worth it.)
No. Python internally would be made much faster by working on pure UTF-8. Absolutely nothing internal to the language uses the operations that code point semantics speeds up.
Since you mention the varying internal representation of strings: that’s PEP 393 <https://peps.python.org/pep-0393/>, which landed in CPython 3.3, and it generally made things slower by introducing a lot of branching and reallocating and such, though it does speed up some cases due to having to touch less memory, and some methods due to being able to quickly rule out possibilities (e.g. str.isascii can immediately return False for a canonical UCS-2 or UCS-4 string, since if they were ASCII they’d have been of the Latin-1 kind).
PEP 393 was done because people were complaining about how much memory their UCS-4 encoding had been using.
Note also how PEP 393 retains code point semantics: Latin-1 (Unicode values 0–255), UCS-2 or UCS-4; all fixed-width encodings of code point sequences. PEP 393 does also allow a string to cache UTF-8 representation (see PyCompactUnicodeObject.{utf8, utf8_length}), choosing “UTF-8 as the recommended way of exposing strings to C code”, but I gather this isn’t used very much.
> I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data, vs how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them.
Well, why not? There are a lot of things that people would want to call string.length for — drawing little equals signs under text in a terminal, for a frivolous example — where that’s the whole reason they’re making the call. Off the top of my head I’m not really sure how you solve that with variable-width characters if there’s no way to separate out or count them.
Who said there shouldn't be one? The point is there should be more than one, and that not all are a language/string library-level concern.
The context is "I guess I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data, vs how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them."
This means that shouldn't be some generic "length" method, but appropriate separate-concerns methods (plural), some of which (e.g. regarding character width in pixels when rendered) even belong to a drawing lib and not the language at all.
The parent's point is that length (bytes), characters (count), and glyphs (size, shape) are different concerns. The latter would concern a drawing lib or a renderer, but not be a core string method (which should concern itself with the abstract notion of characters and the concrete notion of bytes).
As far as I can tell, you're only missing two things:
1. It's five "Unicode scalars," that's the name for the top-level logical unit. The term "code points" technically refers to a lower-level concept, one that varies across encodings, just not as much as the number of bytes. I didn't know that, and it's the helpful thing I learned from this article. UPDATE: And it's also not true, sorry. "code units" are the lower-level concept from the article, "code points" are a more expansive category at the same level: https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#G740...
2. The author takes it as an unstated assumption that top-level logical structure is useless because any specific usage either ignores all structure or has a point at which low-level structure comes into play. (That assumption is false: Top-level structure is useful for keeping track of what you are doing and as a sort of "common currency" for translating between different low level representations. For example, see the very first table in the article.)
> The close second being 17, because length in bytes. Is another fine way to represent this data, e.g. what a successful write of some sort would look like. Network or file.
Almost. 17 is the number of bytes it occupies in memory. But you don't generally dump memory directly to disk or network. It happens to make sense (and it's convenient) for utf8 strings. But it's better to be explicit about that. Python is better. If you care about bytes, say you care about bytes:
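Something like this (writing the article's emoji out as escape sequences):

    s = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"
    print(len(s))                  # 5: code points
    print(len(s.encode("utf-8")))  # 17: the bytes you'd actually write to a file or socket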
I think he meant to bring out defensiveness with that quip. He never says that it's a big deal, just that it's the worst way to get the length of a string containing emoji, presumably of the mainstream languages.
> Note about Python 3 added on 2019-09-09: Originally this article claimed that Python 3 guaranteed UTF-32 validity. This was in error. Python 3 guarantees that the units of the string stay within the Unicode code point range but does not guarantee the absence of surrogates. It not only allows unpaired surrogates, which might be explained by wishing to be compatible with the value space of potentially-invalid UTF-16, but Python 3 allows materializing even surrogate pairs, which is a truly bizarre design. The previous conclusions stand with the added conclusion that Python 3 is even more messed up than I thought!
You can’t use that information to know how much actual space it takes (in storage) as nobody sane stores UTF-32, you can’t use it to know how much logical space it takes (aka the user’s interpretation), you can’t use it to know how much visual space it takes (not that you can ever get that), and you can’t use it to segment or process the text.
A length in codepoints gives you nothing that’s really actionable, at least not that you’d need outside of a context where you could easily obtain it otherwise.
> It is useful: When iterating over a string in Python (which I hope you agree _is_ useful?), you get that many parts.
That’s… not useful?
I can’t say I remember ever caring how many items I would be getting during an iteration[0]. If I want to set an iteration limit I can just… do that, using `islice` or some such.
[0] in python anyway, in lower level language there can be a utility in order to pre-allocate an output collection
What is useful? What should it be? Don't say bytes because there is already an idiomatic way to get bytes: `len(bytes(s, enc))` which is both more correct and explicit.
Maybe it should just return None because the only useful thing is probably how much "space" it occupies on screen in a fixed-width font, but that's too difficult to know.
That's not true though. It just counts the number of code units, that's not version dependent. It's certainly no worse than counting the number of UTF-16 points (I'd argue it's better since it's less arbitrary - whether something is a unicode scalar is a design decision, whether something is in the BMP or not is mostly an accident of implementation).
I'm not a fan of "everything you know about X is wrong" articles. Very often they try to present some little tidbit of knowledge as a revelation and mislead the reader in the process.
In this case, the tidbit is: "grapheme clusters exist and they are useful".
The misleading part is that the article draws a false equivalence between what the author calls "UTF-32 code units" and UTF-16 code units.
UTF-32 code units are Unicode code points. This is a general Unicode concept that exists in all Unicode encodings. UTF-16 code units, on the other hand, are an implementation detail of UTF-16. It is wrong to present them as equally arbitrary concepts.
> UTF-16 code units, on the other hand, are an implementation detail of UTF-16.
Would that it were only so. Instead, UTF-16 ruined Unicode for everyone with the abomination that is surrogates, and almost nothing that deals with UTF-16 actually asserts well-formedness, and ill-formed UTF-16 cannot be represented in UTF-8 or UTF-32.
UTF-32 and UTF-8 code units are truly implementation details of their encodings, as other encodings don’t need to know about them in any way. UTF-32’s code units are a trivial mapping between scalar values and 32-bit values (not four-byte values, given the big- and little-endian variants), but that still causes UTF-32 code units to be semantically distinct from Unicode scalar values. U+12345 is a Unicode scalar value and doesn’t have any “size”: it’s an abstract value. 0x00012345 is a UTF-32 code unit, a 32-bit value.
If you’re talking about encoding of Unicode scalar values, you talk about code units. Even when talking about UTF-32, the code unit/scalar value semantic distinction is worth maintaining.
> Instead, UTF-16 ruined Unicode for everyone with the abomination that is surrogates,
UTF-16 is a hack. Unicode originally thought 65,536 values should be enough to represent all human languages and so 16-bit fixed size characters would work. However, that proved incorrect. UTF-16 was a hack to retrofit the larger code space onto systems that had already adopted 16-bit characters (Java, Windows NT, etc).
I don't see what's inherently wrong with UTF-16 surrogates. If I am not wrong, a given UTF-16 code unit is unambiguously either a complete code point, a first surrogate, or a second surrogate.
Why should we expect invalid utf-16 strings to be representable in utf-8 or 32? I don't see anyone trying to represent invalid utf-8 in utf-16 or 32.
> Why should we expect invalid utf-16 strings to be representable in utf-8 or 32?
We shouldn't care. UTF-16 should just be an encoding and its internal details shouldn't leak into Unicode code points. There's just no good reason to exclude code points U+D800–U+DFFF merely because 0xD800–0xDFFF happen to be used specially in UTF-16 encoding, just like U+0080–U+00FF aren't excluded merely because (most of) 0x80–0xFF are used in UTF-8 encoding.
Is having a hole from U+D800 to U+DFFF such a big deal? The parent comment was specifically talking about surrogate pairs. That to me looks more like buggy implementation issue rather than standards issue.
As a hole, it would only be annoying and a performance penalty for validation. But by its very design, it will leak, and it does in such ways that it became the worst thing to ever happen to Unicode. I don’t know of a single language or library that uses UTF-16 for strings that validates strings: every last one actually uses sequences of UTF-16 code units, potentially ill-formed, and has APIs that guarantee this will leak to other systems. This has caused a lot of trouble for environments that then try to work with the vastly more sensible UTF-8 (the only credible alternative for interchange). Servo, for example, wanted to work in UTF-8, for massive memory savings and performance improvements, but the web has built on and depends on UTF-16 code unit semantics so much that they had to invent WTF-8, which is basically “UTF-8 but with that hole filled in” (well, actually it’s more complicated: half filled in, permitting only unpaired surrogates, so that you still have only one representation).
So: the problem is that the Unicode standard was compromised for the sake of a buggy encoding (they should instead have written UCS-2 off as a failed experiment), and every implementation that uses that buggy encoding is itself buggy, and that bugginess has made it into many other standards (e.g. ECMAScript).
That’s one of the two situations I speak of: when it happens in practice.
The other is… well, much the same really, but when it makes it into specs that others have to care about. The web platform demonstrates this clearly: just about everything is defined with strings being sequences of UTF-16 code units (though increasingly new stuff uses UTF-8), so then other things wanting to integrate have to decide how to handle that, if their view of strings is different: whether to be lossy (decode/encode using REPLACEMENT CHARACTER substitution on error), or inconvenient (use a different, non-native string type). Rust has certainly been afflicted by this in a number of cases and ways, generally favouring correctness.
The main issue is that it adds validation code (if one is sticking to the standard) for things that don't care about UTF-16 at all.
It does occupy 1/32 of the BMP, displacing a couple thousand potential actual characters (making them take an extra byte in UTF-8, and an extra two in UTF-16).
The UTF-32 equivalent is just the original UCS-4 — simply not enforcing any restrictions on the 32-bit value. Probably most code using UTF-32 does this, at least internally. (I can understand using high bits for metadata or non-Unicode points and have done so, but I don't see any reason for testing for surrogates outside of encoding/decoding UTF-16; they are indeed an abomination.)
>They’re not. UTF-32 code units have a 1:1 mapping to USVs, surrogates are not valid.
This is true, although very pedantic and irrelevant to the point of my comment. The distinction only matters when you're dealing with ill-formed strings.
BTW, Python strings can store surrogates.
>Is it? It’s not like they’re any more useful. Arguably less so, UTF-16 is at least a somewhat common storage medium.
If you aren't directly dealing with UTF-16, UTF-16 code units aren't useful at all.
Code points/USVs, OTOH, are the building blocks of Unicode strings and various Unicode algorithms operate on them. They're low-level, but not useless.
> I'm not a fan of "everything you know about X is wrong" articles.
But it’s not. That style is about tone and the article doesn’t exude that kind of tone.
Do you see the author scolding programmers for being ignorant Americans, for having unknown unknowns, or for not being “professionals”? Well, me neither.
In Julia, iterating over a string by default behaves like `each_char`.
`codeunits(str)` lets you access the underlying code units, which is bytes for the default UTF-8 encoding. (External packages implement UTF-16 and others, and there `codeunits` could return non-bytes, for eg. 16-bit values for UTF-16.)
The Unicode stdlib provides `graphemes(str)`, the equivalent of `each_grapheme_cluster`.
That means 7 is also a measure of bytes, just slightly more awkward. So it's roughly on par with 17.
For 5, the idea is that while you might want to iterate code points, the total number of code points is less useful than either grapheme count or byte count. I think that argument makes sense.
> That means 7 is also a measure of bytes, just slightly more awkward.
It's not a real measure of bytes though. It's the count of bytes in an encoding scheme that is (probably) neither what you use to communicate with the outside world nor what your language runtime uses. (And certainly it's no better than 5, since that's also a measure of bytes in a particular encoding).
Lots of systems use UTF-16 internally and externally. Counting bytes in UTF-16 is, on average, almost as useful as counting bytes in UTF-8.
I don't think just about anything communicates in UTF-32. 5 is basically just a codepoint count, and as such I don't think its usefulness rating should be between the byte counts.
> Lots of systems use UTF-16 internally and externally. Counting bytes in UTF-16 is, on average, almost as useful as counting bytes in UTF-8.
Not my experience at all. The article points out that even languages that are committed to an UTF-16 interface prefer to use other internal storage representations, and I can't remember the last time I saw it used in a transfer format.
I hate UTF-16 and the systems that use it with a passion, but...
Windows and Java (and Javascript) adopted unicode at a time when it was thought that 64k code points would be enough for everyone. Then they prioritized backwards compatibility over anything else. Most of us have benefited from their insistence on backwards compatibility in some form or the other, so I'm really not in a position to complain about it :-/
That said, IMHO any "length" property (as opposed to `codepoints` or `bytes`) on a UTF-16 string should definitely be deprecated.
Windows, Java, C#, javascript, a surprising number of XML documents (though less so as time marches on thankfully), ICU I think uses UTF-16 internally (for the same historical reasons as the other 4), Joliet file names are UCS-2, some phones interpret “16-bit” SMS as UTF-16 (the spec says UCS-2).
> and BOTH of those are insane for sticking to it
They don’t really have much of a choice because they exposed those semantics as part of the string interface (or for Windows the interaction is so low level it can’t be hidden), and they have performance guarantees and behaviours which match that.
It’s also why Python uses UTF-32, and went through the entire PEP 393 / flexible string representation complication to try and stop blowing up memory left and right: the core team considered that switching strings to UTF8 was a bridge too far.
There are approximate solutions, but they come with their own costs and complications (e.g. pypy uses UTF8 strings with lazily constructed indices to emulate UTF-32 strings).
I'm not a Windows based programmer, but couldn't they leave the old APIs in place, but make UTF-8 safe versions available for everyone and switch to that... e.g. with Win 11?
You can set the system codepage to CP_UTF8 since Win 10, I guess, although IIRC it still doesn't work for input. But a) there is a lot of programs using A() functions that don't expect that and break in subtle ways, e.g. DBCS-encoding-aware programs suddenly break because they don't expect a codepoint to span for more than 2 bytes; b) most of the sanely written programs either use UTF-16 explicitly, or use UTF-8 internally and convert between UTF-8 and UTF-16 before/after calling W() functions.
The JavaScript language forces utf16 (whether or not v8 uses that representation under the hood). For instance if you want to substring, the indexes you pass are UTF-16 code units.
I think that argument makes as much sense as saying that an engine is less useful than a car. And pretending that engine.weight should return the weight of the car.
It makes just as much sense as 17 (for utf8) in a JavaScript context, where charCodeAt(i) returns a UTF-16 code unit, and strings at least behave as though the implementation uses an array of uint16_t for the storage. UTF-16 is definitely not my favorite representation, but given that context (which the language imposes) 7 is an important number to be able to know.
Java loaded full unicode code point semantics into its standard `java.lang.String` class. These _are not guaranteed_ to have `O(1)` performance characteristics, because the underlying storage format is dynamically either a UTF-16-esque variant (with surrogate pairs for characters that don't fit in 16 bits), or a single-byte-per-char format if every character fits in Latin-1. This has the advantage of being very very slightly more obvious, given that both methods exist and are documented:
void main() {
    String x = "(that emoji here)";
    System.out.println("Chars: " + x.length());
    System.out.println("Codepoints: " + x.codePointCount(0, x.length()));
    System.out.println("As stream of chars (= UTF16-esque with surrogate pairs):");
    x.chars().forEach(System.out::println);
    System.out.println("As a stream of codepoints:");
    x.codePoints().forEach(System.out::println);
}
This ends up printing:
Chars: 7
Codepoints: 5
As stream of chars (= UTF16-esque with surrogate pairs):
55358
56614
55356
57340
8205
9794
65039
As a stream of codepoints:
129318
127996
8205
9794
65039
NB: Apparently many hackernews readers know java but don't use it all that often day-to-day. The provided java snippet is vanilla valid and can be executed with `java ThatFile.java` (no need to compile it first), though it does use preview features.
The fact that the codepoint counter is a very awkward `codePointCount` call has the dubious benefit of highlighting that this method loops through and therefore would be quite slow on very large strings.
Don't you still need the `java --source 11 ${filename_without_java_extension_because_JEP_330}` to use it? And you still need a wrapper class with a static method main in it.
I was a little puzzled by this compared to what I was used to with Java in the past. It looks like the grandparent's code relies on JEP 445 ( https://openjdk.org/jeps/445 ) which is a preview feature as was mentioned but it also apparently requires the very latest Java 21 which hasn't even been officially released yet.
> And you still need a wrapper class with a static method main in it
One of the preview features he's using is JEP 445[1] that allows you to omit the wrapper class, as well as the arguments to main and the public and static modifiers.
I encountered some real world unicode/emoji breakdown recently. I set my surname in a webapp to an emoji country flag because I needed a way to communicate where I was. Elsewhere in the app, it showed surnames as just their initial, e.g. "John S". There, mine showed as a featureless black flag rather than the flag I set. Presumably because that is the first codepoint of several that make up the flag.
> There, mine showed as a featureless black flag rather than the flag I set. Presumably because that is the first codepoint of several that make up the flag.
The country flags are each made of two Unicode code points, which Unicode calls Regional Indicator Symbols. There are twenty six, one for each of the Latin capital letters A through Z. These are used to encode a flag by writing the ISO two letter country code from ISO-3166-1 e.g. F + R is France, you get a French flag.
Given your black flag experience, and the fact this is an English language forum, I'd guess maybe you wanted a flag for some entity that isn't a UN member state or some sort of recognised similar entity (e.g. the European flag EU symbolising the continent of Europe) and thus doesn't have an ISO two letter code, such as California or Wales. Those are built from a waving black flag plus their long ISO-3166-2 region code.
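For the two-letter case, here's a sketch of how those flags are put together (assuming an ISO 3166-1 alpha-2 code; `flag` is just a made-up helper name), which also shows why taking the "initial" of such a surname breaks it:

    def flag(alpha2):
        # Regional Indicator Symbols run from U+1F1E6 ('A') to U+1F1FF ('Z').
        return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

    fr = flag("FR")
    print(fr)       # renders as the French flag, built from two code points
    print(len(fr))  # 2
    print(fr[0])    # a lone REGIONAL INDICATOR SYMBOL LETTER F: no longer a flag

The subdivision flags (Wales, etc.) work similarly but start from the waving black flag followed by tag characters, which is why slicing one down to its first code point leaves just a featureless black flag.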
Python 3’s approach snatched defeat from the jaws of victory.
They aimed to work with a nice, clean, abstract concept, untrammelled by encoding squabbles. They failed badly by choosing code points rather than scalar values (Unicode strings are sequences of scalar values, not code points—'\udead' is a valid Python string, but you can’t encode it into any UTF-* format since [U+DEAD] is not a valid Unicode string).
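A quick demonstration of that point in Python:

    s = "\udead"      # a lone surrogate: a perfectly legal Python 3 str
    print(len(s))     # 1
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)      # surrogates are not allowed in any UTF-* encoding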
Then they also neglected to observe that they were optimising for something that you should practically never be doing, so that now everyone has to pay the costs. As the article summarises it part-way through: “The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing.”
Seriously, Python 3’s approach is almost the worst of all available worlds. I loathe UTF-16 with such fiery passion that I can’t quite bring myself to say Python 3’s approach is worse than weak UTF-16, but it’s of similar badness in practical terms. The decisions were very clearly made by people that were not expert in the domain and who were caught up in a Concept of Mathematical Purity. They’ve since walked some of it back as far as they could, and I think did recognise it all as a mistake (no citation, just a vague memory of seeing such an admission), but they can’t fix it all properly without a breaking change.
> Unicode defines text as a sequence of code points.
Does it? Do you have a link?
[edit] I looked up the spec and here is what it says.
> The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters. An encoded character is represented by a number from 0 to 10FFFF₁₆, called a code point. A text element, in turn, is represented by a sequence of one or more encoded characters. [1]
The definition of 'text' in the context of Unicode seems to explicitly not be defined as a sequence of code points, but rather a more nebulous sequence of aggregations of code points. It's probably closest to a grapheme cluster but they seem to want to avoid pinning it down.
Review chapter 2.2 Unicode Design Principles in the Unicode Standard: "Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."
Text elements are an abstract concept whose definition depends upon what is being processed. It might be a grapheme, it might be a word, etc...
There might be something a little imprecise here: code points vs code units vs character codes.
I'm open to being wrong but I would be very surprised if they defined text as a "series of code units" the count of which can vary by encoding even for the same character. IMO in this context 'character codes' would likely be far more consistent with 'code points' and they're just trying to differentiate between styled and un-styled text. Whereas the 1.3 definition appears to be trying to make an authoritative definition of 'text.'
If we read 2.2's "character codes" as code points, then that can be multiple code points as referenced in 1.3
[edit] I originally flipped 'units' and 'codes' - cleaned it up.
"Character code" is short for "character code point" or just code point. All Unicode algorithms and properties are defined in terms of the code point. UTF encodings are just a way of encoding a code point. From Unicode's perspective, you care about what is encoded (i.e. the code point) and not how it is encoded (i.e. UTF-8).
Unicode is one of the most poorly understood topics. I think the confusion stems from 1. most programming languages getting the abstraction wrong, and 2. programmers trying to reconcile their non-technical interpretation of what "character" means.
I agree with everything you said, I think I'm just trying to reconcile that with the top of thread saying python was the most correct because it was returning '7 code points' and that 'UTF-whatever is an implementation detail'
But 7 is not the number of code points/USVs - that's the number of UTF-16 code units. The string is 5 USVs. If UTF-whatever is an implementation detail, wouldn't the correct answer to length be 5?
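Assuming the string in question is the man-facepalming emoji sequence from the article (my assumption, spelled out as explicit escapes below), Python makes the three counts easy to see:

s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # facepalm + skin tone + ZWJ + male sign + VS-16
print(len(s))                           # 5   code points / USVs
print(len(s.encode("utf-16-le")) // 2)  # 7   UTF-16 code units
print(len(s.encode("utf-8")))           # 17  UTF-8 bytes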
Treating Unicode strings as a sequence of code points is a completely valid thing to do, but is usually not what you actually care about when dealing with text. Really, are code points any less of an implementation detail?
Code points are what you care about when you do any kind of text-based format encoding or decoding. Any of JSON, XML, HTML, YAML or whatever is defined by sequence of code points. There is no reason to complicate these with visual representation-specific concepts.
If you have to care about the visual representation of text then you probably need to be familiar with other concepts as well.
But, given the root ancestor of this comment, it’s worth clarifying that Python’s approach to strings doesn’t help at all with things like decoding JSON/XML/HTML/YAML; what Python gives you is random access by code point index, which you won’t ever need to use in such tasks.
Unicode defines text as a number of different types of things. They are sequences of codepoints, sequences of graphemes, sequences of grapheme clusters. Furthermore, codepoints are different depending on how you normalize them. Accented characters can be written two different ways and have a different number of codepoints depending on how you write them (and whether normalization is used).
Graphemes are a made-up human thing that, while useful, is locale dependent. Most people, when they talk about grapheme clusters, mean the default "locale-independent" graphemes, but that's not the only kind (in Hungarian, for example, 'ly' is a single letter). Having the same string be two different lengths in two countries is… let's go with surprising. The common denominator where everyone computes the same number is code points.
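The normalization point is easy to demonstrate with the standard library; a small sketch:

import unicodedata

nfc = "\u00e9"                             # 'é' precomposed, one code point
nfd = unicodedata.normalize("NFD", nfc)    # 'e' + U+0301 combining acute
print(len(nfc), len(nfd))                  # 1 2  -- same text, different code point counts
print(nfc == nfd)                          # False until you normalize both sides
print(unicodedata.normalize("NFC", nfd) == nfc)  # True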
There is no "most" correct, since the "length" of UTF encoded text is ambiguous. The point of the post is to highlight which semantics are the most useful and the tradeoffs.
Really the correct way to design string APIs would be to not have an ambiguous "length" at all, but to always require specifying whether you want UTF8-bytes, memory bytes, code points, graphemes, whatever.
However such an API would be pretty cumbersome, because for all non-edge cases (read: a Western language and a reasonable encoding for that language, which when looking at world demographics is a very narrow way of saying non-edge case) we just want to ignore all that fancy stuff, assume it's latin-1/ascii, use "Length" and get on with it, usually accepting that it doesn't work for many scripts or emoji.
So almost every api I have encountered has both the dangerous or ambiguous "length" and any number of the more specific counts. Good? No. But good enough, I guess.
A much worse related API that exists everywhere is the one for parsing and formatting numbers to and from text. How that's done "depends", but most languages I have seen unfortunately offer a "default way". In the worst examples - looking at you, .NET - this default uses the system environment and assumes formatting and parsing numbers should use the OS locale. Horrible, horrible idea when used in conjunction with automatic type conversions. WriteLine($"The size is {3.5}"); shouldn't print "3.5" in the US and "3,5" somewhere else.
>Horrible horrible idea when used in conjunction with automatic type conversions. WriteLine($"The size is {3.5}"); shouldn't print "3.5" in the US and "3,5" somewhere else.
Because it’s only (maybe) a good design if the output is to be read by a human, and that’s not a very general case. Instead people unknowingly write, for example, some exporter for a text format with code that writes "X={x_coord}", and it passes all the unit tests and all the acceptance tests, and then it breaks once it hits a French or Scandinavian machine.
A great example how bad it is would be that the C# compiler repo for a very long time had tests that failed for everyone with non-US formatting.
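The same failure mode can be sketched in Python, which only applies the locale when you explicitly opt in (the de_DE.UTF-8 locale below is an assumption and may not be installed on a given machine):

import locale

x = 3.5
print(f"The size is {x}")        # locale-independent: always "The size is 3.5"

try:
    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")   # assumed to be installed
    print(locale.format_string("The size is %g", x))     # "The size is 3,5"
except locale.Error:
    print("de_DE.UTF-8 locale not available on this machine")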
Unsurprising that (at least some implementation of) Swift does the least wrong thing in returning 1. I think it's also one of the few languages that will return a count of 1 for the madness that is country flag emojis https://docs.swift.org/swift-book/documentation/the-swift-pr...
“Least wrong” sounds very silly. It's like programmers are discovering there's a difference between bytes, Unicode code points and grapheme clusters, are unsure how their favorite programming language represents strings, and then decide there should be some behavior that doesn't follow from the documentation.
The “length of an emoji” depends on the data type used to represent it. It's that simple and that correct.
I have read somewhere that you should learn 2 or 3 programming languages from the get-go. If you learn only one, you run the risk of letting its shape dictate how you mentally model computation. At some point someone who learned a dynamically typed programming language first is bound to find out why data types matter.
I had a "programming languages" class that did that, where we did assignments in Python (scripting), OCaml (functional), and Prolog (logic). This is because most other classes used compiled imperative languages such as C++ and Java.
I definitely don't have talent for logic and quantitative thinking. It takes a long time and many iterations for even simple concepts in mathematics to sink in for me. I benefited greatly from learning first Scheme and also making sense of C and OS internals before trying to grok interpreted languages. I'm currently trying to get some proficiency in Go and it's been great fun!
I think this is a really a naming convention issue. Len() is ambiguous, you really want either num_chars() or utfxx_len(). Of course, the issue of what counts as a character is confusing in its own right...
In Python len() on a bytes type gives you the number of bytes, and len() on a str type gives you the number of codepoints. I think that makes sense, as strings are only intended to deal with text, and you should never have to worry about byte indexing at all.
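A small sketch of that split between the two types:

s = "h\u00e9llo"                 # 'héllo'
print(len(s))                    # 5 -- code points
print(s[1])                      # 'é' -- indexing a str yields a one-code-point str
b = s.encode("utf-8")
print(len(b))                    # 6 -- bytes ('é' is two bytes in UTF-8)
print(b[1])                      # 195 -- indexing bytes yields an int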
As someone who has done both, I'd say that argument is wrong. It is much more convenient to index by code point. Indexing by bytes is almost always what you don't want to do, and leads to a lot of errors.
In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.
A codepoint is the "smallest useful addressable unit" when dealing with Unicode text, so it makes sense that's the default.
It's also comparatively expensive to address grapheme clusters.
> In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.
I can see that iterating through by codepoint could be useful for some of those cases, but I still can't see why you'd ever want to index by codepoint?
For the same reason you want to index anything: to slice, remove, etc. E.g. to replace a skin tone in an emoji: "str[i] = 0x1f3ff", or to insert one: "str = str[:i] + chr(0x1f3ff) + str[i:]" (a runnable sketch follows below).
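A runnable version of that idea (replace_skin_tone is my own illustrative helper name, not an API from the thread):

# Skin tone modifiers are the five code points U+1F3FB..U+1F3FF.
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def replace_skin_tone(s, new_tone):
    # Rebuild the string code point by code point; Python strs are immutable.
    return "".join(new_tone if ch in SKIN_TONES else ch for ch in s)

facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
print(replace_skin_tone(facepalm, "\U0001F3FF"))   # same emoji with the darkest skin tone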
But that's a pointlessly inefficient way to do it - surely what you want there is to iterate and transform rather than scan through and then slice? (And don't you need to group by extended grapheme cluster rather than codepoint anyway for that to make sense?)
The wife and I have a Google sheet that we use for our shared calendar - we put an emoji before each "event", and in the top row of each day I show the emoji for that day's entries. But I need to do:
Mind you, this is inefficient due to unnecessarily constructing an array. Here’s a more efficient version, though the difference will normally be fairly slight:
function codePointLength(str) {
  // for...of iterates a JS string by code point, not by UTF-16 code unit,
  // so this counts code points rather than str.length's code units.
  let len = 0;
  for (const c of str) {
    len++;
  }
  return len;
}
Kinda sad there are no equivalents to the Array methods that work on iterators. Array.prototype.reduce.call(str[Symbol.iterator](), (a, _) => a + 1, 0) doesn’t work since those methods only work on array-like types (meaning those with a length property and indexed by number—and yes, all these Array methods are explicitly defined that way deliberately so you can use them on other array-like types), not iterators.
Caution: Intl.Segmenter may not be available, so be sure to have a fallback if you want to use it. Chromium shipped it 2½ years ago, Safari 2 years ago, and Firefox hasn’t shipped it yet. (No idea why and I haven’t looked. It’s not always the case: I know of other Intl things that Firefox has shipped first.)
.each_codepoint.size is more efficient than .codepoints.size, as it creates a sized Enumerator that avoids needing to build an intermediate Array. For strings with only single-byte characters it reduces to returning the already-stored byte length.
Same goes for .each_byte.size, but for that you have the faster .bytesize method that avoids the intermediate Enumerator.
Very good and informative article, though it's still not convincing that the nudge to make the shortest "len" call return the human-readable count of grapheme clusters, as in Swift, isn't the best design approach; all the non-intuitive sizes should be the special-purpose calls.
The article shows that the Swift approach produces different values for length depending on operating system and text library versions. Is that really intuitive?
The Swift approach can't reach perfection in isolation because data from the future can always break it.
That's why in the article you see Swift running on Ubuntu 14.04 returning len==2 while the same code on Ubuntu 18.04 returns len==1 for the same emoji string.
IMO that's a big philosophical question here: do we accept that "string length" means something you can't compute for arbitrary strings unless your code is receiving annual updates containing the latest Unicode interpretation instructions?
Swift includes its own Unicode data tables with the standard library since last year, so it’s now tied to the stdlib version rather than some other library that may or may not be updated on the system.
Your example shows an improvement, which proves my point (also don't drop the word asymptotically, nothing can ever be perfect, that's not the issue, being closer to perfect is a positive)
And you can compute it, you can pin a Unicode version and ship it in the language if those platform differences are unbearable (so, you can actually isolate it and simply ignore the future :))
The bigger philosophical question: how much longer do we accept that "string length" does not measure the most intuitive notion of string length, and keep calling a byte a char?
I cannot think of a single common case where grapheme cluster count is important. If you want to print them aligned to a terminal - guess what, double width characters exist, so the only reliable way is to print them first, measure the cursor movement using escape sequences, calculate length and erase the originally printed data.
Even for limiting input field sizes byte count is much better, as otherwise you are opening up yourself for unicode denial of service. I think the game Minecraft has such an exploit where you can fit in absurd amounts of utf-8 data (to the point of data corruption in multiplayer games) since it's limited by visual length.
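A sketch of why a grapheme-based limit invites that kind of abuse:

payload = "e" + "\u0301" * 10_000       # one letter followed by 10,000 combining acute accents
print(len(payload))                     # 10001 code points
print(len(payload.encode("utf-8")))     # 20001 bytes
# A grapheme-cluster counter would call this a 1-"character" string,
# so a visual-length limit would happily accept it.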
My personal favorite dealing with UTF-8: pretend it's ascii and assume everything above 128 is an alphabetic character. It just works. For 99% of use cases it doesn't matter if the content is emojis, families of emojis, or ancient sumerian scripts. You can parse JSON and most other formats this way without caring about code points at all. The trend of unicodizing everything was a mistake, just treat strings as bytes and parse them as utf-8 only when you really need it (like when building a text editor or a browser engine from scratch).
But it does - the genius of utf-8 is that it was deliberately designed to be backwards compatible (it even preserves the ascii sorting order). You can run C programs written before utf-8 was invented with utf-8 inputs (unlike with the abomination that is utf-16).
If a code point is outside the ascii range (0-127 inclusive), then its utf-8 encoding is also guaranteed to not contain any ascii bytes. So as long as you treat anything in 128-255 as "some unknown character", the utf-8 code points will be preserved and eventually displayed when the byte sequence is parsed as utf-8 by your terminal/browser/whatever.
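A quick sketch of why that works: splitting on an ASCII delimiter can never land in the middle of a multi-byte sequence, because every byte of a non-ASCII character is >= 0x80.

row = "na\u00efve,\U0001F926\U0001F3FC\u200D\u2642\uFE0F,stra\u00dfe".encode("utf-8")
cells = row.split(b",")                       # parse structure without decoding
print([c.decode("utf-8") for c in cells])     # ['naïve', '🤦🏼‍♂️', 'straße'] -- nothing got mangled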
mysql> WITH chars AS (SELECT ' ' c)
-> SELECT LENGTH(c), CHAR_LENGTH(c) FROM chars;
+-----------+----------------+
| LENGTH(c) | CHAR_LENGTH(c) |
+-----------+----------------+
| 17 | 5 |
+-----------+----------------+
1 row in set (0.01 sec)
Note that the emoji doesn't seem to render in preformatted text on HN.
This should be easier to reproduce:
mysql> WITH chars AS (SELECT 0xF09FA4A6F09F8FBCE2808DE29982EFB88F c)
-> SELECT CONVERT(c USING utf8mb4), LENGTH(c), CHAR_LENGTH(c) FROM chars;
+--------------------------+-----------+----------------+
| CONVERT(c USING utf8mb4) | LENGTH(c) | CHAR_LENGTH(c) |
+--------------------------+-----------+----------------+
| | 17 | 17 |
+--------------------------+-----------+----------------+
1 row in set (0.00 sec)
For all intents and purposes, a user will count it as one character. Truncating the string without including the whole cluster would change the meaning of it, and is not an operation anyone would do as a general purpose thing any more than someone would want to randomly replace the last character with random letters.
It looks like one character. I'd rather APIs let us continue pretending it is one character.
It's not weird because it's not a 160 limit in the sense of characters. Non-English language people are I guess more aware of this -- using a character not in the standard English a-z often counts as two characters.
Can anyone comment as to whether there are any problems associated with using emojis to enhance the entropy of passwords? For passwords you only ever need to autofill but never actually type, I feel like it would be an easy way to augment passwords, but I don't know whether it would translate directly in every situation.
All these abominations are because of non-strict typing
String = List ( Char )
Chars don’t have a length, just as a number doesn’t have a length - unless you talk about the number of bits. If you are working with strings, stick with strings. The length of a string holding a single character should be 1. Just enforce proper typing. Anything else is not consistent.
No, it's caused by cost. For example, Java has a char type. It's a 16-bit numeric value because Java uses UTF-16 internally for encoding strings. Java Strings are basically immutable char arrays with some fluff around them. If you ask for the String length, it returns the length of the underlying array. Nice and simple and unsurprising. And relatively cheap. Most more recent languages use 8-bit bytes and UTF-8 instead, because that has emerged as the most common character encoding. But UTF-16 was a reasonable choice a quarter century ago, the practical difference doesn't matter that much, and changing it would be disruptive.
If you put unicode characters consisting of multiple code points into a String, it necessarily increases the number of chars. There's no way around that, because there is no such thing as a UnicodeChar type in Java. You can't actually assign a multi-code-point unicode character to a char.
Essentially all the workarounds for a 'correct' unicode character count in a String would either end up using a different and probably way more expensive data structure (e.g. a list of lists of chars or bytes, where each inner list is a unicode character) or implementing some expensive logic for counting characters that is O(n) instead of O(1). Most languages, ranging from extremely strictly typed to weakly typed, don't do that for cost reasons. The tradeoff is simply not worth the price.
Python has strong typing which seems to be what you mean here rather than strict typing.
A "character" is not a well defined term in Unicode, rather the "base" that does not vary across implementations is code points, which is what Python measures when you get the length of a string.
Am I wrong for assuming the .length should return a length in bytes? If you want to use 32bit units, then multiply your output by 4.
If you want to do Unicode string manipulation and length counting, then use specific functions for that - but the base internal .length function should just output bytes.
The most obvious use case for length is iterating over the string and indexing it. In JS (or Go, Rust, Python) indexing and iteration is not byte based. As has been said elsewhere, length depends on the context/way you use it.
In Rust you need to specify what it is you think you're going to "iterate over" in a string.
You can't just "iterate over a string" because that's not a thing. You can get an iterator over the bytes in the string, with "foo".bytes() or you can get an iterator over the Unicode scalar values in the string with "foo".chars(), or you can iterate over a UTF-16 encoding of the string with "foo".encode_utf16()
You can index into Rust's strings, but you need to specify slice indexes; you can't just treat this like it's an array, because that's not what it is. If you want a slice of bytes you can have one cheaply, via as_bytes(), which is a [u8], and you can index directly into that slice as with any array of bytes, but you can't mutate it and those aren't characters, they're just bytes.
> Do you think the length of an `int64_t[3]` array should be 3 or 24?
There should be functions to do both: sizeof(int64_t[3]) (i.e. 3 * sizeof(int64_t)), for example, to get bytes.
In this example, the base function should do bytes, and there should be a unicode function to count it in other ways.
I could be sizing to fit in a database, or send over the wire, or I might want visible space on the screen, or I might want to know how to move the cursor.
Each of those types of length should be supported.
It's useful if you want array-like semantics (e.g. O(1) lookup) on Unicode text strings, because you have a fixed size for every codepoint, unlike UTF-8. Python, for example, uses it internally.
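That internal choice is visible, in a hedged and CPython-specific way, through sys.getsizeof; the exact numbers are implementation details and vary by version and platform, but the per-code-point width grows with the widest code point in the string (PEP 393):

import sys

print(sys.getsizeof("aaaa"))           # all code points < 256: 1 byte each internally
print(sys.getsizeof("aaa\u0416"))      # a BMP code point >= 256 forces 2 bytes each
print(sys.getsizeof("aaa\U0001F926"))  # a supplementary code point forces 4 bytes each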
I think that's the issue here. People disagree on how useful or not useful it is. It's maybe not ideal, but I don't think it's anywhere near so bad as to be entirely not useful. Strings-are-sequences-of-bytes is worse in my opinion. Python literally used to have that. It was worse.
The problem with what Python used to have is that the encoding wasn’t fixed.
I’ll agree with you that strings-are-sequences-of-bytes is bad. That’s painful compiler-flag, codepage, &c. territory.
But what’s not bad is strings-are-sequences-of-code-units. That’s what Rust has, for example. Rust strings aren’t sequences of bytes, but of UTF-8 code units, and the two are semantically very different.
Once again, strings are not simple sequences of characters. It's also useless to "index" into a string without specifying what you're indexing for the same reason.
this is one of those things that people point to when comparing languages, but in reality it rarely matters. with Go, you just get the number of bytes, which is the correct default thing to do:
if the language default was anything other than this, THAT WOULD BE WRONG and unexpected. I would prefer the default to be the dumb, fast thing. then if I want the slow, fancy thing, I can import some first or third party package.
I think to some extent it depends on the language. In the article they talk about Swift's implementation, which by default does the slow, fancy thing (but makes it easy to do the dumb, fast thing). String manipulation in Swift is almost certainly going to be used for a GUI for end users of many possible languages / locales, so it makes sense to spend the extra cycles to get the fancy version by default. If it isn't the default then you'll end up with half the apps on the App Store displaying broken text on line breaks, ellipses, wrapping, etc. on their hand-rolled UI stack.
For anyone wondering what Go does, it looks like Python2's way[1]; strings are byte sequences with no guarantees of UTF{anything} correctness. Go's source code is specified to be UTF8 so string literals in source code will become valid UTF8 encoded strings, but any string from any library call or code you didn't write might contain invalid Unicode text, or mixed encodings, or anything.
That feels a bit "pit of despair" design[2], the default thing is unhelpful and doing more than that requires the programmer to climb up out of it.
The sad thing is returning unicode code points is probably not going properly do what you wanted to do either... sliding down the slippery slope, you'd end up needing a text layout renderer and a language model to do what you thought you wanted to do. (and then there'll be a thousand bugs and edge cases that your libraries didn't handle properly)
Sure, however that's actually decoding the string into Unicode scalar values and then counting them, whereas the length of the string is a direct property of the string reference (it's a fat pointer [address + length]).
I don't remember, but I think the size hint is set on the Chars iterator, so it can see it has 17 bytes of data and knows that can't encode more than 17 Unicode scalar values, nor fewer than five. But since we ask for an exact count that hint is unused, and the actual decoding will take place.
Yes, your point? That is the same thing which happens in Swift if you request the length of a string and it gives you the number of glyphs (1, in this case).
Rust doesn't take sides here. It exposes all the different ways you might want to calculate the "length" of a string, and lets you pick which one you mean. The non-zero-cost choices involve a multi-step specification (like `.chars().count()`), which states explicitly the calculation involved.
Asking str.len() is a single very cheap operation, it's not only O(1) in the sense you'd learn in an algorithms course, it's really actually very cheap to do, it's fine if an algorithm relies heavily on str.len()
In contrast chars().count() creates an iterator and runs the iterator to completion counting steps, that's O(N) for a string of length N, and is in practice very expensive, you should definitely cache this value if you will need it repeatedly. It is possible the compiler can see what you're doing and cache it, but I am very far from certain so you should do so explicitly.
This is important in contrast to say, C, where strlen(str) is O(N) because it doesn't have fat pointers and so it has no idea how long the string is in any sense.
Yeah but unfortunately it provides `.len()` directly. It's documented to make clear that it's the bytecount and not the characters, and that humans usually work with characters, but given that this isn't even a trait implementation I think `.as_bytes().len()` or something would have been better.
This is only if you want strings to be sequences of bytes. If you want strings to be sequences of code points, it is more sensible to define string length as the length of the sequence. I prefer the latter (for coded text) because it is closer to the meaning of the string. Sequence of code points is always sequence of code points, but a sequence of bytes may not correctly encode a sequence of code points, and bytes in encoding are not in one-to-one correspondence with code points in string. So I see no reason to care about individual bytes per se in the string's code.
Because whenever you want to store or transmit a string only the byte count matters (the size of the string). All the fancy unicode stuff on top of bytes is for the display layers to handle. The default should be grounded to the reality of the programmer.
Storing and transmitting is always going to work with low-level storage units like bytes, so your string will need to be converted to that first. But string manipulation is extremely common in programming, and I would think graphemes are the most useful unit here - i.e. as a programmer my preference would be for swift's behaviour.
Human interaction is a more grounded reality for programmers vs. the dumb land of pure bytes, so even at that conceptual level the default should be smart
And bytes are the only thing that matters for a specific type of string, conveniently named a sequence of bytes.
I've been quite happy that popular emojis were introduced in supplementary planes, because my language has quite a few common words (eg. 𨋢 [lift/escalator]) that ended up on plane 2.
Proper software support for those characters used to be terrible, but things got much better after emojis became popular. So, thanks and sorry everyone :)
Unification was reasonable at the time, given the goal to fit Unicode in 16 bits, and willingness to exclude obsolete characters. It's just that they followed official Japanese standards, and therefore unified too many from the point of view of other languages.
I think the first big mistake was using postfix/infix operators (combining characters, modifiers, variant selectors, joiners, etc.) rather than prefix, preferably in blocks by arity. That would have simplified processing (in particular a keyboard dead key could have been identical to a combining character) and made broken sequences detectable.
The latest big mistake, I think, was retroactively changing some non-emoji characters to have “emoji presentation”, which means that some text has to be edited to preserve its original appearance.
Another mistake IMHO was that they accepted too many "dictionary characters", i.e. the ones only seen once or twice in some obscure dictionary -- they often had explanations like "an obscure form of [common character]".
I agree that they introduced unnecessary complexity for text encoding and for font rendering (which is now expected to support multi-coloured emoticons). I once started writing a text editor, and then fell deep into Unicode handling.
I have now put more work into the Unicode parts than into anything else in the program.
I think that the industry could have instead adopted the old web-forum convention of colon-word-encoding, originating from ASCII art. Example: ":facepalm:".
When the sequence is not supported as an emoji, it degrades gracefully into text that can be understood by anyone reading it instead of into a sequence of empty squares or diamonds with question marks in them. Text also provides a more efficient input method than having to browse for an icon in a list.
There's most definitely a situation where you want the length of string that contains an emoji character and perhaps dumb things happen if you get that wrong.
I was never much impressed with that article (too much irrelevant story for an “absolute minimum”), and by now it’s very dated. A lot of what it’s talking about (most notably code pages) is now completely irrelevant to the vast majority of developers, who might never encounter or need to worry about them in their entire careers.
Interestingly, Firefox on Wayland renders the emoji correctly in the tab title, but the window title renders it as two rectangles and the male symbol. I assume this must be some difference between system fonts vs Firefox's fonts.
In the CIA sense of "we need to handle these democratically elected leaders", perhaps.
But if I was selling you a drop-in comment widget and boasted "it handles all of Unicode", but really I was just running s/[^ -~]+/ /g, wouldn't you feel a bit let down?