> Python 3’s approach is unambiguously the worst one, though.
Did I miss the part where he explains this take? It's made up of 5 valid unicode code units. For a language where you're not supposed to need to know the byte size semantics, the correct length should be 5. What am I missing?
The close second is 17, the length in bytes, which is another fine way to represent this data, e.g. what a successful write of some sort would look like to a network or a file.
I guess I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data, vs how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them.
It is wrong that "{emoji}".length == 7 -- but it's wrong because there's no such thing as the 'length' of a string out of context.
A string should be viewed as an opaque data type with views into it depending on what you're trying to do. You can have its length in the context of storage/retrieval/transmission (UTF-8 byte count), its length in the context of parsing (code points), its length in the context of editing (grapheme clusters) or length in the context of display (a bounding box in points when used in conjunction with a specific font and paragraph style attributes).
Claiming to provide an out-of-context length is strictly wrong because there's no such thing. This is where people get confused.
The attribute shouldn't be 'length' it should be something like 'countOfCodePoints' or exposed via a `CodePoints` type view.
It's particularly bad because so often (esp. for western programmers) 'countOfCodePoints' == 'countOfBytesInUTF8' == 'countOfGraphemeClusters' == """length""" so it's hella easy to accidentally write buggy software. Especially for people who don't know the above about unicode, which let's face it, most people don't. Not until they have to explain to their designer why they can't limit a label to '10 characters.' ("What do you mean there's no such thing as a character, and what am I trying to do?").
This is basically the tl;dr of the article but it's also my personal opinion.
All of this isn't about 'wrong' so much as 'imprecise and overloaded terminology making it easy to write buggy software through poor abstractions.'
If python explained which length you were getting, then this article wouldn't exist.
There are three notions of length that make sense:
1. UTF-8 byte length
2. Code point count
3. Extended grapheme cluster count
#3 makes sense for users but it doesn’t make sense for programs which often need to work at the code point level.
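For the article's emoji those three notions give three different answers. A minimal sketch in Python, writing the emoji out as escape sequences (U+1F926 U+1F3FC U+200D U+2642 U+FE0F) and using the third-party `regex` module for grapheme segmentation, since the stdlib has none:

    import regex  # third-party; pip install regex

    s = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"

    print(len(s.encode("utf-8")))        # 1. UTF-8 byte length: 17
    print(len(s))                        # 2. code point count: 5
    print(len(regex.findall(r"\X", s)))  # 3. extended grapheme clusters: 1
                                         #    (with a current Unicode database; older rules gave 2)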
I expect programming language string length to obey the law:
len(a ++ b) = len(a) + len(b)
For example, if I concatenate two strings, one containing an “e” and one containing a combining acute accent, then I expect the length to be longer than a string containing a precomposed ‘é’ character. It’s in fact useful if strings that look the same but have different code points have different lengths, because it tells you that they’re not the same (and maybe you forgot to normalize something etc).
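A quick sketch of that expectation in Python, where len() counts code points:

    a = "e"
    b = "\u0301"     # COMBINING ACUTE ACCENT
    c = "\u00e9"     # precomposed 'é'

    assert len(a + b) == len(a) + len(b)  # the law holds
    print(len(a + b), len(c))             # 2 1: same-looking strings, different lengths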
Code point length is the most useful for people who are actually writing string algorithms based upon Unicode.
UTF-8 length is useful for people who are treating strings as opaque byte sequences, but in that case they should be using a bytes/buffer object and not a string object, except in very low-level languages that don’t want to pay an encoding/decoding cost.
Extended grapheme cluster count is useful for people who are constructing certain kinds of user interfaces, where the number of characters is limited for a policy rather than memory or width reason.
i.e. when length limits are imposed by human policy, grapheme cluster count is the way to go. Length limits for memory reasons should rather be in UTF-8 bytes. If you need a limit for visual width reasons then you need to go measure the string in pixels, otherwise I’m going to put a U+FDFD in there and ruin your day.
Everything that you can do to a Unicode string, except concatenation, is defined in terms of code points. Normalization, case transformations, collation, regexes, layout and rendering and encoding.
For example, let’s say you want to define a “natural sort” order that sorts e.g. “A2” < “A10”. To do that you divide the string at boundaries between code points in ranges of each numeral type that you are supporting (e.g. western numerals, Arabic numerals, Chinese numerals).
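A minimal sketch of that idea in Python, handling only decimal-digit numerals (Unicode category Nd); other numeral types such as Chinese numerals would need their own code point ranges:

    import re

    def natural_key(s):
        # Split at boundaries between runs of decimal digits and everything else.
        # re's \d is Unicode-aware, so Arabic-Indic digits are caught too.
        parts = re.split(r"(\d+)", s)
        return [int(p) if p.isdecimal() else p for p in parts]

    print(sorted(["A10", "A2", "B1"], key=natural_key))  # ['A2', 'A10', 'B1']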
Codepoints is best for collaborative text editing / CRDTs (diamond types, automerge, etc). We generally model the document as a big list of unicode codepoints.
We could use grapheme clusters, but the grapheme cluster boundary points change as unicode evolves, and not all systems update at the same time. Separating strings based on grapheme cluster boundaries also requires a big lookup table to be embedded in every app. Unicode codepoints are obvious, stable, and easy to work with. And they're encoding-agnostic, so there's no weird UCS2-to-UTF8-bytes conversion needed in javascript, C#, etc.
Using code points (or scalar values, I hope) just means that it’s inefficient for everyone, because now everyone has to convert indexes (well, except Python, but it has other problems), instead of only half the people.
Going UTF-8 is fairly clearly superior: it will be the wire format, even if it’s not the language’s string format, so now environments that use UTF-8 strings never need any conversions (apart from decoding escape sequences, most likely).
Much as I hate UTF-16, I would even be inclined to argue that UTF-16 was a better choice than code points, as it will reduce the amount of extra work UTF-16 environments have to do, without changing how much UTF-8 environments have to do at all; but it also has the disadvantage that validation is vanishingly rare in UTF-16, so you’re sure to end up with lone surrogate trouble at some point, whereas UTF-8 tooling has a much stronger culture of validation, so you’re much less likely to encounter it directly and can much more comfortably just declare “valid Unicode only”.
Yes, code points is a purer concept to use. I don’t care: it’s less efficient than choosing UTF-8, which adds negative-to-negligible complexity. Please, just abandon code point indexing and embrace the UTF-8.
The problem with utf8 byte offsets is that it creates a data validation problem. In diamond types I’m using document positions / offsets in my wire format. With utf8 byte offsets, you can receive changes from remote peers which name invalid insertion positions (i.e. an insert inside a character, or deleting half of a codepoint). Validating remote changes received like this is a nightmare, because you need to reconstruct the whole document state to be able to tell if the edit is valid. Using Unicode codepoints makes invalid state unrepresentable, so the validation problem goes away. (You might still need to check that an insert isn’t past the end of the document, but that’s a much easier check).
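A tiny sketch of that failure mode, just to illustrate the point (nothing here is diamond types' actual API):

    doc = "naïve"
    utf8 = doc.encode("utf-8")

    # Byte offset 3 lands in the middle of the two-byte encoding of 'ï',
    # so a remote edit expressed as a byte offset can produce invalid UTF-8:
    broken = utf8[:3] + b"X" + utf8[3:]
    try:
        broken.decode("utf-8")
    except UnicodeDecodeError as e:
        print("invalid UTF-8:", e)

    # A code point index can never split a character apart:
    print(doc[:3] + "X" + doc[3:])  # always a valid (if unwanted) string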
Almost all application programming languages use utf16 anyway, (javascript, c#, swift, Java) so you still need to convert positions anyway. Even in rust it’s common to see line/col positions from text editors.
Using utf8 byte offsets just doesn’t really give you any benefits in exchange for making validation much harder.
> Almost all application programming languages use utf16 anyway, (javascript, c#, swift, Java)
Swift 5 switched to UTF-8: https://www.swift.org/blog/utf8-string/. I’m hopeful that other UTF-16 environments might eventually manage to switch to UTF-8 internally despite retaining some UTF-16 code unit semantics for compatibility; two projects have already demonstrated you can very practically do this sort of thing: Servo from fairly early on (WTF-8 despite the web’s UTF-16 code unit semantics), and PyPy since 7.1 (UTF-8 despite code point semantics, not sure what they do about surrogate code points). I know the web has largely backed away from UTF-16 and uses code point semantics (well, scalar values plus loose surrogates) on almost all new stuff, with good UTF-8 support too.
I contemplated querying that myself, but decided that for CRDT editing purposes, it’s probably never practical to think about grapheme clusters. Given text in a language with syllabic grapheme clusters (e.g. Indic languages), if you start with “ba” (one syllable/grapheme cluster) and edit it to “boca” (two syllables/grapheme clusters), you could say “replaced ‘ba’ with ‘boca’”, but I’d be surprised if any CRDT did it that way if it could instead handle it as “inserted ‘oc’”, even though “oc” mightn’t make sense by itself, linguistically. But Unicode doesn’t do much in the way of defining what makes sense or not, and I don’t think there’s any coherent “bad grapheme clustering” detection algorithm. (Aside: on reflection, my Indic languages example is messier still, since -a probably represents the inherent vowel, so in “ba” → “boca” those “a”s are probably actually represented by the absence of a vowel sign code point—and if you wanted to suppress the inherent vowel, you’d need to add a virama sign. Fun stuff.)
But then again, I know that some CRDTs struggle with interleaving, and maybe grapheme-awareness could help things out in some way or other. I dunno.
Yeah I agree. I think it's inevitable that collaboratively edited documents sometimes end up with grapheme clusters that are considered invalid by some peers, simply because different peers might be using different versions of unicode. If my phone supports the polar bear emoji and yours doesn't, you'll see weird stuff instead of a polar bear. There's no getting around that.
And yes, using unicode codepoints, buggy clients might insert extra unicode characters in the middle of a grapheme cluster. But ... Eh. Fine. I'm not super bothered by that from a data validation perspective.
Why don't I have the same attitude toward invalid UTF8? I mean, the CRDT could synchronize arbitrary arrays of bytes that by agreement contain valid UTF8, and treat it as user error in the same way if that happens? Two reasons. First, because some languages (eg rust) strictly enforce that all strings must contain valid UTF8. So you can't even make a document into a String if it has invalid UTF8. We'd need a try_ codepath, which makes the API worse. Secondly, languages like javascript which store strings using UTF16 don't have an equivalent encoding for invalid UTF8 bytes at all. Javascript would have to store the document internally in a byte array or something, and decode it to a string at the frontend. And that's complex and inefficient. That all sounds much worse to me than just treating the document as a sequence of arbitrary unicode codepoints, which guarantees correctness without any of that mess.
> grapheme clusters that are considered invalid by some peers
I’m not sure if there’s any spec that defines any sequence of Unicode scalar values as “invalid” (though there’s certainly a lot that’s obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness, but just doing something with what it has; so if you inject something into the middle of what it decided was a cluster, it’ll just split it differently.
> treating the document as a sequence of arbitrary unicode codepoints
Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
Yes, my mistake. I do mean scalar values. I am constantly confused about the terminology for unicode. (Unicode code point? Scalar values? (Character?) Surrogate pair? Is there a term for half of a surrogate pair?)
You'll mangle the string if you index and search by code points too, when the string contains the emoji this article is about, or for that matter an "e" followed by a combining acute accent.
The string will be fine if you only move to, split and concatenate at indices outside those grapheme clusters. But that is also true when indexing by bytes or UTF-16 code units.
So in some senses, indexing by bytes is just as good as indexing by code points, but faster. Either way to avoid mangling strings you need to restrict the indices of whatever type to meaningful character boundaries.
If you have decided to avoid string indices inside grapheme clusters, there comes the awkward question of what should you do when editing text in an environment rendered with font ligatures like "->" rendered as → (rightward arrow). From one perspective, that's just a font. From another, the user sees a single character yet there are valid positions (such as from cursor movement and character search) that land mid-way through the character, and editing at those positions changes the character. Neither is clearly best for all situations.
UCS-2 is dead. You can't express Unicode in UCS-2. If you have old UCS-2 data you can just treat it as UTF-16, maybe check for encoding irregularities but if it was really UCS-2 correct Unicode it'll be fine.
UTF-16 length is only useful if you are moving UTF-16, perhaps for interop with other software that chose UTF-16. Remember to pass on your condolences and look forward to a day when we don't do that any more.
> UTF-16 length is only useful if you are moving UTF-16 [...] Remember to pass on your condolences and look forward to a day when we don't do that any more.
Java/the JVM says hi!
Arguably, "how many bytes does this string occupy in memory/on disk (e.g. in class files)" is a pretty useful thing to be able to ask.
I agree, "length" is an ambiguous function name. It should probably not exist and instead you have functions with units in the name: .sizeBytes, .widthCharacters, .widthResAdjPixels, and so on. Back when the world was ASCII you could get away with just .length because the numbers would always be the same, but with Unicode and all of the other complications of the modern world it isn't sufficient.
Yeah; I've recently noticed that almost every time I use string.length in javascript, it's wrong and going to break something as soon as emoji appears. In my code, I always want to deal with either the number of codepoints or the number of UTF8 bytes. String.length gives you neither, but unfortunately it looks correct until you test with non-ASCII strings.
Yeah; it's really confusing but javascript - for legacy reasons - treats strings as "arrays of UCS2 items". But javascript also implements an iterator on strings which iterates through strings by unicode codepoints. That's why "of" loops work differently from "in" loops. (for-of in javascript uses Symbol.iterator). That also means you can pull a string apart into an array of unicode codepoints using [...somestring].
length is not ambiguous at all. It's the number of elements in the array. A string in python3 is an array of unicode code points, so the length of a string is the number of unicode code points. If you want the number of bytes, you need to encode the string in a unicode format (utf8, utf16 or utf32) to get a bytes object, which is an array of bytes. Then you can get the length of that.
Remember, one of the big accomplishments (breaking changes) of python 3 is that all strings are Unicode, not byte arrays. If you want to view a string as bytes, you need to convert the string to bytes. But note the number of bytes depends on the encoding you use (utf8, …).
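For example (just illustrating the point with an arbitrary string), the byte count changes with the encoding while the code point count doesn't:

    s = "héllo"
    print(len(s))                      # 5 code points
    print(len(s.encode("utf-8")))      # 6 bytes
    print(len(s.encode("utf-16-le")))  # 10 bytes (no BOM)
    print(len(s.encode("utf-32-le")))  # 20 bytes (no BOM)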
Interestingly, the number of Unicode codepoints is probably the only measure of a string that is unlikely to ever be relevant to anyone in practice except when it happens to coincide with a different measure.
It can't be used to determine length in bytes (important for storage or network transmission), it can't be used to determine number of displayed characters, it can't be used to safely split a string at some position.
The only reason it has caught on is that it is easy to encode into UTF-8 and UTF-16, and that anything more interesting generally requires a language context and even a font.
I hope that future languages will get rid of this single string abstraction, and instead offer two completely separate types:
- symbol strings, which would only be usable for programming purposes and should probably be limited to ASCII
- text strings, which would be intended for human display purposes, with full Unicode support, and have APIs which answer things like "in the specified Culture, what is the length of human-recognizable characters of this string" or "what is the seventh human-recognizable character in this string in the specified culture"
There's no reason to pay the conceptual cost of Unicode for representing field names or enums (and yes, I don't believe supporting Unicode identifiers is a good idea for a programming language; and note that I am not a native English speaker, and while I do use an alphabet, ASCII is missing some of the letters&symbols I use in my native Romanian). And there's no reason to settle for the misleading safety of Unicode code points when trying to process human displayable text.
The length of an array should correspond to the number of elements. Since each element is a code point, it's the most relevant number if you intend to operate on individual elements. That is, the maximum index corresponds to the length of the array.
If you care about the number of bytes, or to operate on individual bytes, then convert to utf-8,16 or 32, and operate on the bytes object. If you wish to operate on grapheme clusters, then you could probably find some 3rd party Python library that allows you to represent and operate on strings in terms of grapheme clusters.
A string is not an array, it is a chunk of text, for the vast majority of uses of strings. Exactly how that chunk of text is represented in memory and what API it should expose is the discussion we're having. My point is that it shouldn't be exposed as an array of codepoints, since array operations (lengths, indexing, taking a range) are not a very useful way of manipulating text; and even if we did expose them as an array, Unicode code points are definitely not a useful data structure for almost any purpose.
There are basically only two things that can be done with a Unicode codepoint: encode it in bytes for storage, or transform it to a glyph in a particular font or culture.
You can't even compare two sequences of Unicode codepoints for equality in many cases, since there are different ways to represent the same text with Unicode. For example the strings "thá" and "thá" are different in terms of codepoints, but most people would expect to find the second when typing in the first. Even worse, there are codepoints which are supposed to represent different characters, depending on the font being used / the locale of the display (the same Unicode codepoints are used to represent related Chinese, Japanese, or Korean characters, even when these characters are not identical between the three cultures).
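That equality point can be shown with the stdlib's unicodedata module (assuming the two "thá" strings are the NFC and NFD forms, which is the usual way this comes up):

    import unicodedata

    nfc = "th\u00e1"    # 'thá' with a precomposed 'á'
    nfd = "tha\u0301"   # 'thá' with 'a' + COMBINING ACUTE ACCENT

    print(nfc == nfd)                                # False: different code points
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization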
Splitting into ASCII-only and Unicode would be more of a regression than a progression. And yes, the “I’m not a native speaker” is a typical pre-emptive reply, as if it matters (neither am I—doesn’t mean anything by itself).
Let me give an example of why I don't think a single unified string API works. When doing (stringA == stringB), what do you expect to get as a result? Do you expect it to tell if you the two strings represent the exact same codepoints, or do you expect it to tell you whether they represent the same Unicode grapheme clusters, as Unicode recommends?
The answer is of course both, depending on context. You certainly don't want a fuzzy match when, say, decoding a protobuf, but you also don't want a codepoint match when looking up user input.
What most modern languages have settled on is having a Unicode codepoint array type, typically called string or text, and an array of bytes type. However, common string operations are often only provided for the text type, and not the bytes type - which becomes very annoying when doing low level work and using bytes for text, and hoping for simple text operations.
Exactly this. People conflate unicode with encoding quite a bit. I think it was plan9 and early Go that used "runes" as a unit, where one or more runes formed a character and an array of runes could be encoded into bytes using a given encoding.
The in memory size of a rune was just an implementation detail, and while it could be important for the programmer that the size of a rune was 2 bytes, this didn't mean the length of an array of 2 runes was 4.
I always liked the rune unit, and while my memory is hazy I think it was just code points.
I think part of the issue is programmers and apis mixing bit units for in memory representation of a conceptual value mapping (unicode), conceptual characters, stored size when encoded and so on ... without firming up those abstractions with interfaces. It gets lossy.
FWIW, CPython uses one of several Unicode string implementation representations, depending on the code points involved:
>>> import sys
>>> s = 'A' * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
1049
>>> s = '\N{SNOWMAN WITHOUT SNOW}' * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
2074
>>> s = '\N{MUSICAL SYMBOL G CLEF}' * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
4076
See https://peps.python.org/pep-0393/ . Mentioned in the linked-to article with "CPython since 3.3 makes the same idea three-level with code point semantics".
>length is not ambiguous at all. Its the number of elements in the array
That's because you defined it first as "the number of elements in the array".
It is ambiguous however because that's not how people understand it when it comes to strings, and there are several counter-intuitive ways they expect it to behave.
Not to mention there might not be any "array". A string (whatever the encoding / representation) is a chunk of memory, not an array. That you can often use a method to traverse it doesn't mean it's in an array.
The python doc says "str" is an immutable sequence of unicode code points. Since it implements __getitem__, it's fair to call it an array (it has a length, and allows indexing). I couldn't find out in the documentation whether the __getitem__ is O(1), which I consider a deficiency -- this should definitely be well documented.
It doesn't really matter how some people think "how people understand" something, the documentation matters. Any string in any language is some ordered sequence of atomic text-like objects, so python's approach isn't unreasonable or unexpected, either.
>Since it implements __getitem__, its fair to call it an array (it has a length, and allows indexing)
Well, weren't we talking about things being "ambiguous"?
In Python we call what you describe a list. An array is something different. And people would expect something like the C (or the Java) data structure. In Python that would match the "array" lib package.
And that's just discussing the meaning of array - before we even get to whether a string is an array, and what this means.
>It doesn't really matter how some people think "how people understand" something, the documentation matters
In what universe? In practical use, clarity and non-ambiguous, least surprise names and semantics matter.
"But we clarify it in page 2000 of the documentation" is not an excuse. Nor is invoking moral or professional failings of those not reading the documentation. A good library design doesn't offload clearing ambiguity to the documentation.
>Any string in any language is some ordered sequence of atomic text-like objects
You'd be surprised. Especially since this isn't 1985 where strings were a bunch of 8-bit ascii characters, or even 1995, when widechar 16-bit arrays were "good enough" for Windows and Java, but we have not just non-ascii strings, but even variable length (e.g. utf-8) internal strings in mainstream languages.
Tell me a language where a string isn't an ordered sequence of elements of some atomic text-like data type. Those may have different types - like utf8 bytes, bytes, unicode code points, grapheme clusters etc. But these are all some sort of representation of text at some level. Which one a programming language uses depends on the language, and should be checked in the documentation. It's not like some obscure “check page 2000” of the doc type small print, implying that you need to read 18 tomes of language doc before you can work with the language. No, but if you want to work with strings in any programming language, you should know what type the elements consist of.
Btw, python may try to overload the meaning of the words array and list, but the word “array” has a generic meaning in this branch of math called computer science (an ordered sequence of elements indexable in O(1)), which is how I used it here.
Though C# also recommends the Rune APIs for more modern/better code point handling. The Rune APIs have a bit more in common with Python 3's unicode treatment than the classic (and sometimes wrong) UTF-16 approach.
Counting grapheme clusters is a hard problem because it depends on the font that is being used. It only exists at render time in the context of a specific client.
If the user can freely change a font it is impossible to send a string of 3 grapheme clusters because you won't know if it actually will show up as 3 to the client or a different number.
What do you mean there is no such thing as a character when grapheme cluster is exactly that? This is also the out-of-context length, and people get confused because instead of this human-context attribute they've been forced to use all the other alternatives, which require more knowledge.
Characters in context are printable or non-printable/formatting marks right? I agree they probably meant grapheme clusters, but grapheme clusters can vary dramatically in width so the point of the conversation was to explain why a bounding box was a better approximation of their goals.
They do vary in width, but with a proportional font that’s true even with ASCII text. What grapheme clusters tells you is how many times you have to press the arrow key/backspace to get to the beginning of the string.
Only if the text editor made some bad assumptions. You're forgetting about non-printable characters, such as the LTR mark. These are not part of grapheme clusters (or are their own grapheme cluster), but the cursor shouldn't probably stop at them.
You know it's been a long time since this conversation but I think, reflecting, it has to do with grapheme clusters not being particularly consistent across operating systems and over time. The article even has an example where the same 5 USVs come out as either 1 or 2 graphemes depending on the Unicode version.
For the UK, you also need to represent the Celtic languages. You'll need at least these letters: â, ê, î, ô, û, ŵ, ŷ, à, è, ì, ò, ù, ẁ (maybe ỳ?), á, é, í, ó, ú, ẃ (maybe ý?), ï (maybe more ¨), ...
When using latin-1/latin-15/iso-8859-1/iso-8859-15/cp1252 that statement is true. With utf-8 it is two bytes (c2 a3), if a software uses utf-16, ucs-2, etc. it may be more.
(c) Visual width will change depending on the system.
But what about (d) grapheme count? If I make a microblogging site which limits post length to 144 graphemes, can my database invariants break when I upgrade my version of Unicode?
> Not until they have to explain to their designer why they can't limit a label to '10 characters.'
Or in a single font. It's impossible to render any mixed combination of simplified Chinese, traditional Chinese and Japanese with a single font (Korean might be also involved, but not sure about that). Even in Unicode, characters might share the same space which don't have anything common in their looks, nor in their meaning. That applies to the shared CJK space as well.
Btw. Japanese has halfwidth and fullwidth characters.
> It's impossible to render any mixed combination of simplified Chinese, traditional Chinese and Japanese with a single font (Korean might be also involved, but not sure about that)
Well, that could be phrased better. Many such mixed combinations would encounter no problems. There is "Han Unification" in Unicode, in which certain graphical forms are declared equivalent and the intent is that they display as Japanese characters if you print them in a Japanese font, but as Chinese characters if you print them in a Chinese font. 直 is a good example of how that looks; try viewing it in different fonts.
But nobody likes unification and explicit fixed forms are constantly being defined so that it's possible to talk about them. Imagine if I wanted to write "in Old English, the word for dog was hund"... except that your font automatically replaced the sequence hund with a special ligature that looks exactly like dog.
So we have separate unicode points for ⻘ (modern, CJK RADICAL BLUE) and ⾭ (old, KANGXI RADICAL BLUE), and for ⿓ (traditional Chinese, KANGXI RADICAL DRAGON), ⻰ (simplified Chinese, CJK RADICAL C-SIMPLIFIED DRAGON), and ⻯ (Japanese, CJK RADICAL J-SIMPLIFIED DRAGON). Interestingly, the dragon characters are all considered different according to the original "Han Unified" specification, where they are CJK UNIFIED IDEOGRAPH 9F8D, CJK UNIFIED IDEOGRAPH 9F99, and CJK UNIFIED IDEOGRAPH 7ADC. In contrast, there is only the one "unified" form of 直, CJK UNIFIED IDEOGRAPH 76F4, but you can refer to its Chinese form explicitly with CJK COMPATIBILITY IDEOGRAPH FAA8 and to its Japanese form with CJK COMPATIBILITY IDEOGRAPH 2F940. (My browser font fails to render either of those.)
It was never possible to rely entirely on the font to handle dealing with simplified vs traditional characters for you, for the obvious reason that their mapping is not one-to-one. In simplified Chinese, 后 means "after"† or "behind" and it also means "empress". In traditional Chinese, "after" and "behind" would be 後. And "empress" would be... 后. This means there can be no way for a traditional Chinese font to determine what it should display if you write 后.
Ultratraditional Korean hanja participate in the same variation of forms that we see between Chinese and Japanese. But it isn't normal to write Korean in hanja outside of very specific contexts. Hangul are radically different and belong to a separate part of unicode entirely.
† "After" in time. "After" in sequence is 下, "below".
>It's particularly bad because so often (esp. for western programmers) 'countOfCodePoints' == 'countOfBytesInUTF8' == 'countOfGraphemeClusters' == """length""" so it's hella easy to accidentally write buggy software.
Then programmers will pick a random view and assume its length equals the number of characters and bytes. Also the grapheme view will introduce an OS-dependent bug.
I wanted to brainfart that the length in the typical assumed usage should be 1, ignoring the emoji's inner Unicode encoding... But your comment was spot on and showed me my own assumption would fall into exactly this view scheme.
5 makes perfect sense to me; the author's complaints seem kinda silly.
An area this makes sense is, what do you expect to get if you do something like:
emoji = " "
print(emoji[:3])
Should this throw an error because there's only one displayed "character"? Should it return only a partial codepoint by returning only the byte data for the first 3 bytes?
Modern strings are complex objects that have evolved a bit past char[] or byte[].
> Strings are just an array of unicode codepoints rather than "characters", so all I'm doing is asking for the first three of those codepoints.
"Ice trays are just a pile of molecules rather than "cubes", so all I'm doing is separating those molecules", he states as he activates the igniter.
> Substring is a broken operation? What's the justification for that idea?
You take a thing and you mangle beyond recognition without regards for its purpose or meaning. That's like considering the jaws of life a normal part of opening a door to take a piss at work.
I think this is where the misunderstanding comes in. Python doesn't treat strings as char[] but as essentially unicode_codepoint[].
Whether this is a good idea on the whole is debatable, there's even a full PEP talking about the security concerns around doing it this way[1].
However, given this is how it works, the behaviour displayed makes complete sense to me and is the best of the bad choices presented by needing multi-byte strings.
> For a language where you're not supposed to need to know the byte size semantics, the correct length should be 5. What am I missing?
In the words of the article: “The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing.”
“Not needing to know the byte size semantics” seems reasonable, but it simply isn’t a useful goal. The things it makes easier or faster (knowing how many code points there are, and O(1) indexing by code point) are things you shouldn’t be doing—and when you have to interact with the rest of the world, you now have a more expensive encoding step that is always needed, rather than just sometimes if you’d chosen UTF-8 or even UTF-16.
... but maybe it simplifies and speeds up the internal processing? I haven't looked at Python 3's C implementation of strings, but that is a guess. Also, IIRC, Python 3 has the ability to keep different internal representations of strings and uses the most compact one. If all characters are 7-bit ASCII, it uses a bytes representation. That's what I remember from Python dev discussions long ago.
But the overall tone of the article still reads as bashing. Caring about the internal representation of strings and bashing UTF-32 feels lame and angry. (Especially if I'm right about the multi-rep nature of Python 3: their choice is good for most text, and they could add a UTF-8 internal rep in the future, although that would probably break enough code that expects the UTF-32 value for len() that it is not worth it.)
No. Python internally would be made much faster by working on pure UTF-8. Absolutely nothing internal to the language uses the operations that code point semantics speeds up.
Since you mention the varying internal representation of strings: that’s PEP 393 <https://peps.python.org/pep-0393/>, which landed in CPython 3.3, and it generally made things slower by introducing a lot of branching and reallocating and such, though it does speed up some cases due to having to touch less memory, and some methods due to being able to quickly rule out possibilities (e.g. str.isascii can immediately return False for a canonical UCS-2 or UCS-4 string, since if they were ASCII they’d have been of the Latin-1 kind).
PEP 393 was done because people were complaining about how much memory their UCS-4 encoding had been using.
Note also how PEP 393 retains code point semantics: Latin-1 (Unicode values 0–255), UCS-2 or UCS-4; all fixed-width encodings of code point sequences. PEP 393 does also allow a string to cache UTF-8 representation (see PyCompactUnicodeObject.{utf8, utf8_length}), choosing “UTF-8 as the recommended way of exposing strings to C code”, but I gather this isn’t used very much.
> I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data, vs how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them.
Well, why not? There are a lot of things that people would want to call string.length for — drawing little equals signs under text in a terminal, for a frivolous example — where that’s the whole reason they’re making the call. Off the top of my head I’m not really sure how you solve that with variable-width characters if there’s no way to separate out or count them.
Who said there shouldn't be one? The point is there should be more than one, and that not all are a language/string library-level concern.
The context is "I guess I'm basing this all on the idea that it's almost always a mistake to confuse how a program manages some data, vs how a drawing lib might. Your language shouldn't concern itself with how many glyphs it needs to draw... until you actually try to draw them."
This means that shouldn't be some generic "length" method, but appropriate separate-concerns methods (plural), some of which (e.g. regarding character width in pixels when rendered) even belong to a drawing lib and not the language at all.
The parent's point is that length (bytes), characters (count), and glyphs (size, shape) are different concerns. The latter would concern a drawing lib or a renderer, but not be a core string method (which should concern itself with the abstract notion of characters and the concrete notion of bytes).
As far as I can tell, you're only missing two things:
1. It's five "Unicode scalars," that's the name for the top-level logical unit. The term "code points" technically refers to a lower-level concept, one that varies across encodings, just not as much as the number of bytes. I didn't know that, and it's the helpful thing I learned from this article. UPDATE: And it's also not true, sorry. "code units" are the lower-level concept from the article, "code points" are a more expansive category at the same level: https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#G740...
2. The author takes it as an unstated assumption that top-level logical structure is useless because any specific usage either ignores all structure or has a point at which low-level structure comes into play. (That assumption is false: Top-level structure is useful for keeping track of what you are doing and as a sort of "common currency" for translating between different low level representations. For example, see the very first table in the article.)
> The close second being 17, because length in bytes. Is another fine way to represent this data, e.g. what a successful write of some sort would look like. Network or file.
Almost. 17 is the number of bytes it occupies in memory. But you don't generally dump memory directly to disk or network. It happens to make sense (and it's convenient) for utf8 strings. But it's better to be explicit about that. Python is better. If you care about bytes, say you care about bytes:
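Something like this (writing the article's emoji out as escape sequences):

    s = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"
    print(len(s))                  # 5: code points
    print(len(s.encode("utf-8")))  # 17: the bytes you'd actually write to a file or socket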
I think he meant to bring out defensiveness with that quip. He never says that it's a big deal, just that it's the worst way to get the length of a string containing emoji, presumably of the mainstream languages.
> Note about Python 3 added on 2019-09-09: Originally this article claimed that Python 3 guaranteed UTF-32 validity. This was in error. Python 3 guarantees that the units of the string stay within the Unicode code point range but does not guarantee the absence of surrogates. It not only allows unpaired surrogates, which might be explained by wishing to be compatible with the value space of potentially-invalid UTF-16, but Python 3 allows materializing even surrogate pairs, which is a truly bizarre design. The previous conclusions stand with the added conclusion that Python 3 is even more messed up than I thought!
You can’t use that information to know how much actual space it takes (in storage) as nobody sane stores UTF-32, you can’t use it to know how much logical space it takes (aka the user’s interpretation), you can’t use it to know how much visual space it takes (not that you can ever get that), and you can’t use it to segment or process the text.
A length in codepoints gives you nothing that’s really actionable, at least not that you’d need outside of a context where you could easily obtain it otherwise.
> It is useful: When iterating over a string in Python (which I hope you agree _is_ useful?), you get that many parts.
That’s… not useful?
I can’t say I remember ever caring how many items I would be getting during an iteration[0]. If I want to set an iteration limit I can just… do that, using `islice` or some such.
[0] in python anyway, in lower level language there can be a utility in order to pre-allocate an output collection
What is useful? What should it be? Don't say bytes because there is already an idiomatic way to get bytes: `len(bytes(s, enc))` which is both more correct and explicit.
Maybe it should just return None because the only useful thing is probably how much "space" it occupies on screen in a fixed-width font, but that's too difficult to know.
That's not true though. It just counts the number of code units, that's not version dependent. It's certainly no worse than counting the number of UTF-16 points (I'd argue it's better since it's less arbitrary - whether something is a unicode scalar is a design decision, whether something is in the BMP or not is mostly an accident of implementation).
I'm not a fan of "everything you know about X is wrong" articles. Very often they try to present some little tidbit of knowledge as a revelation and mislead the reader in the process.
In this case, the tidbit is: "grapheme clusters exist and they are useful".
The misleading part is that the article draws a false equivalence between what the author calls "UTF-32 code units" and UTF-16 code units.
UTF-32 code units are Unicode code points. This is a general Unicode concept that exists in all Unicode encodings. UTF-16 code units, on the other hand, are an implementation detail of UTF-16. It is wrong to present them as equally arbitrary concepts.
> UTF-16 code units, on the other hand, are an implementation detail of UTF-16.
Would that it were only so. Instead, UTF-16 ruined Unicode for everyone with the abomination that is surrogates, and almost nothing that deals with UTF-16 actually asserts well-formedness, and ill-formed UTF-16 cannot be represented in UTF-8 or UTF-32.
UTF-32 and UTF-8 code units are truly implementation details of their encodings, as other encodings don’t need to know about them in any way. UTF-32’s code units are a trivial mapping between scalar values and 32-bit values (not four-byte values, given the big- and little-endian variants), but that still causes UTF-32 code units to be semantically distinct from Unicode scalar values. U+12345 is a Unicode scalar value and doesn’t have any “size”: it’s an abstract value. 0x00012345 is a UTF-32 code unit, a 32-bit value.
If you’re talking about encoding of Unicode scalar values, you talk about code units. Even when talking about UTF-32, the code unit/scalar value semantic distinction is worth maintaining.
> Instead, UTF-16 ruined Unicode for everyone with the abomination that is surrogates,
UTF-16 is a hack. Unicode originally thought 65,536 values should be enough to represent all human languages and so 16-bit fixed size characters would work. However, that proved incorrect. UTF-16 was a hack to retrofit the larger code space onto systems that had already adopted 16-bit characters (Java, Windows NT, etc).
I don't see what's inherently wrong with UTF-16 surrogates. If I am not wrong, a given UTF-16 code unit is unambiguously either a complete code point, a first surrogate, or a second surrogate.
Why should we expect invalid utf-16 strings to be representable in utf-8 or 32? I don't see anyone trying to represent invalid utf-8 in utf-16 or 32.
> Why should we expect invalid utf-16 strings to be representable in utf-8 or 32?
We shouldn't care. UTF-16 should just be an encoding and its internal details shouldn't leak into Unicode code points. There's just no good reason to exclude code points U+D800–U+DFFF merely because 0xD800–0xDFFF happen to be used specially in UTF-16 encoding, just like U+0080–U+00FF aren't excluded merely because (most of) 0x80–0xFF are used in UTF-8 encoding.
Is having a hole from U+D800 to U+DFFF such a big deal? The parent comment was specifically talking about surrogate pairs. That to me looks more like buggy implementation issue rather than standards issue.
As a hole, it would only be annoying and a performance penalty for validation. But by its very design, it will leak, and it does in such ways that it became the worst thing to ever happen to Unicode. I don’t know of a single language or library that uses UTF-16 for strings that validates strings: every last one actually uses sequences of UTF-16 code units, potentially ill-formed, and has APIs that guarantee this will leak to other systems. This has caused a lot of trouble for environments that then try to work with the vastly more sensible UTF-8 (the only credible alternative for interchange). Servo, for example, wanted to work in UTF-8, for massive memory savings and performance improvements, but the web has built on and depends on UTF-16 code unit semantics so much that they had to invent WTF-8, which is basically “UTF-8 but with that hole filled in” (well, actually it’s more complicated: half filled in, permitting only unpaired surrogates, so that you still have only one representation).
So: the problem is that the Unicode standard was compromised for the sake of a buggy encoding (they should instead have written UCS-2 off as a failed experiment), and every implementation that uses that buggy encoding is itself buggy, and that bugginess has made it into many other standards (e.g. ECMAScript).
That’s one of the two situations I speak of: when it happens in practice.
The other is… well, much the same really, but when it makes it into specs that others have to care about. The web platform demonstrates this clearly: just about everything is defined with strings being sequences of UTF-16 code units (though increasingly new stuff uses UTF-8), so then other things wanting to integrate have to decide how to handle that, if their view of strings is different: whether to be lossy (decode/encode using REPLACEMENT CHARACTER substitution on error), or inconvenient (use a different, non-native string type). Rust has certainly been afflicted by this in a number of cases and ways, generally favouring correctness.
The main issue is that it adds validation code (if one is sticking to the standard) for things that don't care about UTF-16 at all.
It does occupy 1/32 of the BMP, displacing a couple thousand potential actual characters (making them take an extra byte in UTF-8, and an extra two in UTF-16).
The UTF-32 equivalent is just the original UCS-4 — simply not enforcing any restrictions on the 32-bit value. Probably most code using UTF-32 does this, at least internally. (I can understand using high bits for metadata or non-Unicode points and have done so, but I don't see any reason for testing for surrogates outside of encoding/decoding UTF-16; they are indeed an abomination.)
>They’re not. UTF-32 code units have a 1:1 mapping to USVs, surrogates are not valid.
This is true, although very pedantic and irrelevant to the point of my comment. The distinction only matters when you're dealing with ill-formed strings.
BTW, Python strings can store surrogates.
>Is it? It’s not like they’re any more useful. Arguably less so, UTF-16 is at least a somewhat common storage medium.
If you aren't directly dealing with UTF-16, UTF-16 code units aren't useful at all.
Code points/USVs, OTOH, are the building blocks of Unicode strings and various Unicode algorithms operate on them. They're low-level, but not useless.
> I'm not a fan of "everything you know about X is wrong" articles.
But it’s not. That style is about tone and the article doesn’t exude that kind of tone.
Do you see the author scolding programmers for being ignorant Americans, for having unknown unknowns, or for not being “professionals”? Well, me neither.
In Julia, iterating over a string by default behaves like `each_char`.
`codeunits(str)` lets you access the underlying code units, which is bytes for the default UTF-8 encoding. (External packages implement UTF-16 and others, and there `codeunits` could return non-bytes, for eg. 16-bit values for UTF-16.)
The Unicode stdlib provides `graphemes(str)`, the equivalent of `each_grapheme_cluster`.
That means 7 is also a measure of bytes, just slightly more awkward. So it's roughly on par with 17.
For 5, the idea is that while you might want to iterate code points, the total number of code points is less useful than either grapheme count or byte count. I think that argument makes sense.
> That means 7 is also a measure of bytes, just slightly more awkward.
It's not a real measure of bytes though. It's the count of bytes in an encoding scheme that is (probably) neither what you use to communicate with the outside world nor what your language runtime uses. (And certainly it's no better than 5, since that's also a measure of bytes in a particular encoding).
Lots of systems use UTF-16 internally and externally. Counting bytes in UTF-16 is, on average, almost as useful as counting bytes in UTF-8.
I don't think just about anything communicates in UTF-32. 5 is basically just a codepoint count, and as such I don't think its usefulness rating should be between the byte counts.
> Lots of systems use UTF-16 internally and externally. Counting bytes in UTF-16 is, on average, almost as useful as counting bytes in UTF-8.
Not my experience at all. The article points out that even languages that are committed to an UTF-16 interface prefer to use other internal storage representations, and I can't remember the last time I saw it used in a transfer format.
I hate UTF-16 and the systems that use it with a passion, but...
Windows and Java (and Javascript) adopted unicode at a time when it was thought that 64k code points would be enough for everyone. Then they prioritized backwards compatibility over anything else. Most of us have benefited from their insistence on backwards compatibility in some form or the other, so I'm really not in a position to complain about it :-/
That said, IMHO any "length" property (as opposed to `codepoints` or `bytes`) on a UTF-16 string should definitely be deprecated.
Windows, Java, C#, javascript, a surprising number of XML documents (though less so as time marches on thankfully), ICU I think uses UTF-16 internally (for the same historical reasons as the other 4), Joliet file names are UCS-2, some phones interpret “16-bit” SMS as UTF-16 (the spec says UCS-2).
> and BOTH of those are insane for sticking to it
They don’t really have much of a choice because they exposed those semantics as part of the string interface (or for Windows the interaction is so low level it can’t be hidden), and they have performance guarantees and behaviours which match that.
It’s also why Python uses UTF-32, and went through the entire PEP 393 / flexible string representation complication to try and stop blowing up memory left and right: the core team considered that switching strings to UTF8 was a bridge too far.
There are approximate solutions, but they come with their own costs and complications (e.g. pypy uses UTF8 strings with lazily constructed indices to emulate UTF-32 strings).
I'm not a Windows based programmer, but couldn't they leave the old APIs in place, but make UTF-8 safe versions available for everyone and switch to that... e.g. with Win 11?
You can set the system codepage to CP_UTF8 since Win 10, I guess, although IIRC it still doesn't work for input. But a) there is a lot of programs using A() functions that don't expect that and break in subtle ways, e.g. DBCS-encoding-aware programs suddenly break because they don't expect a codepoint to span for more than 2 bytes; b) most of the sanely written programs either use UTF-16 explicitly, or use UTF-8 internally and convert between UTF-8 and UTF-16 before/after calling W() functions.
The JavaScript language forces utf16 (whether or not v8 uses that representation under the hood). For instance if you want to substring, the indexes you pass are UTF-16 code units.
I think that argument makes as much sense as saying that an engine is less useful than a car. And pretending that engine.weight should return the weight of the car.
It makes just as much sense as 17 (for utf8) in a JavaScript context, where charCodeAt(i) returns a UTF-16 code unit, and strings at least behave as though the implementation uses an array of uint16_t for the storage. UTF-16 is definitely not my favorite representation, but given that context (which the language imposes) 7 is an important number to be able to know.
Java loaded full unicode code point semantics into its standard `java.lang.String` class. These _are not guaranteed_ to have `O(1)` performance characteristics, because the underlying storage format is dynamically either a UTF-16-esque variant (with surrogate pairs for characters that don't fit in 16 bits), or a single-byte-per-char format if every character fits in Latin-1. This has the advantage of being very very slightly more obvious, given that both methods exist and are documented:
void main() {
    String x = "(that emoji here)";
    System.out.println("Chars: " + x.length());
    System.out.println("Codepoints: " + x.codePointCount(0, x.length()));
    System.out.println("As stream of chars (= UTF16-esque with surrogate pairs):");
    x.chars().forEach(System.out::println);
    System.out.println("As a stream of codepoints:");
    x.codePoints().forEach(System.out::println);
}
This ends up printing:
Chars: 7
Codepoints: 5
As stream of chars (= UTF16-esque with surrogate pairs):
55358
56614
55356
57340
8205
9794
65039
As a stream of codepoints:
129318
127996
8205
9794
65039
NB: Apparently many hackernews readers know java but don't use it all that often day-to-day. The provided java snippet is vanilla valid and can be executed with `java ThatFile.java` (no need to compile it first), though it does use preview features.
The fact that the codepoint counter is a very awkward `codePointCount` call has the dubious benefit of highlighting that this method loops through and therefore would be quite slow on very large strings.
Don't you still need the `java --source 11 ${filename_without_java_extension_because_JEP_330}` to use it? And you still need a wrapper class with a static method main in it.
I was a little puzzled by this compared to what I was used to with Java in the past. It looks like the grandparent's code relies on JEP 445 ( https://openjdk.org/jeps/445 ) which is a preview feature as was mentioned but it also apparently requires the very latest Java 21 which hasn't even been officially released yet.
> And you still need a wrapper class with a static method main in it
One of the preview features he's using is JEP 445[1] that allows you to omit the wrapper class, as well as the arguments to main and the public and static modifiers.
I encountered some real world unicode/emoji breakdown recently. I set my surname in a webapp to an emoji country flag because I needed a way to communicate where I was. Elsewhere in the app, it showed surnames as just their initial, e.g. "John S". There, mine showed as a featureless black flag rather than the flag I set. Presumably because that is the first codepoint of several that make up the flag.
> There, mine showed as a featureless black flag rather than the flag I set. Presumably because that is the first codepoint of several that make up the flag.
The country flags are each made of two Unicode code points, which Unicode calls Regional Indicator Symbols. There are twenty six, one for each of the Latin capital letters A through Z. These are used to encode a flag by writing the ISO two letter country code from ISO-3166-1 e.g. F + R is France, you get a French flag.
Given your black flag experience, and the fact this is an English language forum, I'd guess maybe you wanted a flag for some entity that isn't a UN member state or some sort of recognised similar entity (e.g. the European flag EU symbolising the continent of Europe) and thus doesn't have an ISO two letter code, such as California or Wales. Those are built from a waving black flag plus their long ISO-3166-2 region code.
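For the two-letter case, here's a sketch of how those flags are put together (assuming an ISO 3166-1 alpha-2 code; `flag` is just a made-up helper name), which also shows why taking the "initial" of such a surname breaks it:

    def flag(alpha2):
        # Regional Indicator Symbols run from U+1F1E6 ('A') to U+1F1FF ('Z').
        return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

    fr = flag("FR")
    print(fr)       # renders as the French flag, built from two code points
    print(len(fr))  # 2
    print(fr[0])    # a lone REGIONAL INDICATOR SYMBOL LETTER F: no longer a flag

The subdivision flags (Wales, etc.) work similarly but start from the waving black flag followed by tag characters, which is why slicing one down to its first code point leaves just a featureless black flag.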
Python 3’s approach snatched defeat from the jaws of victory.
They aimed to work with a nice, clean, abstract concept, untrammelled by encoding squabbles. They failed badly by choosing code points rather than scalar values (Unicode strings are sequences of scalar values, not code points—'\udead' is a valid Python string, but you can’t encode it into any UTF-* format since [U+DEAD] is not a valid Unicode string).
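A quick demonstration of that point in Python:

    s = "\udead"      # a lone surrogate: a perfectly legal Python 3 str
    print(len(s))     # 1
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)      # surrogates are not allowed in any UTF-* encoding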
Then they also neglected to observe that they were optimising for something that you should practically never be doing, so that now everyone has to pay the costs. As the article summarises it part-way through: “The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing.”
Seriously, Python 3’s approach is almost the worst of all available worlds. I loathe UTF-16 with such fiery passion that I can’t quite bring myself to say Python 3’s approach is worse than weak UTF-16, but it’s of similar badness in practical terms. The decisions were very clearly made by people that were not expert in the domain and who were caught up in a Concept of Mathematical Purity. They’ve since walked some of it back as far as they could, and I think did recognise it all as a mistake (no citation, just a vague memory of seeing such an admission), but they can’t fix it all properly without a breaking change.
> Unicode defines text as a sequence of code points.
Does it? Do you have a link?
[edit] I looked up the spec and here is what it says.
> The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters. An encoded character is represented by a number from 0 to 10FFFF₁₆, called a code point. A text element, in turn, is represented by a sequence of one or more encoded characters. [1]
The definition of 'text' in the context of Unicode seems to explicitly not be defined as a sequence of code points, but rather a more nebulous sequence of aggregations of code points. It's probably closest to a grapheme cluster but they seem to want to avoid pinning it down.
Review chapter 2.2 Unicode Design Principles in the Unicode Standard: "Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."
Text elements are an abstract concept whose definition depends upon what is being processed. It might be a grapheme, it might be a word, etc...
There might be something a little imprecise here: code points vs code units vs character codes.
I'm open to being wrong but I would be very surprised if they defined text as a "series of code units" the count of which can vary by encoding even for the same character. IMO in this context 'character codes' would likely be far more consistent with 'code points' and they're just trying to differentiate between styled and un-styled text. Whereas the 1.3 definition appears to be trying to make an authoritative definition of 'text.'
If we read 2.2's "character codes" as code points, then that can be multiple code points as referenced in 1.3
[edit] I originally flipped 'units' and 'codes' - cleaned it up.
"Character code" is short for "character code point" or just code point. All Unicode algorithms and properties are defined in terms of the code point. UTF encodings are just a way of encoding a code point. From Unicode's perspective, you care about what is encoded (i.e. the code point) and not how it is encoded (i.e. UTF-8).
Unicode is one of the most poorly understood topics. I think the confusion stems from 1. most programming languages getting the abstraction wrong, and 2. programmers trying to reconcile their non-technical interpretation of what "character" means.
I agree with everything you said, I think I'm just trying to reconcile that with the top of thread saying python was the most correct because it was returning '7 code points' and that 'UTF-whatever is an implementation detail'
But 7 is not the number of code points/USVs - that's the number of UTF-16 code units. The string is 5 USVs. If UTF-whatever is an implementation detail, wouldn't the correct answer to length be 5?
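Assuming the string in question is the man-facepalming emoji sequence from the article (my assumption, spelled out as explicit escapes below), Python makes the three counts easy to see:

s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # facepalm + skin tone + ZWJ + male sign + VS-16
print(len(s))                           # 5   code points / USVs
print(len(s.encode("utf-16-le")) // 2)  # 7   UTF-16 code units
print(len(s.encode("utf-8")))           # 17  UTF-8 bytes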
Treating Unicode strings as a sequence of code points is a completely valid thing to do, but is usually not what you actually care about when dealing with text. Really, are code points any less of an implementation detail?
Code points are what you care about when you do any kind of text-based format encoding or decoding. Any of JSON, XML, HTML, YAML or whatever is defined by sequence of code points. There is no reason to complicate these with visual representation-specific concepts.
If you have to care about the visual representation of text then you probably need to be familiar with other concepts as well.
But, given the root ancestor of this comment, it’s worth clarifying that Python’s approach to strings doesn’t help at all with things like decoding JSON/XML/HTML/YAML; what Python gives you is random access by code point index, which you won’t ever need to use in such tasks.
Unicode defines text as a number of different types of things. They are sequences of codepoints, sequences of graphemes, sequences of grapheme clusters. Furthermore, codepoints are different depending on how you normalize them. Accented characters can be written two different ways and have a different number of codepoints depending on how you write them (and whether normalization is used).
Graphemes are a made-up human thing that, while useful, is locale dependent. Most people, when they talk about grapheme clusters, mean the default "locale-independent" graphemes, but that's not the only kind (in Hungarian, for example, 'ly' is a single letter). Having the same string be two different lengths in two countries is… let's go with surprising. The common denominator where everyone computes the same number is code points.
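The normalization point is easy to demonstrate with the standard library; a small sketch:

import unicodedata

nfc = "\u00e9"                             # 'é' precomposed, one code point
nfd = unicodedata.normalize("NFD", nfc)    # 'e' + U+0301 combining acute
print(len(nfc), len(nfd))                  # 1 2  -- same text, different code point counts
print(nfc == nfd)                          # False until you normalize both sides
print(unicodedata.normalize("NFC", nfd) == nfc)  # True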
There is no "most" correct, since the "length" of UTF encoded text is ambiguous. The point of the post is to highlight which semantics are the most useful and the tradeoffs.
Really the correct way to design string APIs would be to not have an ambiguous "length" at all, but to always require specifying whether you want UTF8-bytes, memory bytes, code points, graphemes, whatever.
However such an API would be pretty cumbersome, because for all non-edge cases (read: a Western language and a reasonable encoding for that language, which when looking at world demographics is a very narrow way of saying non-edge case) we just want to ignore all that fancy stuff, assume it's latin-1/ascii, use "Length" and get on with it, usually accepting that it doesn't work for many scripts or emoji.
So almost every api I have encountered has both the dangerous or ambiguous "length" and any number of the more specific counts. Good? No. But good enough, I guess.
A much worse related API that exists everywhere is the one for parsing and formatting numbers to and from text. How that's done "depends", but most languages I have seen unfortunately offer a "default way". In the worst examples - looking at you, .NET - this default uses the system environment and assumes formatting and parsing numbers should use the OS locale. Horrible, horrible idea when used in conjunction with automatic type conversions. WriteLine($"The size is {3.5}"); shouldn't print "3.5" in the US and "3,5" somewhere else.
>Horrible horrible idea when used in conjunction with automatic type conversions. WriteLine($"The size is {3.5}"); shouldn't print "3.5" in the US and "3,5" somewhere else.
Because it’s only (maybe) a good design if the output is to be read by a human, and that’s not a very general case. Instead people unknowingly write, for example, some exporter for a text format with code that writes "X={x_coord}", and it passes all the unit tests and all the acceptance tests, and then it breaks once it hits a French or Scandinavian machine.
A great example how bad it is would be that the C# compiler repo for a very long time had tests that failed for everyone with non-US formatting.
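The same failure mode can be sketched in Python, which only applies the locale when you explicitly opt in (the de_DE.UTF-8 locale below is an assumption and may not be installed on a given machine):

import locale

x = 3.5
print(f"The size is {x}")        # locale-independent: always "The size is 3.5"

try:
    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")   # assumed to be installed
    print(locale.format_string("The size is %g", x))     # "The size is 3,5"
except locale.Error:
    print("de_DE.UTF-8 locale not available on this machine")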
Unsurprising that (at least some implementation of) Swift does the least wrong thing in returning 1. I think it's also one of the few languages that will return a count of 1 for the madness that is country flag emojis https://docs.swift.org/swift-book/documentation/the-swift-pr...
“Least wrong” sounds very silly. It's like programmers are discovering there's a difference between bytes, Unicode code points and grapheme clusters, are unsure how their favorite programming language represents strings, and then decide there should be some behavior that doesn't follow from the documentation.
The “length of an emoji” depends on the data type used to represent it. It's that simple and that correct.
I have read somewhere that you should learn 2 or 3 programming languages from the get-go. If you learn only one, you run the risk of letting its shape dictate how you mentally model computation. At some point someone who learned a dynamically typed programming language first is bound to find out why data types matter.
I had a "programming languages" class that did that, where we did assignments in Python (scripting), OCaml (functional), and Prolog (logic). This is because most other classes used compiled imperative languages such as C++ and Java.
I definitely don't have talent for logic and quantitative thinking. It takes a long time and many iterations for even simple concepts in mathematics to sink in for me. I benefited greatly from learning first Scheme and also making sense of C and OS internals before trying to grok interpreted languages. I'm currently trying to get some proficiency in Go and it's been great fun!
I think this is a really a naming convention issue. Len() is ambiguous, you really want either num_chars() or utfxx_len(). Of course, the issue of what counts as a character is confusing in its own right...
In Python len() on a bytes type gives you the number of bytes, and len() on a str type gives you the number of codepoints. I think that makes sense, as strings are only intended to deal with text, and you should never have to worry about byte indexing at all.
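A small sketch of that split between the two types:

s = "h\u00e9llo"                 # 'héllo'
print(len(s))                    # 5 -- code points
print(s[1])                      # 'é' -- indexing a str yields a one-code-point str
b = s.encode("utf-8")
print(len(b))                    # 6 -- bytes ('é' is two bytes in UTF-8)
print(b[1])                      # 195 -- indexing bytes yields an int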
As someone who has done both, I'd say that argument is wrong. It is much more convenient to index by code point. Indexing by bytes is almost always what you don't want to do, and leads to a lot of errors.
In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.
A codepoint is the "smallest useful addressable unit" when dealing with Unicode text, so it makes sense that's the default.
It's also comparatively expensive to address grapheme clusters.
> In many cases it's not very useful, but there are clearly cases where it is, e.g. if you want to normalize text, compose/change emojis, stuff like that.
I can see that iterating through by codepoint could be useful for some of those cases, but I still can't see why you'd ever want to index by codepoint?
For the same reason you want to index anything: to slice, remove, etc. E.g. to replace a skin tone in an emoji: "str[i] = 0x1f3ff", or to insert one: "str = str[:i] + chr(0x1f3ff) + str[i:]" (a runnable sketch follows below).
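A runnable version of that idea (replace_skin_tone is my own illustrative helper name, not an API from the thread):

# Skin tone modifiers are the five code points U+1F3FB..U+1F3FF.
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def replace_skin_tone(s, new_tone):
    # Rebuild the string code point by code point; Python strs are immutable.
    return "".join(new_tone if ch in SKIN_TONES else ch for ch in s)

facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
print(replace_skin_tone(facepalm, "\U0001F3FF"))   # same emoji with the darkest skin tone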
But that's a pointlessly inefficient way to do it - surely what you want there is to iterate and transform rather than scan through and then slice? (And don't you need to group by extended grapheme cluster rather than codepoint anyway for that to make sense?)
The wife and I have a Google sheet that we use for our shared calendar - we put an emoji before each "event", and in the top row of each day I show the emoji for that day's entries. But I need to do:
Mind you, this is inefficient due to unnecessarily constructing an array. Here’s a more efficient version, though the difference will normally be fairly slight:
function codePointLength(str) {
  // for...of iterates a JS string by code point, not by UTF-16 code unit,
  // so this counts code points rather than str.length's code units.
  let len = 0;
  for (const c of str) {
    len++;
  }
  return len;
}
Kinda sad there are no equivalents to the Array methods that work on iterators. Array.prototype.reduce.call(str[Symbol.iterator](), (a, _) => a + 1, 0) doesn’t work since those methods only work on array-like types (meaning those with a length property and indexed by number—and yes, all these Array methods are explicitly defined that way deliberately so you can use them on other array-like types), not iterators.
Caution: Intl.Segmenter may not be available, so be sure to have a fallback if you want to use it. Chromium shipped it 2½ years ago, Safari 2 years ago, and Firefox hasn’t shipped it yet. (No idea why and I haven’t looked. It’s not always the case: I know of other Intl things that Firefox has shipped first.)
.each_codepoint.size is more efficient than .codepoints.size, as it creates a sized Enumerator that avoids needing to build an intermediate Array. For strings with only single-byte characters it reduces to returning the already-stored byte length.
Same goes for .each_byte.size, but for that you have the faster .bytesize method that avoids the intermediate Enumerator.
Very good and informative article, though it's still not convincing that the nudge to make the shortest "len" call return the human-readable count of grapheme clusters, as in Swift, isn't the best design approach; all the non-intuitive sizes should be the special-purpose calls.
The article shows that the Swift approach produces different values for length depending on operating system and text library versions. Is that really intuitive?
The Swift approach can't reach perfection in isolation because data from the future can always break it.
That's why in the article you see Swift running on Ubuntu 14.04 returning len==2 while the same code on Ubuntu 18.04 returns len==1 for the same emoji string.
IMO that's a big philosophical question here: do we accept that "string length" means something you can't compute for arbitrary strings unless your code is receiving annual updates containing the latest Unicode interpretation instructions?
Swift includes its own Unicode data tables with the standard library since last year, so it’s now tied to the stdlib version rather than some other library that may or may not be updated on the system.
Your example shows an improvement, which proves my point (also don't drop the word asymptotically, nothing can ever be perfect, that's not the issue, being closer to perfect is a positive)
And you can compute it, you can pin a Unicode version and ship it in the language if those platform differences are unbearable (so, you can actually isolate it and simply ignore the future :))
The bigger philosophical question: how much longer do we accept that "string length" does not measure the most intuitive notion of string length, and keep calling a byte a char?
I cannot think of a single common case where grapheme cluster count is important. If you want to print them aligned to a terminal - guess what, double width characters exist, so the only reliable way is to print them first, measure the cursor movement using escape sequences, calculate length and erase the originally printed data.
Even for limiting input field sizes byte count is much better, as otherwise you are opening up yourself for unicode denial of service. I think the game Minecraft has such an exploit where you can fit in absurd amounts of utf-8 data (to the point of data corruption in multiplayer games) since it's limited by visual length.
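A sketch of why a grapheme-based limit invites that kind of abuse:

payload = "e" + "\u0301" * 10_000       # one letter followed by 10,000 combining acute accents
print(len(payload))                     # 10001 code points
print(len(payload.encode("utf-8")))     # 20001 bytes
# A grapheme-cluster counter would call this a 1-"character" string,
# so a visual-length limit would happily accept it.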
My personal favorite dealing with UTF-8: pretend it's ascii and assume everything above 128 is an alphabetic character. It just works. For 99% of use cases it doesn't matter if the content is emojis, families of emojis, or ancient sumerian scripts. You can parse JSON and most other formats this way without caring about code points at all. The trend of unicodizing everything was a mistake, just treat strings as bytes and parse them as utf-8 only when you really need it (like when building a text editor or a browser engine from scratch).
But it does - the genius of utf-8 is that it was deliberately designed to be backwards compatible (it even preserves the ascii sorting order). You can run C programs written before utf-8 was invented with utf-8 inputs (unlike with the abomination that is utf-16).
If a code point is outside the ascii range (0-127 inclusive), then its utf-8 encoding is also guaranteed to not contain any ascii bytes. So as long as you treat anything in 128-255 as "some unknown character", the utf-8 code points will be preserved and eventually displayed when the byte sequence is parsed as utf-8 by your terminal/browser/whatever.
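A quick sketch of why that works: splitting on an ASCII delimiter can never land in the middle of a multi-byte sequence, because every byte of a non-ASCII character is >= 0x80.

row = "na\u00efve,\U0001F926\U0001F3FC\u200D\u2642\uFE0F,stra\u00dfe".encode("utf-8")
cells = row.split(b",")                       # parse structure without decoding
print([c.decode("utf-8") for c in cells])     # ['naïve', '🤦🏼‍♂️', 'straße'] -- nothing got mangled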
mysql> WITH chars AS (SELECT ' ' c)
-> SELECT LENGTH(c), CHAR_LENGTH(c) FROM chars;
+-----------+----------------+
| LENGTH(c) | CHAR_LENGTH(c) |
+-----------+----------------+
| 17 | 5 |
+-----------+----------------+
1 row in set (0.01 sec)
Note that the emoji doesn't seem to render in preformatted text on HN.
This should be easier to reproduce:
mysql> WITH chars AS (SELECT 0xF09FA4A6F09F8FBCE2808DE29982EFB88F c)
-> SELECT CONVERT(c USING utf8mb4), LENGTH(c), CHAR_LENGTH(c) FROM chars;
+--------------------------+-----------+----------------+
| CONVERT(c USING utf8mb4) | LENGTH(c) | CHAR_LENGTH(c) |
+--------------------------+-----------+----------------+
| | 17 | 17 |
+--------------------------+-----------+----------------+
1 row in set (0.00 sec)
For all intents and purposes, a user will count it as one character. Truncating the string without including the whole cluster would change the meaning of it, and is not an operation anyone would do as a general purpose thing any more than someone would want to randomly replace the last character with random letters.
It looks like one character. I'd rather APIs let us continue pretending it is one character.
It's not weird because it's not a 160 limit in the sense of characters. Non-English language people are I guess more aware of this -- using a character not in the standard English a-z often counts as two characters.
Can anyone comment as to whether there are any problems associated with using emojis to enhance the entropy of passwords? For passwords you only ever need to autofill but never actually type, I feel like it would be an easy way to augment passwords, but I don't know whether it would translate directly in every situation.
All these abominations are because of non-strict typing
String = List ( Char )
Chars don’t have a length, just as a number doesn’t have a length - unless you talk about the number of bits. If you are working with strings, stick with strings. The length of a string holding a single character should be 1. Just enforce proper typing. Anything else is not consistent.
No, it's caused by cost. For example, Java has a char type. It's a 16-bit numeric value because Java uses UTF-16 internally for encoding strings. Java Strings are basically immutable char arrays with some fluff around them. If you ask for the String length, it returns the length of the underlying array. Nice and simple and unsurprising. And relatively cheap. Most more recent languages use 8-bit bytes and UTF-8 instead, because that has emerged as the most common character encoding. But UTF-16 was a reasonable choice a quarter century ago, the practical difference doesn't matter that much, and changing it would be disruptive.
If you put unicode characters consisting of multiple code points into a String, it necessarily increases the number of chars. There's no way around that, because there is no such thing as a UnicodeChar type in Java. You can't actually assign a multi-code-point unicode character to a char.
Essentially all the workarounds for a 'correct' unicode character count in a String would either end up using a different and probably way more expensive data structure (e.g. a list of lists of chars or bytes, where each inner list is a unicode character) or implementing some expensive logic for counting characters that is O(n) instead of O(1). Most languages, ranging from extremely strictly typed to weakly typed, don't do that for cost reasons. The tradeoff is simply not worth the price.
Python has strong typing which seems to be what you mean here rather than strict typing.
A "character" is not a well defined term in Unicode, rather the "base" that does not vary across implementations is code points, which is what Python measures when you get the length of a string.
Am I wrong for assuming the .length should return a length in bytes? If you want to use 32bit units, then multiply your output by 4.
If you want to do Unicode string manipulation and length counting, then use specific functions for that - but the base internal .length function should just output bytes.
The most obvious use case for length is iterating over the string and indexing it. In JS (or Go, Rust, Python) indexing and iteration is not byte based. As has been said elsewhere, length depends on the context/way you use it.
In Rust you need to specify what it is you think you're going to "iterate over" in a string.
You can't just "iterate over a string" because that's not a thing. You can get an iterator over the bytes in the string, with "foo".bytes() or you can get an iterator over the Unicode scalar values in the string with "foo".chars(), or you can iterate over a UTF-16 encoding of the string with "foo".encode_utf16()
You can index into Rust's strings, but you need to specify slice indexes; you can't just treat this like it's an array, because that's not what it is. If you want a slice of bytes you can have one cheaply, via as_bytes(), which is a [u8], and you can index directly into that slice as with any array of bytes, but you can't mutate it and those aren't characters, they're just bytes.
> Do you think the length of an `int64_t[3]` array should be 3 or 24?
There should be functions to do both: sizeof(int64_t[3]) (i.e. 3 * sizeof(int64_t)), for example, to get bytes.
In this example, the base function should do bytes, and there should be a unicode function to count it in other ways.
I could be sizing to fit in a database, or send over the wire, or I might want visible space on the screen, or I might want to know how to move the cursor.
Each of those types of length should be supported.
It's useful if you want array-like semantics (e.g. O(1) lookup) on Unicode text strings, because you have a fixed size for every codepoint, unlike UTF-8. Python, for example, uses it internally.
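That internal choice is visible, in a hedged and CPython-specific way, through sys.getsizeof; the exact numbers are implementation details and vary by version and platform, but the per-code-point width grows with the widest code point in the string (PEP 393):

import sys

print(sys.getsizeof("aaaa"))           # all code points < 256: 1 byte each internally
print(sys.getsizeof("aaa\u0416"))      # a BMP code point >= 256 forces 2 bytes each
print(sys.getsizeof("aaa\U0001F926"))  # a supplementary code point forces 4 bytes each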
I think that's the issue here. People disagree on how useful or not useful it is. It's maybe not ideal, but I don't think it's anywhere near so bad as to be entirely not useful. Strings-are-sequences-of-bytes is worse in my opinion. Python literally used to have that. It was worse.
The problem with what Python used to have is that the encoding wasn’t fixed.
I’ll agree with you that strings-are-sequences-of-bytes is bad. That’s painful compiler-flag, codepage, &c. territory.
But what’s not bad is strings-are-sequences-of-code-units. That’s what Rust has, for example. Rust strings aren’t sequences of bytes, but of UTF-8 code units, and the two are semantically very different.
Once again, strings are not simple sequences of characters. It's also useless to "index" into a string without specifying what you're indexing for the same reason.
this is one of those things that people point to when comparing languages, but in reality it rarely matters. with Go, you just get the number of bytes, which is the correct default thing to do:
if the language default was anything other than this, THAT WOULD BE WRONG and unexpected. I would prefer the default to be the dumb, fast thing. then if I want the slow, fancy thing, I can import some first or third party package.
I think to some extent it depends on the language. In the article they talk about Swift's implementation, which by default does the slow, fancy thing (but makes it easy to do the dumb, fast thing). String manipulation in Swift is almost certainly going to be used for a GUI for end users of many possible languages / locales, so it makes sense to spend the extra cycles to get the fancy version by default. If it isn't the default then you'll end up with half the apps on the App Store displaying broken text on line breaks, ellipses, wrapping, etc. on their hand-rolled UI stack.
For anyone wondering what Go does, it looks like Python2's way[1]; strings are byte sequences with no guarantees of UTF{anything} correctness. Go's source code is specified to be UTF8 so string literals in source code will become valid UTF8 encoded strings, but any string from any library call or code you didn't write might contain invalid Unicode text, or mixed encodings, or anything.
That feels a bit "pit of despair" design[2], the default thing is unhelpful and doing more than that requires the programmer to climb up out of it.
The sad thing is returning unicode code points is probably not going properly do what you wanted to do either... sliding down the slippery slope, you'd end up needing a text layout renderer and a language model to do what you thought you wanted to do. (and then there'll be a thousand bugs and edge cases that your libraries didn't handle properly)
Sure, however that's actually decoding the string into Unicode scalar values and then counting them, whereas the length of the string is a direct property of the string reference (it's a fat pointer [address + length]).
I don't remember, but I think the size hint is set on the Chars iterator, so it can see it has 17 bytes of data and knows that can't encode more than 17 Unicode scalar values, nor fewer than five. But since we ask for an exact count that hint is unused, and the actual decoding will take place.
Yes, your point? That is the same thing which happens in Swift if you request the length of a string and it gives you the number of glyphs (1, in this case).
Rust doesn't take sides here. It exposes all the different ways you might want to calculate the "length" of a string, and lets you pick which one you mean. The non-zero-cost choices involve a multi-step specification (like `.chars().count()`), which states explicitly the calculation involved.
Asking str.len() is a single very cheap operation, it's not only O(1) in the sense you'd learn in an algorithms course, it's really actually very cheap to do, it's fine if an algorithm relies heavily on str.len()
In contrast chars().count() creates an iterator and runs the iterator to completion counting steps, that's O(N) for a string of length N, and is in practice very expensive, you should definitely cache this value if you will need it repeatedly. It is possible the compiler can see what you're doing and cache it, but I am very far from certain so you should do so explicitly.
This is important in contrast to say, C, where strlen(str) is O(N) because it doesn't have fat pointers and so it has no idea how long the string is in any sense.
Yeah but unfortunately it provides `.len()` directly. It's documented to make clear that it's the bytecount and not the characters, and that humans usually work with characters, but given that this isn't even a trait implementation I think `.as_bytes().len()` or something would have been better.
This is only if you want strings to be sequences of bytes. If you want strings to be sequences of code points, it is more sensible to define string length as the length of the sequence. I prefer the latter (for coded text) because it is closer to the meaning of the string. Sequence of code points is always sequence of code points, but a sequence of bytes may not correctly encode a sequence of code points, and bytes in encoding are not in one-to-one correspondence with code points in string. So I see no reason to care about individual bytes per se in the string's code.
Because whenever you want to store or transmit a string only the byte count matters (the size of the string). All the fancy unicode stuff on top of bytes is for the display layers to handle. The default should be grounded to the reality of the programmer.
Storing and transmitting is always going to work with low-level storage units like bytes, so your string will need to be converted to that first. But string manipulation is extremely common in programming, and I would think graphemes are the most useful unit here - i.e. as a programmer my preference would be for swift's behaviour.
Human interaction is a more grounded reality for programmers vs. the dumb land of pure bytes, so even at that conceptual level the default should be smart
And bytes are the only thing that matters for a specific type of string, conveniently named a sequence of bytes.
I've been quite happy that popular emojis were introduced in supplementary planes, because my language has quite a few common words (eg. 𨋢 [lift/escalator]) that ended up on plane 2.
Proper software support for those characters used to be terrible, but things got much better after emojis became popular. So, thanks and sorry everyone :)
Unification was reasonable at the time, given the goal to fit Unicode in 16 bits, and willingness to exclude obsolete characters. It's just that they followed official Japanese standards, and therefore unified too many from the point of view of other languages.
I think the first big mistake was using postfix/infix operators (combining characters, modifiers, variant selectors, joiners, etc.) rather than prefix, preferably in blocks by arity. That would have simplified processing (in particular a keyboard dead key could have been identical to a combining character) and made broken sequences detectable.
The latest big mistake, I think, was retroactively changing some non-emoji characters to have “emoji presentation”, which means that some text has to be edited to preserve its original appearance.
Another mistake IMHO was that they accepted too many "dictionary characters", i.e. the ones only seen once or twice in some obscure dictionary -- they often had explanations like "an obscure form of [common character]".
I agree that they introduced unnecessary complexity for text encoding and for font rendering (which is now expected to support multi-coloured emoticons). I once started writing a text editor, and then fell deep into Unicode handling.
I have now put more work into the Unicode parts than into anything else in the program.
I think that the industry could have instead adopted the old web-forum convention of colon-word-encoding, originating from ASCII art. Example: ":facepalm:".
When the sequence is not supported as an emoji, it degrades gracefully into text that can be understood by anyone reading it instead of into a sequence of empty squares or diamonds with question marks in them. Text also provides a more efficient input method than having to browse for an icon in a list.
There's most definitely a situation where you want the length of string that contains an emoji character and perhaps dumb things happen if you get that wrong.
I was never much impressed with that article (too much irrelevant story for an “absolute minimum”), and by now it’s very dated. A lot of what it’s talking about (most notably code pages) is now completely irrelevant to the vast majority of developers, who might never encounter or need to worry about them in their entire careers.
Interestingly, Firefox on Wayland renders the emoji correctly in the tab title, but the window title renders it as two rectangles and the male symbol. I assume this must be some difference between system fonts vs Firefox's fonts.
In the CIA sense of "we need to handle these democratically elected leaders", perhaps.
But if I was selling you a drop-in comment widget and boasted "it handles all of Unicode", but really I was just running s/[^ -~]+/ /g, wouldn't you feel a bit let down?