
All these abominations exist because of non-strict typing.

String = List(Char)

Chars don’t have a length, just as a number doesn’t have a length - unless you’re talking about the number of bits. If you are working with strings, stick with strings: the length of a string containing a single character should be 1. Just enforce proper typing. Anything else is inconsistent.



No, it's caused by cost. For example, Java has a char type. It's a 16-bit numeric value because Java uses UTF-16 internally for encoding strings. Java Strings are basically immutable char arrays with some fluff around them. If you ask for the String length, it returns the length of the underlying array. Nice and simple and unsurprising. And relatively cheap. Most newer languages use 8-bit bytes and UTF-8 instead, because that has emerged as the most common character encoding. But UTF-16 was a reasonable choice a quarter century ago, the practical difference doesn't matter that much, and changing it now would be disruptive.
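
A minimal sketch of that behavior (the emoji and the class name are just my own example of a character outside the Basic Multilingual Plane, not something specific to the discussion):

    public class LengthDemo {
        public static void main(String[] args) {
            // U+1F642 is outside the BMP, so UTF-16 stores it as a
            // surrogate pair of two chars.
            String s = "a\uD83D\uDE42";  // "a" followed by the slightly-smiling-face emoji
            System.out.println(s.length());   // 3: 'a' plus the two surrogate chars
            System.out.println(s.charAt(1));  // a lone high surrogate, not a usable character
        }
    }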

If you put Unicode characters that take more than one UTF-16 code unit into a String, that necessarily increases the number of chars. There's no way around that, because there is no such thing as a UnicodeChar type in Java; you can't assign a character made up of multiple code units to a single char.

Essentially all the workarounds for a 'correct' Unicode character count in a String would either end up using a different and probably far more expensive data structure (e.g. a list of lists of chars or bytes, where each inner list is one Unicode character) or implementing some expensive logic for counting characters that is O(n) instead of O(1). Most languages, from extremely strictly typed to weakly typed, don't do that, for cost reasons. The tradeoff simply isn't worth the price.
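
A rough sketch of what those workarounds look like using standard library calls (the class name and the example string are mine, not taken from the linked post):

    import java.text.BreakIterator;

    public class CountDemo {
        public static void main(String[] args) {
            // 'e' followed by a combining acute accent: one user-perceived
            // character, but two code points and two chars.
            String s = "e\u0301";

            // O(1): length of the backing char array.
            System.out.println(s.length());                       // 2

            // O(n): has to walk the chars to pair up surrogates.
            System.out.println(s.codePointCount(0, s.length()));  // 2

            // Even more work: grapheme segmentation via BreakIterator.
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) {
                graphemes++;
            }
            System.out.println(graphemes);                        // 1 grapheme
        }
    }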

This Stack Overflow post gives a few suggestions for how you could count 'correctly' and illustrates the point nicely: https://stackoverflow.com/questions/15947992/java-unicode-st...


Python has strong typing, which seems to be what you mean here, rather than strict typing.

A "character" is not a well defined term in Unicode, rather the "base" that does not vary across implementations is code points, which is what Python measures when you get the length of a string.




