
Let me ask you: with 10k commonly used characters, doesn't that lead to shorter texts? It's kind of like how a higher number base can encode larger numbers with fewer digits. In that case, the longer UTF-8 encoding per character could be made up for by using fewer characters. Or am I wrong about this assumption?

As an example, suppose there is a single character that denotes the word 'house'. If that character is encoded using five bytes, it takes the same amount of space as the English word does.



That seems more than plausible to me. While the character 象 (three bytes in UTF-8) is two bytes longer than the character "f", it is five bytes shorter than "elephant".

IIRC the average word length in English is around 5 characters.
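
For anyone who wants to check the byte counts above, here is a quick Python sketch. The helper name `utf8_len` and the sample list are just my own choices for illustration, and it only measures these few strings, so it says nothing about average text length across whole languages.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample strings taken from the comments in this thread.
samples = ["f", "象", "elephant", "house"]

# Map each sample to its UTF-8 byte length.
sizes = {s: utf8_len(s) for s in samples}

for s, n in sizes.items():
    print(f"{s!r}: {n} bytes")

# Difference discussed above: the single character versus the English word.
print("elephant minus 象:", sizes["elephant"] - sizes["象"], "bytes")
```

Swapping in other word/character pairs is an easy way to see how often the single-character form comes out ahead.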



