
> What i was hoping for was some kind of term for one character or symbol and use that as a unit

There is one, kind-of: "grapheme cluster"[0]. This is the "unit" used by UAX29 to define text segmentation, and aliases to "user-perceived character"[1].

Most languages and APIs don't really consider them (although they crop up often in e.g. browser bug trackers), let alone provide first-class access to them. One of the very few APIs which actually acknowledges them is Cocoa's NSString, and Apple provides a document explaining grapheme clusters and how they relate to NSString[2]. NSString has very good Unicode support (probably the best I know of, though Factor may have an even better one[3]). It handles grapheme clusters by providing messages which work on codepoint ranges in an NSString; it doesn't treat clusters as first-class objects.
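To make the codepoint vs. cluster difference concrete, here is a rough Python sketch. It is not the UAX29 algorithm: it only glues combining marks (non-zero canonical combining class) onto the preceding codepoint, so it misses Hangul jamo, ZWJ emoji sequences, and spacing marks such as most Devanagari vowel signs. The function name approx_clusters is made up for illustration.

```python
import unicodedata


def approx_clusters(text):
    """Approximate grapheme clusters (NOT full UAX29 segmentation).

    Each codepoint with a non-zero canonical combining class is attached
    to the cluster that precedes it; everything else starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # combining mark: extend the previous cluster
        else:
            clusters.append(ch)  # base codepoint: start a new cluster
    return clusters


# "café" spelled with a plain "e" followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"
print(len(word))                   # number of codepoints
print(len(approx_clusters(word)))  # number of approximate clusters
```

The codepoint count comes out one higher than the cluster count, which is the gap between what the string API reports and what a reader would call "the length of the word". For real text you would want a proper UAX29 implementation rather than this approximation.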

> i guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer..

Indeed.

[0] http://www.unicode.org/glossary/#grapheme_cluster

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...

[2] https://developer.apple.com/library/mac/#documentation/Cocoa...

[3] the original implementor detailed his whole route through creating factor's unicode library, and I learned a lot from it: http://useless-factor.blogspot.be/search/label/unicode



Very interesting, going to read through that guy's blog. Thanks for the links!



