
> grapheme clusters that are considered invalid by some peers

I’m not sure there’s any spec that defines any sequence of Unicode scalar values as “invalid” (though there’s certainly a lot that’s obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness; it just does something with whatever it’s given. So if you inject something into the middle of what it had decided was a cluster, it’ll just split the text differently.

> treating the document as a sequence of arbitrary unicode codepoints

Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
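To make the distinction concrete, here is a minimal sketch in Python of what that validation could look like. The function names are made up for illustration. It relies on the fact that a scalar value is any code point in U+0000..U+10FFFF except the surrogate range U+D800..U+DFFF, and that a Python `str` is allowed to hold lone surrogates, so the check is not redundant there.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: a code point that is not a surrogate."""
    # Hypothetical helper, not from any library.
    in_code_space = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_code_space and not is_surrogate


def all_scalar_values(text: str) -> bool:
    """True if every code point in text is a scalar value (no lone surrogates)."""
    return all(is_scalar_value(ord(ch)) for ch in text)
```

A string such as `"a\ud800b"` fails this check, and it would also fail a strict `.encode("utf-8")` with a `UnicodeEncodeError`, which is another way to reject such input at the boundary.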



> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.

Yes, my mistake. I do mean scalar values. I am constantly confused by Unicode terminology. (Unicode code point? Scalar value? Character? Surrogate pair? Is there a term for half of a surrogate pair?)
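For what it’s worth, the Unicode glossary does name the halves: a "high-surrogate code point" (U+D800..U+DBFF) and a "low-surrogate code point" (U+DC00..U+DFFF). A rough Python sketch of how UTF-16 maps a supplementary scalar value onto such a pair, with hypothetical function names:

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary-plane scalar value into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

So U+1F600 becomes the pair (0xD83D, 0xDE00). Either half on its own is a code point but not a scalar value.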



