> grapheme clusters that are considered invalid by some peers
I’m not sure there’s any spec that defines a sequence of Unicode scalar values as “invalid” (though plenty of sequences are obviously wrong, like some forms of script-mixing). Grapheme cluster segmentation doesn’t concern itself with meaningfulness; it just does something with whatever it’s given. So if you inject something into the middle of what it had decided was a cluster, it’ll simply split the text differently.
> treating the document as a sequence of arbitrary unicode codepoints
Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
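To make the distinction concrete, here is a minimal sketch of that validation in Python, where a `str` can hold lone surrogates. A code point is a scalar value if it lies in U+0000..U+10FFFF and outside the surrogate range U+D800..U+DFFF. The helper names `is_scalar_value` and `first_non_scalar` are made up for illustration and don't come from any library.

```python
from typing import Optional

SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: a code point that is not a surrogate."""
    in_code_space = 0 <= cp <= MAX_CODE_POINT
    is_surrogate = SURROGATE_FIRST <= cp <= SURROGATE_LAST
    return in_code_space and not is_surrogate


def first_non_scalar(text: str) -> Optional[int]:
    """Return the index of the first surrogate code point in text, or None."""
    for index, ch in enumerate(text):
        if not is_scalar_value(ord(ch)):
            return index
    return None
```

A document could then be rejected at the boundary whenever `first_non_scalar(doc)` is not `None`, rather than letting a lone surrogate propagate to peers.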
> Though I hope you validate that they are Unicode scalar values and not code points. Surrogates are terribad.
Yes, my mistake. I do mean scalar values. I am constantly confused by Unicode terminology. (Code point? Scalar value? Character? Surrogate pair? Is there a term for half of a surrogate pair?)
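For what it's worth, the terms can be pinned down in a few lines of Python. This is only a sketch: the function names `classify` and `to_surrogate_pair` are hypothetical, but the ranges and the terms "high-surrogate" and "low-surrogate" code point are the ones the Unicode Standard uses for the two halves of a surrogate pair.

```python
from typing import Tuple


def classify(cp: int) -> str:
    """Name the category an integer falls into, using Unicode's own terms."""
    if not 0 <= cp <= 0x10FFFF:
        return "not a code point"
    if 0xD800 <= cp <= 0xDBFF:
        return "high-surrogate code point"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low-surrogate code point"
    return "scalar value"


def to_surrogate_pair(scalar: int) -> Tuple[int, int]:
    """Split a scalar value above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only supplementary-plane scalar values use a surrogate pair")
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low
```

So every scalar value is a code point, but the 2,048 surrogate code points are not scalar values; they only make sense as the two halves of a pair in UTF-16.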