
Or take the XML route. Unicode offers several strategies to avoid continuously updating the table of identifier characters [1], one of which was adopted by XML 1.1 and by later editions of XML 1.0. The actual syntax can be as simple as the following (based on XML, converted to ABNF as in RFC 5234):

    name = name-start-char *name-char
    name-start-char =
        %x41-5A / %x5F / %x61-7A / %xC0-D6 / %xD8-F6 / %xF8-2FF / %x370-37D /
        %x37F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF / %x3001-D7FF /
        %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
    name-char = name-start-char / %x30-39 / %xB7 / %x300-36F / %x203F-2040

In my opinion, this approach is effective enough that every programming language that wants to support Unicode identifiers but nothing more complex (e.g. case folding, normalization, or confusable detection) should use this XML-based syntax. You don't even need to narrow it down: Unicode has explicitly avoided assigning identifier characters outside those ranges because of the very existence of XML identifiers!
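
To show how little code this takes, here is a rough Python sketch of a checker for the grammar above. The ranges are copied straight from the ABNF; the names `is_name`, `is_name_start_char` and `is_name_char` are my own and not from any spec or library, and a real lexer would more likely use a sorted range table or a generated bitmap than a linear scan.

```python
# Inclusive code point ranges, copied from the name-start-char rule above.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra ranges that name-char allows on top of name-start-char.
NAME_EXTRA_RANGES = [
    (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (lo, hi) range."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    """Check a single character against the name-start-char rule."""
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch):
    """Check a single character against the name-char rule."""
    cp = ord(ch)
    return _in_ranges(cp, NAME_START_RANGES) or _in_ranges(cp, NAME_EXTRA_RANGES)


def is_name(s):
    """Check a whole string against: name = name-start-char *name-char."""
    if not s:
        return False
    return is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])
```

With this, `is_name("foo_bar1")` and `is_name("名前")` are accepted, while `is_name("1abc")` and `is_name("a-b")` are rejected. Note that the check is deliberately permissive: nearly everything in the supplementary planes passes, which is the price of never having to update the table.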

[1] https://unicode.org/reports/tr31/#Immutable_Identifier_Synta...


