Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

On Windows, sure. On other platforms UTF-8 is generally preferable (in my opinion).


Annoyingly, there is no simple user accessible UTF-8 decoder in libc. The only standard way to use iswalpha is to convert to wchar_t first.

One hack is to assume that bytes of UTF-8 encoded strings above 127 are all letters. It mostly works :-)


> Annoyingly, there is no simple user accessible UTF-8 decoder in libc.

Am I misunderstanding you, because I've always thought that's what the mbtowc(3) family of functions was?


Well you are right, but these functions are not terribly fun to use. Consider a parsing function which extracts an identifier. For ASCII it's:

    if (isalpha(*s)) {
        *d++ = *s++;
        while (isalnum(*s))
          *d++ = *s++;
    }
To use UTF-8 / Unicode should require only small changes:

    if (iswalpha(decode(&s)) {
        encode(&d, advance(&s));
        while (iswalnum(decode(&s))
            encode(&d, advance(&s));
    }
For efficiency, don't decode twice- have the decoder return a pointer to the next sequence:

    if (iswalpha(c = utf8(&s, &n))) {
        encode(&d, c);
        s = n;
        while (iswalnum(c = utf8(&s, &n))) {
            encode(&d, c);
            s = n;
        }
    }
Also should be able to match a string in line:

   if ('A' == utf8(&s, &t) && 'B' == utf8(&t, &s) && 'C' == utf8(&s, &t)) // we have 'ABC'.


mbtowc isn't necessarily thread safe, it's better to recommend mbrtowc.


Just use setlocale(LC_ALL, "") in main, and use mbrtowc to translate from whatever the system encoding is into the wchar_t type. There's no need to bake assumptions about the system encoding into most programs.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: