C String Creation (maxsi.org)
67 points by fcambus on Feb 11, 2016 | hide | past | favorite | 26 comments


Good news in the article: apparently asprintf is scheduled for the next POSIX standard. I wonder how one sees what is in the next POSIX standard; it would inform my choices of what I use. I'll use something that works on all my platforms if it is headed to a standard, but usually not if it may be headed for oblivion.

Also open_memstream is POSIX 2008, now if it would just get into OS X…
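For reference, a minimal sketch of how these two interfaces are used, assuming glibc (where asprintf needs _GNU_SOURCE); the helper names are made up for illustration:

```c
/* Sketch of the two allocation-backed creation interfaces discussed
 * above: asprintf() (GNU/BSD extension, proposed for POSIX) and
 * open_memstream() (POSIX 2008). Error handling is abbreviated. */
#define _GNU_SOURCE /* glibc: expose asprintf */
#include <stdio.h>
#include <stdlib.h>

/* Build "name=value" in a freshly malloc'd string; caller frees. */
char *make_pair(const char *name, int value) {
    char *s = NULL;
    if (asprintf(&s, "%s=%d", name, value) < 0)
        return NULL; /* s is indeterminate on failure */
    return s;
}

/* Build a string incrementally through stdio; caller frees *out. */
int build_greeting(char **out, size_t *outlen) {
    FILE *f = open_memstream(out, outlen);
    if (!f)
        return -1;
    fputs("hello", f);
    fprintf(f, ", %s", "world");
    return fclose(f); /* *out and *outlen become valid here */
}
```

The nice property in both cases is that the library sizes the buffer for you, so there's no truncation or overflow to reason about.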


These functions are proposed for the next C standard. They are in part 2 of the bounds-checking interface. The relevant proposal is n1337, 'dynamic allocation functions':

  asprintf() 
  aswprintf()
  vaswprintf()
  fmemopen() 
  open_memstream() 
  open_wmemstream()
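Of these, fmemopen() goes the other direction: it wraps an existing memory buffer in a FILE * so the usual stdio readers work on in-memory data. A small sketch (the parse_point helper is made up for illustration):

```c
/* Sketch of fmemopen() (POSIX 2008): treat an existing buffer as a
 * FILE *, here opened read-only so the stdio scanners can parse it. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Parse "x y" integers out of an in-memory string via stdio. */
int parse_point(const char *text, int *x, int *y) {
    /* Cast is safe: mode "r" means the buffer is never written. */
    FILE *f = fmemopen((void *)text, strlen(text), "r");
    if (!f)
        return -1;
    int n = fscanf(f, "%d %d", x, y);
    fclose(f);
    return n == 2 ? 0 : -1;
}
```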


If you want this today on all (C89) platforms, GLib ships an embedded copy of gnulib and runs virtually everywhere.


gnulib is atrocious beyond belief, though. And I hear GLib tends to abort your process on OOM; I haven't done my research on that library, so I would be careful using these to develop reliable software.


Should we not be using wchar_t strings in modern C?

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(int argc, char *argv[]) {
        wchar_t buf[100];
        /* Pick up the system locale so multibyte conversion works */
        setlocale(LC_ALL, "");
        wprintf(L"Hello, world!\ntype something>");
        if (fgetws(buf, 100, stdin))
            wprintf(L"You typed '%ls'\n", buf);
        if (argv[1]) {
            const char *s = argv[1];
            /* Convert char string to wchar_t string */
            size_t len = mbsrtowcs(buf, &s, 99, NULL);
            if (len != (size_t)-1) {
                buf[len] = 0;
                wprintf(L"argv[1] is '%ls'\n", buf);
            }
        }
        return 0;
    }
It's a pain, but the advantage is access to iswalpha() and friends.


"Modern" C is char16_t and char32_t. The old wchar_t type has many issues. You can read more here: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1286.pdf


char16_t and char32_t are useless. The C standard declares functions in <uchar.h> for converting them to and from char, but not wchar_t. The conversion to char may be lossy depending on the platform. No other interfaces use those types. There's no portably lossless path for converting them to and from wchar_t.
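For what it's worth, the <uchar.h> path described here looks roughly like this: a sketch assuming C11 with mbrtoc32 available (the first_c32 helper is hypothetical, and the result depends on the locale-dependent char encoding, which is exactly the complaint):

```c
/* Sketch of the only standard conversion for char32_t: through the
 * locale's multibyte char encoding via mbrtoc32(), not via wchar_t. */
#include <uchar.h>
#include <string.h>

/* Decode the first character of a multibyte string into a char32_t.
 * Returns the number of bytes consumed, or 0 on error/end. */
size_t first_c32(const char *s, char32_t *out) {
    mbstate_t st;
    memset(&st, 0, sizeof st); /* initial conversion state */
    size_t n = mbrtoc32(out, s, strlen(s), &st);
    return (n == (size_t)-1 || n == (size_t)-2) ? 0 : n;
}
```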


There are various "solutions" to this problem of holding one "character" per instance of a type. If for some reason you don't want to use char * (for example, you want to find the length of a multi-byte-per-character string), there's https://github.com/cls/libutf


The point is that wchar_t should be removed entirely. Assume it doesn't exist anymore and use char16_t and char32_t everywhere.


This is entirely undesirable. First of all, char16_t and char32_t are kinda useless, as there are no standard interfaces using them and no conversion functions to and from wchar_t.

Secondly, no, you're asking for a massive addition of two new versions of every interface that mentions wchar_t. That's a huge addition to standard libraries. That's error-prone and bloats things up. Then, additionally, you're asking for a rewrite of all software using wchar_t. And until everything is transitioned, which isn't going to happen, the standard libraries will be much larger.

The solution is rather to embrace wchar_t and fix it. All sensible and modern platforms, which is a premise of this article on modern POSIX functions, have a 32-bit wchar_t type. That's excellent. It's only Windows that, due to historical short-sightedness, has a 16-bit wchar_t. But writing portable C for native Windows is a losing game; the winning move is not to play. (Do see midipix, which is upcoming and will provide a new POSIX environment for Windows with musl and a 32-bit wchar_t.) In fact, a 16-bit wchar_t violates the C standard. The moment you give up broken platforms with 16-bit wchar_t, wchar_t works as intended, and this is a non-problem. Embracing char16_t and char32_t is a worse problem and isn't solving anything.


On Windows, sure. On other platforms UTF-8 is generally preferable (in my opinion).


Annoyingly, there is no simple user accessible UTF-8 decoder in libc. The only standard way to use iswalpha is to convert to wchar_t first.

One hack is to assume that bytes of UTF-8 encoded strings above 127 are all letters. It mostly works :-)


> Annoyingly, there is no simple user accessible UTF-8 decoder in libc.

Am I misunderstanding you, because I've always thought that's what the mbtowc(3) family of functions was?


Well you are right, but these functions are not terribly fun to use. Consider a parsing function which extracts an identifier. For ASCII it's:

    if (isalpha(*s)) {
        *d++ = *s++;
        while (isalnum(*s))
            *d++ = *s++;
    }
To use UTF-8 / Unicode should require only small changes:

    if (iswalpha(decode(&s))) {
        encode(&d, advance(&s));
        while (iswalnum(decode(&s)))
            encode(&d, advance(&s));
    }
For efficiency, don't decode twice: have the decoder return a pointer to the next sequence:

    if (iswalpha(c = utf8(&s, &n))) {
        encode(&d, c);
        s = n;
        while (iswalnum(c = utf8(&s, &n))) {
            encode(&d, c);
            s = n;
        }
    }
It should also be able to match a string inline:

    if ('A' == utf8(&s, &t) && 'B' == utf8(&t, &s) && 'C' == utf8(&s, &t)) // we have 'ABC'.
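For the curious, one possible shape of the hypothetical utf8() helper used above: a minimal decoder sketch with no overlong-sequence or surrogate validation, returning U+FFFD on malformed input:

```c
/* Minimal sketch of a utf8() decoder matching the hypothetical
 * interface above: return the codepoint at *s and store the address
 * of the next sequence in *next. Not a validating decoder. */
#include <stdint.h>

uint32_t utf8(const char **s, const char **next) {
    const unsigned char *p = (const unsigned char *)*s;
    uint32_t cp;
    int len;
    if (p[0] < 0x80)      { cp = p[0];        len = 1; } /* ASCII */
    else if (p[0] < 0xC0) { *next = *s + 1; return 0xFFFD; } /* stray continuation */
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; len = 2; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; len = 3; }
    else                  { cp = p[0] & 0x07; len = 4; }
    for (int i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) { *next = *s + i; return 0xFFFD; }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *next = *s + len;
    return cp;
}
```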


mbtowc isn't necessarily thread-safe; it's better to recommend mbrtowc.


Just use setlocale(LC_ALL, "") in main, and use mbrtowc to translate from whatever the system encoding is into the wchar_t type. There's no need to bake assumptions about the system encoding into most programs.
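A sketch of that approach (the count_alpha helper is made up; call setlocale(LC_ALL, "") in main before using it):

```c
/* Sketch: decode the system encoding one wchar_t at a time with
 * mbrtowc(), which is restartable and thread-safe, unlike mbtowc(). */
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Count alphabetic characters in a multibyte string, locale-aware. */
size_t count_alpha(const char *s) {
    mbstate_t st;
    memset(&st, 0, sizeof st); /* initial shift state */
    size_t total = 0, left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
            break; /* NUL or malformed input: stop */
        if (iswalpha((wint_t)wc))
            total++;
        s += n;
        left -= n;
    }
    return total;
}
```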


No, it's important to understand the distinction between char and wchar_t. Both are relevant, but in different contexts. char should be considered a byte type to pass around UTF-8 with. This is the appropriate level for the large majority of common string operations, such as concatenation, outputting strings directly, parsers that only handle ascii characters specially, and so on.

Those applications don't really care about the actual unicode codepoints besides ASCII. If you start to deal with visual representation of strings, calculating the column for error messages, advanced unicode-aware parsing, font rendering, and so on, then you do want to convert on the fly to wchar_t. mbsrtowcs and such are kinda bad, because they convert the whole string at once, which means an allocation that can fail in the unbounded case. It's usually sufficient to decode one wchar_t at a time with mbrtowc.

This way, char and wchar_t are not replacements for each other, but complement each other by being better abstractions for various purposes. Now, the wide stdio functions are where things start to get a bit useless, because the regular stdio char functions are perfectly fine and those functions don't really play well to the strengths of wchar_t.


This is an interesting article, but the "Portability" comments could be a lot more useful: strndup and open_memstream are both "POSIX 2008", but strndup can be used on OS X while open_memstream cannot.


I deliberately didn't write that, to avoid the page going stale when OS X adds it. But OS X is behind the times, and that's harmful. POSIX 2008 has been out for years and most of the missing features are trivial to add. They're being actively harmful to Unix software by not having modern interfaces, forcing portable software to be worse. The purpose of the article is to highlight the interfaces and when they're suitable, rather than being a replacement for your system manual page or a portability guide. Since OS X isn't a free software Unix (though its libc is), I don't really consider it among the relevant modern Unix systems. Linux, the BSDs, and so on all have the POSIX 2008 features mentioned here.


While not immediately clear, OS X is POSIX 2003. So if you go strictly by POSIX standards, you shouldn't rely on either.


You'd probably be able to duplicate open_memstream on OS X using funopen.


Ran into this recently. It's not open_memstream, but the same idea: https://github.com/NimbusKit/memorymapping


This is advice to use C in a way that resembles higher-level languages. But if you want to do so, why not simply use a higher-level language?

The power of C, which distinguishes it from most other languages, is the ability to allocate almost everything statically.

In fact, the older I get, the more I appreciate the pre-Algol 60 way of allocating stack frames statically.


Whether C is appropriate is highly dependent on the project and context. Higher-level languages offer a lot and should be used when appropriate.

But when C is appropriate, and these problems arise, which they will in any C codebase of appreciable complexity, these string creation interfaces are waiting for you, and will help you write correct code. It's usually a worthwhile effort to reconstruct higher level abstractions in C, in a good manner, for the same reasons you use them in higher level languages.

Note that statically allocating everything is not always possible. See the distinction between bounded and unbounded in the article; the unbounded case is really common.


I see too much damage caused by strncpy() to ever recommend it for use. Code that is blissfully unaware of the non-guaranteed NUL or that repetitively does extra work to guarantee a NUL. Use strlcpy() if available or reimplement it.
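For platforms without strlcpy, a sketch reimplementation, named my_strlcpy here to avoid clashing with a libc that does provide it:

```c
/* Sketch reimplementation of strlcpy(): always NUL-terminates (when
 * size > 0) and returns strlen(src) so callers can detect truncation. */
#include <stddef.h>
#include <string.h>

size_t my_strlcpy(char *dst, const char *src, size_t size) {
    size_t srclen = strlen(src);
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen; /* a return value >= size means truncation */
}
```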


As in the article, strncpy has valid uses, but it's widely misunderstood due to the poor name; strlcpy does what that name suggests. People are also surprised by the zero padding.
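A small demonstration of both surprises, assuming the classic fixed-size-field use case strncpy was actually designed for:

```c
/* strncpy() zero-pads the destination to exactly n bytes and does NOT
 * NUL-terminate when the source fills the field, which is exactly
 * right for fixed-size record fields and wrong for C strings. */
#include <string.h>

/* Fill a fixed 8-byte record field, NUL-padded; not NUL-terminated
 * if the name occupies all 8 bytes. */
void set_field(char field[8], const char *name) {
    strncpy(field, name, 8);
}
```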



