Unexpected places you can and can’t use null bytes

kstenerud · on Feb 9, 2020

It seems a bit odd to call it "unexpected" when any C API that accepts a char* and doesn't include a length parameter is commonly understood to expect a null terminated string, and any API that has a length parameter likely won't have this restriction.

ferzul · on Feb 9, 2020

you missed the point. it's not that this specific api call can't take embedded \0; it's that there is no alternative api that allows embedded null. there's no way to open a file by using a string containing \0, no matter what api you pick, but you can write data to a file which contains null if you pick the right function. there's no a priori way to know which apis have this duplication and which don't.

kstenerud · on Feb 9, 2020

I don't follow what you're getting at. If there's no length field, it's going to stop scanning on the first null. If there is a length field, it will keep reading until the length is reached. That's the general contract with C APIs.

If you actually NEED strings with nulls in them (although I couldn't think of a reason why), you'll need to use/find/create APIs with length fields.

taneq · on Feb 10, 2020

> If there is a length field, it will keep reading until the length is reached.

Well, it might. If it ever uses that string internally with some function that expects a null termination then it'll probably still get truncated.

kstenerud · on Feb 10, 2020

Possibly, depending on the API. But then again, strings should not have embedded NUL characters. There's no good reason for it. You have 32 other control characters to choose from.

ferzul · on Feb 11, 2020

Okay, don't think in terms of function pairs, but in terms of use case. You might want to write unknown UTF16 to a file, and you might want to write UTF16 to a filename.

But only one of these is possible. Why? Because there's an API for writing contents to a handle that takes a length, but not one for opening a file that does.

And you can't “create” APIs for what the kernel doesn't support. These are limits on what is possible without forking Linux or some other kernel.

lilyball · on Feb 10, 2020

That's the whole point. The article even starts with an example of this. printf doesn't let you write strings with embedded NULs, but you can use fwrite instead to write them. The 3 cases given in the article of places where you cannot use NUL bytes are places where not only is there no API that lets you include the NULs, but it's impossible to have one (without changing the kernel interface).

kstenerud · on Feb 10, 2020

That's by design. You wouldn't want a NUL in the middle of a string in any modern system. It's such an esoteric use case that there's no point in even supporting it. There are plenty of binary data interfaces to do what you want in situations where it actually makes sense.

PHP supports NUL, and the result is a HUGE amount of extra support code to handle it, spread all through the PHP codebase (introducing TONS of bugs over the years), all for the 0.00001% use case. Not a good trade.

Complaining about lack of NUL support would be like complaining that most text editors don't display the DC1 control character. It's just not useful in 99.999% of use cases, so there's no point in supporting it. You'd use a specialized tool instead.

Joker_vD · on Feb 10, 2020

Hey. People wrote actual working SOCKS proxies in PHP, and given that PHP's low-level functions represent an array of bytes as a string of chars, that'd be impossible if NUL was forbidden in them.

kstenerud · on Feb 10, 2020

I see...

So rather than, say, simply using the binary array data type already present in C, they decided that they needed to subvert strings to support passing non-string data, resulting in tens of thousands of lines of support code to parallel the C runtime library, thousands more lines of bridge code for any outside libraries they might want to make use of, and almost a million lines of PHP implementation code (not including extensions!) carrying this extra cognitive burden in order to support the 1% use case, resulting in countless bugs and security vulnerabilities whenever someone used the C library functions, or stopped on NUL (or not) by mistake.

Sounds like a great plan.

magicalhippo · on Feb 9, 2020

I ran across an issue with some Win32 API call which took a char pointer and a length, but which terminated the string at the first null char. I forgot which one it is, but it was related to text boxes (edits) somehow.

kstenerud · on Feb 10, 2020

Passing strings with embedded NULs would probably break the API in other ways anyway. I'm glad they guarded against that.

Cogito · on Feb 10, 2020

> although I couldn't think of a reason why

One I've seen in the wild is an encoded data structure, a blob of null terminated strings passed around together.

kstenerud · on Feb 10, 2020

... in which case you're conceptually dealing with a structure, not a string.

Cogito · on Feb 10, 2020

Sure, once you’ve unpacked it into a structure. It’s used for sending stuff over the wire, but also for storing on disk etc.

Conceptually it was dealt with as one string, which happened to contain nulls.

kstenerud · on Feb 10, 2020

You'd be better off encoding it using ETB (end of transmission block). That's what the control code was designed for.

Cogito · on Feb 10, 2020

I can’t really speak to the why.

In any case, I remembered seeing NUL in strings in git. I'm not sure if it's the same place (maybe?), but a quick search shows some real-world usages: https://github.com/git/git/search?q=nul

rvnx · on Feb 9, 2020

memcpy/memcmp

cpeterso · on Feb 9, 2020

Pedantic nit: the ASCII '\0' character is called NUL, not null or NULL. C strings are NUL-terminated or zero-terminated, not null-terminated.

https://en.wikipedia.org/wiki/ASCII

jfk13 · on Feb 9, 2020

IMO, you're being a bit too pedantic here; it's perfectly reasonable to refer to a "null byte" or "null character". https://en.wikipedia.org/wiki/Null_character

It's true that "NUL" is the usual abbreviation for this value in character code charts/standards.

DonHopkins · on Feb 9, 2020

I am disappointed this URL does not redirect to that page:

https://en.wikipedia.org/wiki/%00

But what the hell does it do??? In Safari and Firefox, I get an nginx 400 Bad Request page from en.wikipedia.org. But in Chrome, it seems to be redirecting to a google search for the same url, when I type it into the address bar. Well, that's meta. Chrome won't even let me drag-and-drop that icky %00 terminated url from one page into another page to navigate there -- it angrily rejects it and sadly animates the evil url back to where it came from (though dragging it into an existing or new tab mysteriously works). But actually clicking on that link immediately goes to a blank purgatory page with the url "about:blank#blocked". Those are Chrome's stories, and it's sticking with them.

At least this works:

https://en.wikipedia.org/wiki/%01

>Special Page

>Bad title

>The requested page title contains an invalid UTF-8 sequence.

>Return to Main Page.

Oh, yeah -- PHP:

https://webmasters.stackexchange.com/questions/84008/url-enc...

If not the actual NUL character, then at least the Unicode Symbol for NUL redirects to the page on the Null character.

https://en.wikipedia.org/wiki/␀

>Null character

>From Wikipedia, the free encyclopedia

>(Redirected from ␀)

>For other uses, see Null symbol.

...But then again, shouldn't the Null symbol ␀ redirect to the page on the Null symbol, which it actually is, not the page on the Null character, which it only symbolizes?

https://en.wikipedia.org/wiki/Null_symbol

missblit · on Feb 9, 2020

> But what the hell does it do??? In Safari and Firefox, I get an nginx 400 Bad Request page from en.wikipedia.org. But in Chrome, it seems to be redirecting to a google search for the same url, when I type it into the address bar.

Firefox makes a GET request to "https://en.wikipedia.org/wiki/%00" and the server returns an HTTP/2 400 Bad Request, presumably because the web server considered the URL invalid.

Chrome decides the string isn't a valid URL up front. So it does what it normally does when you enter random junk in the address bar; it searches for it.

The dirty secret of URLs is that no one can quite agree on which ones are valid or how they should be canonicalized.

We can take WHATWG's spec as a modern way to handle URLs [1]. If I'm reading it right (50/50 chance!) the URL would be considered valid by that spec.

See also this article from the developer of curl: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/

[1] https://url.spec.whatwg.org/

kccqzy · on Feb 10, 2020

> The dirty secret of URLs is that no one can quite agree on which ones are valid or how they should be canonicalized.

Fun story. An engineer was working to migrate an old system from Python 2 to 3 before the Python 2 EOL deadline. The engineer decided to use the str type to represent URLs. Chaos ensued when suddenly non-UTF-8 URLs didn't work any more. Turns out that back when that system was designed, people were directly URL-encoding binary data into URLs.

MatmaRex · on Feb 9, 2020

> ...But then again, shouldn't the Null symbol ␀ redirect to the page on the Null symbol, which it actually is, not the page on the Null character, which it only symbolizes?

Perhaps it should. That page actually did not exist when the redirect was created:

https://en.wikipedia.org/w/index.php?title=␀&action=history

https://en.wikipedia.org/w/index.php?title=Null_symbol&actio...

kevin_thibedeau · on Feb 9, 2020

NUL is explicitly 7 bits. Any other null could be larger than a machine byte.

raverbashing · on Feb 9, 2020

Even your link says specifically that the name of '\0' is Null; NUL is merely an abbreviation.

See the table here https://en.wikipedia.org/wiki/ASCII#Control_characters

taftster · on Feb 9, 2020

Honest question (and not being rude). Is that strictly just being pedantic about it? Is the concept of NUL vs "NULL" semantically the same, and so you are just pointing out the ASCII abbreviation being used (three characters being consistent in the spec)? Genuinely curious if the concept of ASCII NUL '\0' can be interpreted differently in other contexts (other languages maybe)?

[edit, to clarify the question]

Like for example, the ASCII code '\0' is still a valid "thing" (it's a byte sequence, kind of). But in other contexts (other languages maybe), NULL is not a "thing" per se; it's more of a non-thing. How does a C programmer see the difference between NUL and NULL?

Sharlin · on Feb 9, 2020

The two things referred to by the terms ”NUL byte” and ”null pointer” (aka ”NULL”) are quite distinct and mixing them up is not advisable, but calling the NUL byte a NULL byte, where it’s clear from the context what is meant, is not that confusing.

thaumasiotes · on Feb 9, 2020

In the context of C, where NUL characters are most likely to be discussed, "NUL byte" and "null pointer" refer to exactly the same thing, that thing being the integer value 0.

AnimalMuppet · on Feb 9, 2020

Only if pointers are 8 bits.

More specifically, "NUL byte" is the character that has value 0, whereas "null pointer" is the pointer that has value 0.

OK, I'll try to stop being the pedant. I'll let the real pedants poke holes in what I said...

thaumasiotes · on Feb 9, 2020

> More specifically, "NUL byte" is the character that has value 0, whereas "null pointer" is the pointer that has value 0.

Where is this enforced? You can assign between notional types without a problem. Whatever the context, providing 0 will have the same effect no matter how you labeled it.

AnimalMuppet · on Feb 9, 2020

Enforced? Not, as I think you're pointing out, by prohibiting conversion. That is, if

  char c = '\0';
  void* p = (void*)0;
  int a = (int)c;
  int b = (int)p;

then

  a == b

resolves to true. In that sense, they are the same, which I believe is your point. On that point, you are completely correct.

The difference is that I can't do

*c

In that sense, they are not the same - not in the sense of numeric value, but in the sense of type. Also, c is 8 bits, and p (at least these days) is 32 or 64. (I pity anyone who ever had to work in an environment where p was 8 bits!) So they are the same numerically, but they are different both in type and in memory footprint.

thaumasiotes · on Feb 9, 2020

a == p will also resolve to true. So will c == p. a and b don't add anything to your example; they're just in there to make c and p look like they're more different than they are.

    $ cat zero.c
    #include "stdio.h"

    int main(int argc, char* argv[]) {
     char nul = 0;
     void* null = 0;

     if( nul == null ) {
      printf("compared char to pointer; they are the same\n");
     } else {
      printf("found a difference between char and pointer\n");
     }
     return 0;
    }

    $ gcc -o zero zero.c
    zero.c: In function ‘main’:
    zero.c:7:10: warning: comparison between pointer and integer
      if( nul == null ) {
              ^~
    $ ./zero
    compared char to pointer; they are the same
    $

You get a warning, but not an error, for making the comparison. By contrast, assigning the integer zero to a void* isn't even a warning -- it's just a natural thing to do. There isn't another way to set a pointer to NULL. There is another way to set a character to 0, the '\0' syntax, but that's not a warning either.

C will think nothing of adding '!' to 'P' and getting 'q'. That's not strange because addition is a pretty normal thing to do with integers. You're right that a char variable should only occupy 8 bits of memory, but that's an implementation artifact, not a theory of what the value '\0' means. That value is unambiguously the integer zero with infinite precision. The reason it only occupies 8 bits is that you can't let it have infinite bits.

AnimalMuppet · on Feb 9, 2020

Hmm. Probably true. What about c == p? I don't recall whether both will promote to numeric, or whether it's a type error.

For that matter, what does (uncasted) p = c do? How about c = p?

[Edit: You updated while I was writing; my first question you already answered. Warning, not error.]

thaumasiotes · on Feb 9, 2020

    $ cat pointers.c 
    #include "stdio.h"

    int main(int argc, char* argv[]) {
     char Z = 'Z';
     char q = 'q';
     void* null = 0;

     printf("Z is \\x%02x\n", Z);
     printf("But if it were a pointer, it would be %08x\n", Z);
     printf("Watch this:\n\n");

     null = Z;

     /* %p to print a pointer value */
     printf("Our void* is now: %p\n", null);

     q = null;

     printf("And q is: %c\n", q);

     return 0;
    }
    $ gcc -o pointers pointers.c
    pointers.c: In function ‘main’:
    pointers.c:12:7: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      null = Z;
           ^
    pointers.c:17:4: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
      q = null;
        ^
    $ ./pointers
    Z is \x5a
    But if it were a pointer, it would be 0000005a
    Watch this:

    Our void* is now:     0x5a
    And q is: Z
    $

The assignments are warnings. They work just like you'd expect them to work.

Notice all the different printf flags? This is why you need them.

spc476 · on Feb 10, 2020

The literal 0 in a pointer context will be converted to a NULL pointer, which can be a non-zero bit pattern (there are some systems where the actual NULL pointer isn't all zeros). Going through a variable might not do what you think. So this:

    char *p = 0;

is fine, but

    intptr_t a = 0;
    char *p = (char *)a;

might not do what you expect (set p to the NULL pointer).

Sharlin · on Feb 10, 2020

That’s not accurate. A NUL byte is a byte with an all-zeros bit pattern, whereas a null pointer is a special pointer value (and thus pointer-sized) that can be coerced from the integer literal 0, but not in general from an arbitrary integer with value zero, and what’s more, a null pointer value is not guaranteed to have an all-zero bit pattern!

sbmassey · on Feb 9, 2020

Fine as long as you call the backspace character BS and the tab character HT.

earenndil · on Feb 10, 2020

But then you have to call vertical tab VT, and then you have no name for a virtual teletypewriter.

kelnage · on Feb 9, 2020

Better let Wikipedia know that then! https://en.wikipedia.org/wiki/Null-terminated_string

mdiesel · on Feb 9, 2020

I agree, and I think it's an important distinction as sizeof('\0') is not the same as sizeof((void*)0).

However, the standard does talk about null characters throughout as being a character code of value zero, as opposed to NULL the macro.

taneq · on Feb 10, 2020

I used to be pedantic about this one, now I just use 'null' because that's what everyone else uses and so they'll know what I'm talking about. If I really want to be picky I'll use '\0'.

monocasa · on Feb 9, 2020

So the BEL character rings a bel on the user's terminal?

nneonneo · on Feb 10, 2020

Unexpected places where you can use null bytes: gets, fgets and scanf("%s"). All three will read and store null bytes into your string from the input, and keep going: gets and fgets only terminate at a newline character and scanf only terminates at whitespace (which doesn't include the null byte).

gets and scanf("%s") are also horrifically unsafe. gets is well-known to be unsafe (to the point where you'll almost certainly get a compiler warning for using it). However, scanf("%s") is unsafe for exactly the same reason (no bound on the buffer length) yet will not produce a compiler warning. Add to that the fact that these functions will accept null bytes, and you have a very dangerous buffer overflow waiting to happen.

fao_ · on Feb 10, 2020

This is why you _always_ write:

    if (*s && *s != '\n' ...)

and never:

    if (*s != '\n' ...)

msarnoff · on Feb 9, 2020

One unexpected place where null bytes are acceptable: Wi-Fi SSIDs. That’s one way to keep people off your network, I suppose.

ChrisSD · on Feb 9, 2020

> While we’re on the topic, it’s worth noting that the only other restriction on filenames is that they cannot contain a /, which is the character used to denote directories. Filenames can contain arbitrary other binary data, including spaces and newlines, and there’s no defined character encoding.

I've seen this bite people when junk gets written to a filename (either accidentally or maliciously). Especially in shells but also in other programming languages. Issues that aren't always handled well include file names that:

* include a newline or some other control characters

* start with a `-`

* aren't valid UTF-8

DonHopkins · on Feb 9, 2020

GatorBoxes (an AppleTalk<=>Ethernet Apple File Sharing / NFS bridge) would let Mac users create files in the Unix file system with slashes in their names (presumably because of NFS shenanigans going on in the kernel).

Macs let you use "/" in file names, but instead used ":" as the path separation character.

It seemed to work fine and dandy at the time, until you ran "restore" and discovered your backups were corrupted!

https://en.wikipedia.org/wiki/GatorBox

https://news.ycombinator.com/item?id=20007875

Just for fun: guess what happens when you create a Mac file name with a slash in it today?

It works!

Or it seems to work. But behind the scenes, the Finder and Mac user interface libraries actually convert the "/" to a ":", which you can see with "ls". But at least it doesn't corrupt your backups!

nneonneo · on Feb 10, 2020

Classic MacOS used : for the directory separator, so you could freely put / in your filenames. Modern macOS converts / in the UI to : under the hood in order to retain compatibility, while banning : in the UI (the same restriction that classic MacOS had).

naedish · on Feb 9, 2020

Aren't the filenames . and .. also restricted in Unix?

Windows has many more restrictions. [1]

[1] https://kizu514.com/blog/forbidden-file-names-on-windows-10/

ChrisSD · on Feb 9, 2020

Note that "forbidden" filenames only apply to win32 paths. "Extended-length" paths can be used to create and open such files. This can cause issues with applications that don't know how to handle them.

But yes, Windows reserves (16 bit encoded) ASCII control characters and some special symbols like question mark and colon.

As for . and .. I believe that on UNIX they are just symlinks, though they may be treated specially by applications? On Windows they don't really exist but some APIs emulate them when resolving paths.

cesarb · on Feb 9, 2020

> As for . and .. I believe that on UNIX they are just symlinks, though they may be treated specially by applications?

No, they are treated specially by the kernel and/or the filesystem. They behave similarly to hardlinks to the corresponding directory (hardlinks to directories aren't allowed nowadays; these two are an exception). The special treatment by applications is that many applications hide all files and directories starting with a dot, which happens to also apply to these two.

ChrisSD · on Feb 9, 2020

Ah, thanks for the correction.

malkia · on Feb 9, 2020

Also peculiar, in Windows, is handling of files that have spaces at the end :) - I've seen in my gamedev career more than one case, where a tool doesn't do enough trimming on spaces only to bite us later.

heavenlyblue · on Feb 9, 2020

I have recently accidentally created a folder named “~”. Then I tried deleting it through shell.

bhaak · on Feb 9, 2020

Deleting a dir with rmdir is always safe :-)

The other trick that makes dangerous operations always more safe is prepending "./" to the path.

NullPrefix · on Feb 9, 2020

You can "safely" delete it with non-recursive rmdir

smichel17 · on Feb 9, 2020

I did similar, once. I had a script where I misquoted and ended up with a directory starting with ~

It was the only directory starting that way, so I typed "rm -rf ~<TAB><ENTER>"

Hit control-C a half second later, but the damage was done. Fortunately most of my important files are backed up.

Lesson: when deleting files with tricky names, write the command without flags first, then add "-rf" after the path is confirmed correct.

heavenlyblue · on Feb 9, 2020

I’ve got lucky with some directory that is called .asound or similar. It was the first one in the home directory and it didn’t manage to go beyond that.

Someone · on Feb 9, 2020

There are two distinct sets of restrictions on file names: those of the OS API (in Unix, that means no null bytes, and / is a directory separator), and those of the file systems in the path, which can be (almost) anything.

Quite a few file systems require names to be valid text in a specific encoding, not arbitrary byte sequences.

Also, note the use of a plural in “the file systems in the path”. A file system mounted in another one can change the rules halfway through. There is no single fixed set of restrictions on file names.

cdcarter · on Feb 9, 2020

Neat article, but the author doesn't provide an actual reason you'd _need_ to pass a NUL byte into something like a socket address, or command arg.

Is there a (perhaps obvious, or not) common usage of NUL byte literals being passed around, not for the purpose of terminating strings? Just terrible ye-olde file formats?

singron · on Feb 9, 2020

All kinds of binary data might have a NUL byte. E.g. if you want to write a NUL byte to a file in the shell, you might try something like

    echo -n $'\0' > nul

This doesn't work for the reason stated in the article. The argument is instead interpreted as a string ended at the NUL byte and the file will be empty. BTW you can get around this with printf since it processes escape sequences internally.

    printf '\0' > nul

nneonneo · on Feb 10, 2020

There's also `echo -ne '\0'`, which works similarly (it tells echo to interpret the escape sequences).

jjnoakes · on Feb 9, 2020

The article does mention one usage, near the end, starting with the text "The “abstract” socket feature..."

coolreader18 · on Feb 10, 2020

Others have mentioned "arbitrary binary data", but a concrete example of that (one I know of, though not one that would be particularly useful to pass around) is the WASM bytecode format: it starts with '\0asm', using the NUL byte to signal to tools reading it that it isn't text and they shouldn't attempt to display it (or anything like that). However, it is a _byte_code format, and as such, it would be represented with a `char*` pointer in C. Yet any API that uses NUL-terminated [byte]strings would stop at the very first byte and not keep any data beyond that.

rwmj · on Feb 9, 2020

Maybe not particularly useful but we have an nbdkit plugin that lets you specify disks on the command line. If the disk contains any zero bytes (which is very common for disk images obviously) then you have to use base64 encoding instead of directly encoding them, for example by using $(echo -e). http://libguestfs.org/nbdkit-data-plugin.1.html

warkdarrior · on Feb 9, 2020

File names in UTF-32 encoding?

detaro · on Feb 9, 2020

One common usage: NUL bytes are useful as a separator between file names etc if you want to be prepared for any possible name. E.g. many commandline tools have a "-print0" flag to switch from newlines to NUL-bytes in their output.

fwsgonzo · on Feb 10, 2020

You can printf a null byte just fine, you just need to provide the length, just like with fwrite:

    printf("%.*s", (int) nbytes, str);

skrebbel · on Feb 9, 2020

PostgreSQL TEXT fields