Unexpected places you can and can’t use null bytes (eklitzke.org)
63 points by ammmir on Feb 9, 2020 | 71 comments


It seems a bit odd to call it "unexpected" when any C API that accepts a char* and doesn't include a length parameter is commonly understood to expect a null terminated string, and any API that has a length parameter likely won't have this restriction.


you missed the point. it's not that this specific api call can't take embedded \0; it's that there is no alternative api that allows embedded null. there's no way to open a file by using a string containing \0, no matter what api you pick, but you can write data to a file which contains null if you pick the right function. there's no a priori way to know which apis have this duplication and which don't.


I don't follow what you're getting at. If there's no length field, it's going to stop scanning on the first null. If there is a length field, it will keep reading until the length is reached. That's the general contract with C APIs.

If you actually NEED strings with nulls in them (although I couldn't think of a reason why), you'll need to use/find/create APIs with length fields.


> If there is a length field, it will keep reading until the length is reached.

Well, it might. If it ever uses that string internally with some function that expects a null termination then it'll probably still get truncated.
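
A minimal sketch of that failure mode, assuming a made-up API (both function names here are hypothetical): the first function accepts a length but sizes its copy with strlen(), so the embedded NUL still truncates; the second honors the length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical API: accepts a length, but sizes the copy with strlen(),
   so everything after the first NUL is silently dropped.
   Returns the number of bytes copied into dst. */
size_t copy_label_truncating(char *dst, size_t dstcap,
                             const char *src, size_t len)
{
    size_t n = strlen(src);          /* ignores len: stops at the first NUL */
    (void)len;
    if (n > dstcap) n = dstcap;
    memcpy(dst, src, n);
    return n;
}

/* Length-respecting variant: copies exactly len bytes (capped at dstcap). */
size_t copy_label_exact(char *dst, size_t dstcap,
                        const char *src, size_t len)
{
    size_t n = len < dstcap ? len : dstcap;
    memcpy(dst, src, n);
    return n;
}
```

Called with the 5-byte buffer "ab\0cd", the first copies 2 bytes and the second copies all 5.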


Possibly, depending on the API. But then again, strings should not have embedded NUL characters. There's no good reason for it. You have 32 other control characters to choose from.


Okay, don't think in terms of function pairs, but in terms of use case. You might want to write unknown UTF16 to a file, and you might want to write UTF16 to a filename.

But only one of these is possible. Why? because there's an api for writing contents to a handle that takes a length, but not one for opening a file that does.

And you can't “create” APIs for what the kernel doesn't support. These are limits on what is possible without forking Linux or some other kernel.
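
To make the filename case concrete, here is a rough sketch for a POSIX system (the function name is made up for illustration): it builds a path with an embedded NUL, opens it, and checks which file actually appeared.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Tries to create a file named "<prefix>\0<suffix>" and reports whether
   a file named just "<prefix>" exists afterwards.
   Returns 1 if the name was truncated at the NUL, 0 if not, -1 on error. */
int nul_truncates_filename(const char *prefix, const char *suffix)
{
    char path[512];
    size_t plen = strlen(prefix), slen = strlen(suffix);
    if (plen + 1 + slen + 1 > sizeof path) return -1;

    memcpy(path, prefix, plen);
    path[plen] = '\0';                          /* the embedded NUL */
    memcpy(path + plen + 1, suffix, slen + 1);  /* suffix plus terminator */

    FILE *fp = fopen(path, "w");   /* fopen only sees the bytes before the NUL */
    if (!fp) return -1;
    fclose(fp);

    int exists = (access(prefix, F_OK) == 0);
    remove(prefix);                /* clean up the file we created */
    return exists;
}
```

There is no length argument to pass anywhere in that call chain, so the suffix never reaches the kernel.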


That's the whole point. The article even starts with an example of this. printf doesn't let you write strings with embedded NULs, but you can use fwrite instead to write them. The 3 cases given in the article of places where you cannot use NUL bytes are places where not only is there no API that lets you include the NULs, but it's impossible to have one (without changing the kernel interface).
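
A small sketch of that contrast, using fputs as a stand-in for the NUL-terminated family that printf("%s") belongs to (the helper names are invented for this example). Each helper writes to a temporary file and reports how many bytes landed there.

```c
#include <stdio.h>

/* Writes buf as a NUL-terminated string; returns the bytes that reached
   the stream, or -1 if no temporary file could be opened. */
long bytes_written_with_fputs(const char *buf)
{
    FILE *fp = tmpfile();
    if (!fp) return -1;
    fputs(buf, fp);                 /* stops at the first NUL */
    long n = ftell(fp);
    fclose(fp);
    return n;
}

/* Writes exactly len bytes; embedded NULs are just data here. */
long bytes_written_with_fwrite(const char *buf, size_t len)
{
    FILE *fp = tmpfile();
    if (!fp) return -1;
    fwrite(buf, 1, len, fp);
    long n = ftell(fp);
    fclose(fp);
    return n;
}
```

With the 7-byte buffer "abc\0def", the first reports 3 and the second reports 7.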


That's by design. You wouldn't want a NUL in the middle of a string in any modern system. It's such an esoteric use case that there's no point in even supporting it. There are plenty of binary data interfaces to do what you want in situations where it actually makes sense.

PHP supports NUL, and the result is a HUGE amount of extra support code to handle it, spread all through the PHP codebase (introducing TONS of bugs over the years), all for the 0.00001% use case. Not a good trade.

Complaining about lack of NUL support would be like complaining that most text editors don't display the DC1 control character. It's just not useful in 99.999% of use cases, so there's no point in supporting it. You'd use a specialized tool instead.


Hey. People wrote actual working SOCKS proxies in PHP, and given that PHP's low-level functions represent an array of bytes as a string of chars, that'd be impossible if NUL were forbidden in them.


I see...

So rather than, say, simply using the binary array data type already present in C, they decided that they needed to subvert strings to support passing non-string data, resulting in tens of thousands of lines of support code to parallel the C runtime library, thousands more lines of bridge code for any outside libraries they might want to make use of, and almost a million lines of PHP implementation code (not including extensions!) carrying this extra cognitive burden in order to support the 1% use case, resulting in countless bugs and security vulnerabilities whenever someone used the C library functions, or stopped on NUL (or not) by mistake.

Sounds like a great plan.


I ran across an issue with some Win32 API call which took a char pointer and a length, but which terminated the string at the first null char. I forgot which one it is, but it was related to text boxes (edits) somehow.


Passing strings with embedded NULs would probably break the API in other ways anyway. I'm glad they guarded against that.


> although I couldn't think of a reason why

One I've seen in the wild is an encoded data structure, a blob of null terminated strings passed around together.
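
Roughly what reading such a blob looks like, as a sketch: the total length has to travel alongside the pointer, since strlen() would only ever see the first entry. The function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Counts the NUL-terminated strings packed into blob[0..len). */
size_t count_packed_strings(const char *blob, size_t len)
{
    size_t count = 0, pos = 0;
    while (pos < len) {
        const char *end = memchr(blob + pos, '\0', len - pos);
        if (!end) break;            /* unterminated trailing fragment: ignored */
        count++;
        pos = (size_t)(end - blob) + 1;
    }
    return count;
}

/* Returns a pointer to the index-th packed string, or NULL if out of range. */
const char *nth_packed_string(const char *blob, size_t len, size_t index)
{
    size_t pos = 0;
    while (pos < len) {
        const char *end = memchr(blob + pos, '\0', len - pos);
        if (!end) return NULL;
        if (index == 0) return blob + pos;
        index--;
        pos = (size_t)(end - blob) + 1;
    }
    return NULL;
}
```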


... in which case you're conceptually dealing with a structure, not a string.


Sure, once you’ve unpacked it into a structure. It’s used for sending stuff over the wire, but also for storing on disk etc.

Conceptually it was dealt with as one string, which happened to contain nulls.


You'd be better off encoding it using ETB (end of transmission block). That's what the control code was designed for.


I can’t really speak to the why. In any case.

I remembered seeing nul in strings in git, not sure if the same place (maybe?) but a quick search shows up some real world usages https://github.com/git/git/search?q=nul


memcpy/memcmp
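
For instance, a quick sketch of why the mem* family matters once NULs are in play (helper names invented here): two buffers that differ only after an embedded NUL compare equal as C strings but not as bytes.

```c
#include <stddef.h>
#include <string.h>

/* Compares up to the first NUL only. */
int equal_as_c_strings(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Compares exactly len bytes, NULs included. */
int equal_as_bytes(const void *a, const void *b, size_t len)
{
    return memcmp(a, b, len) == 0;
}
```

With "key\0one" and "key\0two" (7 bytes each), the string comparison says equal and the byte comparison says different.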


Pedantic nit: the ASCII '\0' character is called NUL, not null or NULL. C strings are NUL-terminated or zero-terminated, not null-terminated.

https://en.wikipedia.org/wiki/ASCII


IMO, you're being a bit too pedantic here; it's perfectly reasonable to refer to a "null byte" or "null character". https://en.wikipedia.org/wiki/Null_character

It's true that "NUL" is the usual abbreviation for this value in character code charts/standards.


I am disappointed this URL does not redirect to that page:

https://en.wikipedia.org/wiki/%00

But what the hell does it do??? In Safari and Firefox, I get an nginx 400 Bad Request page from en.wikipedia.org. But in Chrome, it seems to be redirecting to a google search for the same url, when I type it into the address bar. Well, that's meta. Chrome won't even let me drag-and-drop that icky %00 terminated url from one page into another page to navigate there -- it angrily rejects it and sadly animates the evil url back to where it came from (though dragging it into an existing or new tab mysteriously works). But actually clicking on that link immediately goes to a blank purgatory page with the url "about:blank#blocked". Those are Chrome's stories, and it's sticking with them.

At least this works:

https://en.wikipedia.org/wiki/%01

>Special Page

>Bad title

>The requested page title contains an invalid UTF-8 sequence.

>Return to Main Page.

Oh, yeah -- PHP:

https://webmasters.stackexchange.com/questions/84008/url-enc...

If not the actual NUL character, then at least the Unicode Symbol for NUL redirects to the page on the Null character.

https://en.wikipedia.org/wiki/␀

>Null character

>From Wikipedia, the free encyclopedia

>(Redirected from ␀)

>For other uses, see Null symbol.

...But then again, shouldn't the Null symbol ␀ redirect to the page on the Null symbol, which it actually is, not the page on the Null character, which it only symbolizes?

https://en.wikipedia.org/wiki/Null_symbol


> But what the hell does it do??? In Safari and Firefox, I get an nginx 400 Bad Request page from en.wikipedia.org. But in Chrome, it seems to be redirecting to a google search for the same url, when I type it into the address bar.

Firefox makes a GET request to "https://en.wikipedia.org/wiki/%00", the server returns an HTTP/2 400 Bad Request. Presumably because the web-server considered the URL invalid.

Chrome decides the string isn't a valid URL up front. So it does what it normally does when you enter random junk in the address bar; it searches for it.

The dirty secret of URLs is that no one can quite agree on which ones are valid or how they should be canonicalized.

We can take WHATWG's spec as a modern way to handle URLs [1]. If I'm reading it right (50/50 chance!) the URL would be considered valid by that spec.

See also this article from the developer of curl: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/

[1] https://url.spec.whatwg.org/


> The dirty secret of URLs is that no one can quite agree on which ones are valid or how they should be canonicalized.

Fun story. An engineer was working to migrate an old system from Python 2 to 3 before the Python 2 EOL deadline. The engineer decided to use the str type to represent URLs. Chaos ensued when suddenly non-UTF-8 URLs don't work any more. Turns out back when that system was designed, people were directly URL-encoding binary data into URLs.


> ...But then again, shouldn't the Null symbol ␀ redirect to the page on the Null symbol, which it actually is, not the page on the Null character, which it only symbolizes?

Perhaps it should. That page actually did not exist when the redirect was created:

https://en.wikipedia.org/w/index.php?title=␀&action=history

https://en.wikipedia.org/w/index.php?title=Null_symbol&actio...


NUL is explicitly 7 bits. Any other null could be larger than a machine byte.


Even your link specifically says the name of '\0' is Null; NUL is merely an abbreviation.

See the table here https://en.wikipedia.org/wiki/ASCII#Control_characters


Honest question (and not being rude). Is that strictly just being pedantic about it? Is the concept of NUL vs "NULL" semantically the same, and so you are just pointing out the ASCII abbreviation being used (three characters being consistent in the spec)? Genuinely curious if the concept of ASCII NUL '\0' can be interpreted differently in other contexts (other languages maybe)?

[edit, to clarify the question]

Like for example, the ASCII code '\0' is still a valid "thing" (it's a byte sequence, kind of). But in other contexts (other languages maybe), NULL is not a "thing" per se; it's more of a non-thing. How does a C programmer see the difference between NUL and NULL?


The two things referred to by the terms ”NUL byte” and ”null pointer” (aka ”NULL”) are quite distinct and mixing them up is not advisable, but calling the NUL byte a NULL byte, where it’s clear from the context what is meant, is not that confusing.


In the context of C, where NUL characters are most likely to be discussed, "NUL byte" and "null pointer" refer to exactly the same thing, that thing being the integer value 0.


Only if pointers are 8 bits.

More specifically, "NUL byte" is the character that has value 0, whereas "null pointer" is the pointer that has value 0.

OK, I'll try to stop being the pedant. I'll let the real pedants poke holes in what I said...


> More specifically, "NUL byte" is the character that has value 0, whereas "null pointer" is the pointer that has value 0.

Where is this enforced? You can assign between notional types without a problem. Whatever the context, providing 0 will have the same effect no matter how you labeled it.


Enforced? Not, as I think you're pointing out, by prohibiting conversion. That is, if

  char c = '\0';
  void* p = (void*)0;
  int a = (int)c;
  int b = (int)p;
then

  a == b
resolves to true. In that sense, they are the same - which I believe is your point. In that point, you are completely correct.

The difference is that I can't do

  *c
In that sense, they are not the same - not in the sense of numeric value, but in the sense of type. Also, c is 8 bits, and p (at least these days) is 32 or 64. (I pity anyone who ever had to work in an environment where p was 8 bits!) So they are the same numerically, but they are different both in type and in memory footprint.


a == p will also resolve to true. So will c == p. a and b don't add anything to your example; they're just in there to make c and p look like they're more different than they are.

    $ cat zero.c
    #include <stdio.h>

    int main(int argc, char* argv[]) {
     char nul = 0;
     void* null = 0;

     if( nul == null ) {
      printf("compared char to pointer; they are the same\n");
     } else {
      printf("found a difference between char and pointer\n");
     }
     return 0;
    }

    $ gcc -o zero zero.c
    zero.c: In function ‘main’:
    zero.c:7:10: warning: comparison between pointer and integer
      if( nul == null ) {
              ^~
    $ ./zero
    compared char to pointer; they are the same
    $
You get a warning, but not an error, for making the comparison. By contrast, assigning the integer zero to a void* isn't even a warning -- it's just a natural thing to do. There isn't another way to set a pointer to NULL. There is another way to set a character to 0, the '\0' syntax, but that's not a warning either.

C will think nothing of adding '!' to 'P' and getting 'q'. That's not strange because addition is a pretty normal thing to do with integers. You're right that a char variable should only occupy 8 bits of memory, but that's an implementation artifact, not a theory of what the value '\0' means. That value is unambiguously the integer zero with infinite precision. The reason it only occupies 8 bits is that you can't let it have infinite bits.


Hmm. Probably true. What about c == p? I don't recall whether both will promote to numeric, or whether it's a type error.

For that matter, what does (uncasted) p = c do? How about c = p?

[Edit: You updated while I was writing; my first question you already answered. Warning, not error.]


    $ cat pointers.c
    #include <stdio.h>

    int main(int argc, char* argv[]) {
     char Z = 'Z';
     char q = 'q';
     void* null = 0;

     printf("Z is \\x%02x\n", Z);
     printf("But if it were a pointer, it would be %08x\n", Z);
     printf("Watch this:\n\n");

     null = Z;

     /* %p to print a pointer value */
     printf("Our void* is now: %p\n", null);

     q = null;

     printf("And q is: %c\n", q);

     return 0;
    }
    $ gcc -o pointers pointers.c
    pointers.c: In function ‘main’:
    pointers.c:12:7: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      null = Z;
           ^
    pointers.c:17:4: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
      q = null;
        ^
    $ ./pointers
    Z is \x5a
    But if it were a pointer, it would be 0000005a
    Watch this:

    Our void* is now:     0x5a
    And q is: Z
    $
The assignments are warnings. They work just like you'd expect them to work.

Notice all the different printf flags? This is why you need them.


The literal 0 in a pointer context will be converted to a NULL pointer, which can be a non-zero bit pattern (there are some systems where the actual NULL pointer isn't all zeros). Going through a variable might not do what you think. So this:

    char *p = 0;
is fine, but

    intptr_t a = 0;
    char *p = (char *)a;
might not do what you expect (set p to the NULL pointer).
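
A sketch of how one might probe this on a given platform (all three function names are made up). Only the first result is guaranteed by the language; the other two are expected to return 1 on mainstream systems, but that is a property of the implementation, not of C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The constant 0 in a pointer context is a null pointer constant,
   so this returns 1 on any conforming implementation. */
int literal_zero_is_null(void)
{
    char *p = 0;
    return p == NULL;
}

/* Converting a run-time integer zero is implementation-defined;
   the result is whatever this platform does. */
int runtime_zero_is_null(void)
{
    volatile intptr_t a = 0;     /* volatile keeps it a run-time value */
    char *p = (char *)a;
    return p == NULL;
}

/* Reports whether the null pointer is stored as all-zero bytes here. */
int null_is_all_zero_bits(void)
{
    char *p = NULL;
    unsigned char zeros[sizeof p];
    memset(zeros, 0, sizeof zeros);
    return memcmp(&p, zeros, sizeof p) == 0;
}
```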


That’s not accurate. NUL byte is a byte with an all-zeros bit pattern, where a null pointer is a special pointer value (and thus pointer-sized) that can be coerced from the integer literal 0, but not in general from an arbitrary integer with value zero, and what’s more, a null pointer value is not guaranteed to have an all-zero bit pattern!


Fine as long as you call the backspace character BS and the tab character HT.


But then you have to call vertical tab VT, and then you have no name for a virtual teletypewriter.


Better let Wikipedia know that then! https://en.wikipedia.org/wiki/Null-terminated_string


I agree, and I think it's an important distinction as sizeof('\0') is not the same as sizeof((void*)0).

However, the standard does talk about null characters throughout as being a character code of value zero, as opposed to NULL the macro.
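
One wrinkle worth knowing before leaning on that comparison: in C (unlike C++), a character constant such as '\0' has type int, so sizeof('\0') is sizeof(int), not 1. It differs from the pointer size on typical 64-bit targets but can coincide on 32-bit ones. A small sketch, with helper names invented for this example:

```c
#include <stddef.h>

/* Size of a char object holding NUL: always 1 by definition. */
size_t size_of_char_object(void)
{
    char c = '\0';
    return sizeof c;
}

/* Size of the character constant itself: it has type int in C. */
size_t size_of_char_constant(void)
{
    return sizeof('\0');
}

/* Size of a null pointer of type void *. */
size_t size_of_null_pointer(void)
{
    return sizeof((void *)0);
}
```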


I used to be pedantic about this one, now I just use 'null' because that's what everyone else uses and so they'll know what I'm talking about. If I really want to be picky I'll use '\0'.


So the BEL character rings a bel on the user's terminal?


ACK!


Unexpected places where you can use null bytes: gets, fgets and scanf("%s"). All three will read and store null bytes into your string from the input, and keep going: gets and fgets only terminate at a newline character and scanf only terminates at whitespace (which doesn't include the null byte).

gets and scanf("%s") are also horrifically unsafe. gets is well-known to be unsafe (to the point where you'll almost certainly get a compiler warning for using it). However, scanf("%s") is unsafe for exactly the same reason (no bound on the buffer length) yet will not produce a compiler warning. Add to that the fact that these functions will accept null bytes, and you have a very dangerous buffer overflow waiting to happen.
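
A sketch of the fgets case, to show what "keeps going" means in practice (the function name is made up): the input line is a, b, NUL, c, d, newline. fgets stores all six bytes, but strlen() on the result only sees two.

```c
#include <stdio.h>
#include <string.h>

/* Reads the line "ab\0cd\n" back with fgets.
   *visible_len receives strlen() of the buffer; *stored_len receives the
   number of bytes fgets actually stored, newline included.
   Returns 0 on success, -1 if the temporary file could not be used. */
int fgets_with_embedded_nul(size_t *visible_len, size_t *stored_len)
{
    static const char input[] = { 'a', 'b', '\0', 'c', 'd', '\n' };
    char buf[32];
    FILE *fp = tmpfile();
    if (!fp) return -1;

    fwrite(input, 1, sizeof input, fp);
    rewind(fp);

    memset(buf, 0x7f, sizeof buf);   /* filler that is neither NUL nor '\n' */
    if (!fgets(buf, sizeof buf, fp)) { fclose(fp); return -1; }
    fclose(fp);

    *visible_len = strlen(buf);      /* stops at the embedded NUL */
    const char *nl = memchr(buf, '\n', sizeof buf);
    *stored_len = nl ? (size_t)(nl - buf) + 1 : 0;
    return 0;
}
```

Any code that later trusts strlen() for the line length will quietly lose the "cd\n" part.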


This is why you _always_ write:

    if (*s && *s != '\n' ...)
and never:

    if (*s != '\n' ...)
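
Spelled out as a complete loop, a sketch of that idiom might look like this (line_length is just an illustrative name). It stops at the newline or at the terminator, whichever comes first, so it never walks off the end of a string that has no newline.

```c
#include <stddef.h>

/* Length of the first line in s, not counting the newline. */
size_t line_length(const char *s)
{
    size_t n = 0;
    while (s[n] && s[n] != '\n')   /* check for the terminator first */
        n++;
    return n;
}
```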


One unexpected place where null bytes are acceptable: Wi-Fi SSIDs. That’s one way to keep people off your network, I suppose.


> While we’re on the topic, it’s worth noting that the only other restriction on filenames is that they cannot contain a /, which is the character used to denote directories. Filenames can contain arbitrary other binary data, including spaces and newlines, and there’s no defined character encoding.

I've seen this bite people when junk gets written to a filename (either accidentally or maliciously). Especially in shells but also in other programming languages. Issues that aren't always handled well include file names that:

* include a newline or some other control characters

* start with a `-`

* aren't valid UTF-8


GatorBoxes (an AppleTalk<=>Ethernet Apple File Sharing / NFS bridge) would let Mac users create files in the Unix file system with slashes in their names (presumably because of NFS shenanigans going on in the kernel).

Macs let you use "/" in file names, but instead used ":" as the path separation character.

It seemed to work fine and dandy at the time, until you ran "restore" and discovered your backups were corrupted!

https://en.wikipedia.org/wiki/GatorBox

https://news.ycombinator.com/item?id=20007875

Just for fun: guess what happens when you create a Mac file name with a slash in it today?

It works!

Or it seems to work. But behind the scenes, the Finder and Mac user interface libraries actually convert the "/" to a ":", which you can see with "ls". But at least it doesn't corrupt your backups!


Classic MacOS used : for the directory separator, so you could freely put / in your filenames. Modern macOS converts / in the UI to : under the hood in order to retain compatibility, while banning : in the UI (the same restriction that classic MacOS had).


Aren't the filenames . and .. also restricted in Unix?

Windows has many more restrictions. [1]

[1] https://kizu514.com/blog/forbidden-file-names-on-windows-10/


Note that "forbidden" filenames only apply to win32 paths. "Extended-length" paths can be used to create and open such files. This can cause issues with applications that don't know how to handle them.

But yes, Windows reserves (16 bit encoded) ASCII control characters and some special symbols like question mark and colon.

As for . and .. I believe that on UNIX they are just symlinks, though they may be treated specially by applications? On Windows they don't really exist but some APIs emulate them when resolving paths.


> As for . and .. I believe that on UNIX they are just symlinks, though they may be treated specially by applications?

No, they are treated specially by the kernel and/or the filesystem. They behave similarly to hardlinks to the corresponding directory (hardlinks to directories aren't allowed nowadays, these two are an exception). The special treatment by applications is that many applications hide all files and directories starting with a dot, which happens to also apply to these two.
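
One way to see the hardlink-like behavior from user space, as a sketch for a POSIX system (same_file is a made-up helper): compare device and inode numbers for a directory and its "." entry.

```c
#include <sys/stat.h>

/* Returns 1 if both paths resolve to the same file (same device and
   inode), 0 if they differ, -1 if either stat() call fails. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0) return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For example, same_file("/", "/.") should report 1, and so should same_file("/", "/..") since the root directory is its own parent.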


Ah, thanks for the correction.


Also peculiar, in Windows, is handling of files that have spaces at the end :) - I've seen in my gamedev career more than one case, where a tool doesn't do enough trimming on spaces only to bite us later.


I have recently accidentally created a folder named “~”. Then I tried deleting it through shell.


Deleting a dir with rmdir is always safe :-)

The other trick that makes dangerous operations always more safe is prepending "./" to the path.
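
A sketch of the "./" trick for code that builds an argument list for another program (the function name is hypothetical). A relative name gets "./" in front, so a tool can't read it as an option and a shell won't treat a leading ~ as the home directory; absolute paths are left alone.

```c
#include <stdio.h>

/* Writes a safer spelling of name into out.
   Returns 0 on success, -1 if out is too small. */
int make_safe_operand(const char *name, char *out, size_t outsz)
{
    int n;
    if (name[0] == '/')
        n = snprintf(out, outsz, "%s", name);     /* already unambiguous */
    else
        n = snprintf(out, outsz, "./%s", name);   /* "-rf" becomes "./-rf" */
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```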


You can "safely" delete it with non-recursive rmdir


I did similar, once. I had a script where I misquoted and ended up with a directory starting with ~

It was the only directory starting that way, so I typed "rm -rf ~<TAB><ENTER>"

Hit control-C a half second later, but the damage was done. Fortunately most of my important files are backed up.

Lesson: when deleting files with tricky names, write the command without flags first, then add "-rf" after the path is confirmed correct.


I’ve got lucky with some directory that is called .asound or similar. It was the first one in the home directory and it didn’t manage to go beyond that.


There are two distinct sets of restrictions on file names: those of the OS API (in Unix, that means no null bytes, and / is a directory separator), and those of the file systems in the path, which can be (almost) anything.

Quite a few file systems require names to be valid text in a specific encoding, not arbitrary byte sequences.

Also, note the use of a plural in “the file systems in the path”. A file system mounted in another one can change the rules halfway through. There is no single fixed set of restrictions on file names.


Neat article, but the author doesn't provide an actual reason you'd _need_ to pass a NUL byte into something like a socket address, or command arg.

Is there a (perhaps obvious, or not) common usage of NUL byte literals being passed around, not for the purpose of terminating strings? Just terrible ye-olde file formats?


All kinds of binary data might have a NUL byte. E.g. if you want to write a NUL byte to a file in the shell, you might try something like

    echo -n $'\0' > nul
This doesn't work for the reason stated in the article. The argument is instead interpreted as a string ended at the NUL byte and the file will be empty. BTW you can get around this with printf since it processes escape sequences internally.

    printf '\0' > nul


There's also `echo -ne '\0'`, which works similarly (it tells echo to interpret the escape sequences).


The article does mention one usage, near the end, starting with the text "The “abstract” socket feature..."
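
For anyone curious what that looks like in code, here is a sketch of building such an address, assuming Linux (abstract sockets are a Linux extension; the function name is made up). The name begins with a NUL byte, and the length passed to bind() says where it ends, which is why NULs are usable here at all.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Fills addr for the abstract socket called name.
   Returns the address length to pass to bind(), or 0 if name is too long. */
socklen_t make_abstract_addr(struct sockaddr_un *addr, const char *name)
{
    size_t n = strlen(name);
    if (n + 1 > sizeof addr->sun_path) return 0;

    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    addr->sun_path[0] = '\0';                /* leading NUL marks "abstract" */
    memcpy(addr->sun_path + 1, name, n);     /* no terminator is needed */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}
```

The returned length would then go to bind(fd, (struct sockaddr *)&addr, len) on a socket created with socket(AF_UNIX, SOCK_STREAM, 0).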


Others have mentioned "arbitrary binary data", but a concrete example of that (that I know of, not that would be particularly useful to pass around) is the WASM bytecode format: it starts with '\0asm', using the NUL byte to signify to tools reading it that it isn't text and they shouldn't attempt to display it (or anything like that). However, it is a _byte_code format, and as such, it would be represented with a `char*` pointer in C. Yet, any APIs that use NUL terminated [byte]strings would stop at the very first byte, and not keep any more data beyond that.


Maybe not particularly useful but we have an nbdkit plugin that lets you specify disks on the command line. If the disk contains any zero bytes (which is very common for disk images obviously) then you have to use base64 encoding instead of directly encoding them, for example by using $(echo -e). http://libguestfs.org/nbdkit-data-plugin.1.html


File names in UTF-32 encoding?


One common usage: NUL bytes are useful as a separator between file names etc if you want to be prepared for any possible name. E.g. many commandline tools have a "-print0" flag to switch from newlines to NUL-bytes in their output.
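
On the consuming side, a sketch of reading such output in C, assuming a POSIX system where getdelim() is available (the function name count_nul_records is made up). Each record may contain spaces or newlines; only the NUL ends it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Counts the NUL-delimited records readable from fp, for example the
   output of a tool run with a -print0 style flag. */
size_t count_nul_records(FILE *fp)
{
    char *rec = NULL;      /* getdelim allocates and grows this buffer */
    size_t cap = 0;
    size_t count = 0;

    while (getdelim(&rec, &cap, '\0', fp) != -1)
        count++;           /* rec holds one name, newlines and all */

    free(rec);
    return count;
}
```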


You can printf a null byte just fine, you just need to provide the length, just like with fwrite:

printf("%.*s", (int) nbytes, str);
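
One caveat worth checking before relying on this: as the C standard describes %s, the precision is an upper bound on the bytes written, and output still stops at the first NUL in the array. A small sketch to measure it (the helper name is invented; snprintf with a NULL buffer just reports how many characters would be produced):

```c
#include <stdio.h>

/* Returns how many characters "%.*s" would emit for buf with the
   given precision. */
int precision_format_length(const char *buf, int nbytes)
{
    return snprintf(NULL, 0, "%.*s", nbytes, buf);
}
```

For the 7-byte buffer "abc\0def" with a precision of 7 this reports 3, so the precision helps with unterminated buffers but does not carry embedded NULs through; fwrite does.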


PostgreSQL TEXT fields



