The zero fd is valid but will never happen in this code.
Checking the return value of malloc is a bad idea (just rig the program to explode when any malloc fails). But casting the return value of malloc is incorrect, and that code should be using calloc.
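For what it's worth, here's a minimal sketch of the two idioms being recommended (the struct name and helper are made up for illustration): in C the void * returned by malloc/calloc converts without a cast, and calloc both zeroes the memory and checks the count-times-size multiplication for overflow.

    /* Sketch only; "struct entry" and make_entries() are illustrative. */
    #include <stdlib.h>

    struct entry { int key; char *value; };

    struct entry *make_entries(size_t n)
    {
        /* Avoid: (struct entry *)malloc(n * sizeof(struct entry));
         * the cast is unneeded in C and the multiplication can overflow. */
        struct entry *e = calloc(n, sizeof(*e));  /* zeroed, overflow-checked */
        return e;  /* NULL handling is the caller's policy, per this thread */
    }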
How do you know? What if someone forks this program and closes fd 0 before exec*()? You have to abide by the POSIX standard and what the man page tells you. It may never happen now, but the code can change later and the case can hit, and by then you'd most probably have forgotten about this check.
Malloc failures don't always have to result in aborting the program. Cases can vary.
My suggestion to you is to abide by the man page and always check error conditions. Don't overthink the failure cases of stdlib & POSIX calls. You never know the OS/environment your program will run in, and the only common ground you have is the standard.
You don't know. It's true that the line of code he cited was incorrect. It's just not a very interesting example of incorrect code.
Malloc failures don't always result in aborting. The common alternative, especially in programs that have careful malloc return value checking regimes, is to occasionally cough up remote code execution.
Userland systems programmers should assume the conservative default of ending the program immediately when malloc fails.
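In practice that default is usually a one-place wrapper rather than per-call checks; a minimal sketch (the name xmalloc is just the conventional one, not anything from the code under discussion):

    #include <stdio.h>
    #include <stdlib.h>

    /* Die-on-allocation-failure wrapper: one check, in one place. */
    void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL && size != 0) {
            fprintf(stderr, "out of memory allocating %zu bytes\n", size);
            abort();   /* fail fast rather than limp on with a NULL pointer */
        }
        return p;
    }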
A similar logic guided C++ into throwing bad_alloc instead of returning NULL on allocation failures. And a survey of modern C++ code will show you that most C++ programs simply allow themselves to terminate when bad_alloc happens.
It doesn't have to be interesting. Incorrect code covers a spectrum and is not black or white. The term "edge-case" describes it pretty well.
Regarding mallocs, again you don't have to overthink error-handling. Just handle it and figure out how to deal with it if it happens.
C++ has nothrow and try/catch. Whether you catch a failed new or let it abort is up to the user. One thing for sure is that C++ terminates on an unhandled failed new instead of stomping on unallocated memory.
On the other hand, consistency is important. Some parts of the code handled the fd checks and malloc failures correctly and other parts did not.
I recommend you applaud good programming practices and criticize bad ones. Defending or arguing for bad programming practices (whether they are edge cases or not) doesn't help.
He clearly has never heard of mallopt (http://man7.org/linux/man-pages/man3/mallopt.3.html), so he doesn't know that instead of wrapping every single malloc with an if statement you can just tell malloc to abort if it ever fails to allocate memory.
In his defense, most programs don't test the return of malloc and don't set abort-on-failure, so most programmers would just assume that if you're not checking your return values, you made a mistake.
If a daemon is not capable of surviving memory exhaustion, it is not suitable for deployment.
If a programmer is not capable of writing software that can survive memory exhaustion, they are not suitable for writing daemons.
I'll not argue the point further; the HN lynch mob is already beating at my door for daring to disagree with a site darling. Had I looked at your username before replying, I probably would have forborne from violating the bubble.
What's a single example of a daemon with more than 10,000 lines of C code that will reliably "survive" memory exhaustion? By "survive" we obviously both mean that it retains its original PID. Of course, most Unix serverside code "survives" by simply aborting and allowing itself to be restarted. Since that's what I'm recommending, that strategy doesn't count.
To find one, you're going to have to catalog every malloc() in the entire program and record some kind of recovery regime --- maybe degraded performance, maybe dropped requests, maybe fallback to some kind of pool --- for every allocation.
An in-memory message queue server or a memcached variant doesn't need to crash if it can't serve a particular request that requires more memory than is available.
I am not sure why error-handling is a difficult concept for you.
Yes, it's suboptimal to have a server that receives a request with a 32-bit length field indicating 3 gigs of incoming data and then bombs out trying to malloc 3 gigs.
I'm saying: what's an example of a server that reliably, for every allocation of every piece of metadata, every strdup, every hash table entry, every connection object, &c &c, has a recovery regime so that it doesn't have to fail ever when malloc does?
One way you could find such a program would be to compile a candidate and preload a malloc that randomly (1 in every 100 calls per allocation size, for instance) returns NULL. See if the program (a) continues to run and (b) passes some simple unit test suite.
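A rough sketch of that kind of preload shim, assuming glibc (it calls the internal __libc_malloc to sidestep interposition recursion) and a flat 1-in-100 failure rate rather than the per-allocation-size bucketing suggested above:

    /* failmalloc.c -- fault-injecting malloc for the experiment above.
     * Build: gcc -shared -fPIC -o failmalloc.so failmalloc.c
     * Run:   LD_PRELOAD=./failmalloc.so ./candidate-daemon
     */
    #include <stddef.h>
    #include <stdlib.h>

    extern void *__libc_malloc(size_t size);   /* glibc's real malloc */

    void *malloc(size_t size)
    {
        if (rand() % 100 == 0)      /* roughly 1 in 100 calls fails */
            return NULL;
        return __libc_malloc(size);
    }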
My contention is:
(a) Those programs will be hard to find, because most "production ready" Unix code does not have that property, and
(b) The criteria I'm talking about matters a lot, because (1) memory exhaustion strikes at totally arbitrary points in a program's execution, not just at the points where you're prepared to handle it, and (2) attackers can pinpoint exactly the allocation they want to have fail.
As an example of an approach that doesn't seem to work well: in "memqueue", the function that creates HTTP headers in responses allocates an array of iovecs. If that malloc fails, the header creation function returns -1. The function that calls the header-creation function, http_respond, catches that error and itself returns -1. Nothing ever checks the error return from http_respond.
(Also, the loop in which memqueue reads entire requests into memory by continuously realloc()'ing a receive buffer has an integer wrap bug in it, though it's probably not triggerable. But incorrect is incorrect, right?)
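(For readers unfamiliar with the pattern, here's a generic illustration of what an integer wrap in a grow-the-buffer loop looks like and the check that prevents it; this is not memqueue's actual code, and append() is a made-up helper.)

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Grow-and-append helper; the overflow check is the part that is usually
     * missing. Without it, *total + chunk can wrap to a small value, realloc
     * shrinks the buffer, and the memcpy writes past the end of it. */
    int append(char **buf, size_t *total, const char *src, size_t chunk)
    {
        if (chunk > SIZE_MAX - *total)
            return -1;                      /* would wrap */
        char *p = realloc(*buf, *total + chunk);
        if (p == NULL)
            return -1;
        memcpy(p + *total, src, chunk);
        *buf = p;
        *total += chunk;
        return 0;
    }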
Also worth noting: under normal conditions on Linux, malloc will never fail--unless you set something like an rlimit on address space, mmap will happily hand you as much address space as you want, then OOM-kill you (or worse, someone else...) when you try to make it resident. So even if we could write malloc-failure-safe software, which we can't, it'd be almost impossible to end up in that condition.
You don't have to receive a request asking for 3 gigs of data. You could be close to the edge, with not enough room left to receive this particular request.
There are many daemons that require you to keep what you have in memory and just fail the particular action you are doing instead of throwing everything away. DB servers, for example.
Memory exhaustion is an error that should be handled like other errors.
In memqueue, http_respond logs the failed memory allocation (in http_cli_resp_hdr_create) and returns -1. There's nothing else I need to do in this case. The connection will get dropped without a response.
When reallocing, I don't see the integer wrap bug. Can you point me to the line?
What I am suggesting is to not overthink system error handling. Just handle it; aborting is one type of handling but not always what you want. Programs run in various environments and to guarantee a defined behavior we need to abide by the standard.
Then you've misunderstood me, or I've miscommunicated. My argument is that the default handling strategy should be to abort. I'm not saying that special case handling is evil. I'm saying that defaulting to manually checking malloc's return value is evil.
Also, your chunked encoding decoder seems to be using a signed strtol() routine to read an unsigned length variable. I could be misreading; I didn't look carefully.
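To illustrate the concern generically (this is not the decoder's actual code; parse_chunk_len() is made up): strtoul() is the unsigned counterpart, but even it happily parses a leading minus sign and wraps it, so a defensive parser has to reject that too.

    #include <errno.h>
    #include <stdlib.h>

    /* Parse a hex chunk-size token into a size_t; 0 on success, -1 on bad
     * input. Whitespace handling here is an assumption about the caller. */
    static int parse_chunk_len(const char *s, size_t *out)
    {
        while (*s == ' ' || *s == '\t')
            s++;                   /* skip leading whitespace ourselves... */
        if (*s == '-')
            return -1;             /* ...so this catches "-1", which strtoul
                                      would silently wrap to a huge value */
        errno = 0;
        char *end;
        unsigned long v = strtoul(s, &end, 16);
        if (end == s || errno == ERANGE)
            return -1;             /* no digits, or overflow */
        *out = (size_t)v;
        return 0;
    }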
It depends on the goal in the end. As long as it's an explicit decision and you're not relying on the environment, the behavior can be expected to be defined.
For instance, I worked on an enterprise proxy where aborting on asserts wasn't acceptable. Why? Because the customer didn't want to interrupt his users even though in our opinion the proxy state was out of whack. This created a nightmare for us because it was hard to debug. We ended up fork()-ing and aborting on the side to debug the cores.
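For the curious, a minimal sketch of that fork-and-abort-on-the-side trick (soft_assert is an illustrative name, not the proxy's actual macro): the child dumps core for post-mortem debugging while the parent keeps serving.

    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* On a failed check, fork a child that aborts (and dumps core) while the
     * parent carries on; the core can be inspected later without downtime. */
    #define soft_assert(cond)                                          \
        do {                                                           \
            if (!(cond)) {                                             \
                pid_t child_ = fork();                                 \
                if (child_ == 0)                                       \
                    abort();                  /* child dumps core */   \
                else if (child_ > 0)                                   \
                    waitpid(child_, NULL, 0); /* reap the child */     \
            }                                                          \
        } while (0)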
"The key feature is a built-in custom TCP/IP stack capable of handling millions of DNS queries-per-second per CPU core."
DNS uses UDP primarily. I suspect the author meant "UDP/TCP/IP" by "TCP/IP".
Years ago, I wrote a basic TCP stack for a honeypot research project. It is hard and incredibly complex. So this statement raises a number of concerns, and such a stack will need to be audited before being used in production.
TCP/IP is the common name for the Internet Protocol Suite which UDP is part of.
Also, since many mobile clients tend to issue requests over TCP (it being more reliable on mobile), since many responses can now be larger than 512 bytes, and since EDNS has not been widely adopted, DNS over TCP will probably overtake UDP quite quickly.
The text below is from the RFC, and no, it does not relate to zone transfers but to normal DNS queries :)
All general-purpose DNS implementations MUST support both UDP and TCP
transport.
o Authoritative server implementations MUST support TCP so that they
do not limit the size of responses to what fits in a single UDP
packet.
o Recursive server (or forwarder) implementations MUST support TCP
so that they do not prevent large responses from a TCP-capable
server from reaching its TCP-capable clients.
o Stub resolver implementations (e.g., an operating system's DNS
resolution library) MUST support TCP since to do otherwise would
limit their interoperability with their own clients and with
upstream servers.
Stub resolver implementations MAY omit support for TCP when
specifically designed for deployment in restricted environments where
truncation can never occur or where truncated DNS responses are
acceptable.
And for the most important part :P
Regarding the choice of when to use UDP or TCP, Section 6.1.3.2 of
RFC 1123 also says:
... a DNS resolver or server that is sending a non-zone-transfer
query MUST send a UDP query first.
That requirement is hereby relaxed. A resolver SHOULD send a UDP
query first, but MAY elect to send a TCP query instead if it has good
reason to expect the response would be truncated if it were sent over
UDP (with or without EDNS0) or for other operational reasons, in
particular, if it already has an open TCP connection to the server.
Good call-out. Maybe the parent is just loosely referring to larger TXT records (or maybe DNSSEC or IPv6?), etc. My guess is that even these are probably a relatively small percentage of overall DNS traffic, and that's not likely to change anytime soon.
So I can't imagine why records would all of a sudden exceed 512 bytes on average either.
UDP isn't as reliable on mobile connections, so many mobile clients issue a TCP DNS request alongside the UDP request rather than waiting out the usual 5-second timeout.
DNS records also seem to be growing, and EDNS has not been adopted very well.
Other things like Chrome's async DNS prefetch also seem to use TCP as much as UDP for some reason, especially to Google's DNS servers.
The updated RFC mandates TCP support for regular DNS, and although I can't point to a single reason (other than IPv6 records, TXT record use, and DNSSEC) for why, I have a strong feeling that mobile and browser optimizations are a good part of it.
When your browser does DNS prefetch from the DOM, it becomes much more efficient to open a single TCP connection and issue all of the DNS requests over it (and with CDNs, ads, captchas, social media, and third-party content you can easily get to 20+ distinct DNS records per page) than to issue individual async DNS queries over UDP.
This will be both faster and, more importantly, more reliable for the next step, the TCP preconnect, once the browser has resolved all the DNS records from the DOM even before loading it fully.
When your browser does DNS prefetch from the DOM, it becomes much more efficient to open a single TCP connection and issue all of the DNS requests over it (and with CDNs, ads, captchas, social media, and third-party content you can easily get to 20+ distinct DNS records per page) than to issue individual async DNS queries over UDP.
How are there any fewer RTTs with TCP DNS than there would be with UDP? I'm not seeing the efficiency here.
There is a per-packet cost to processing requests. TCP can bundle them; UDP cannot. Is it large enough to matter? I don't know. But the cost includes evaluating context of the requesting entity, which might not be meaningful for DNS queries. Maybe if they are related, some working-set caching would occur.
RTT might not improve at all, but lag might. Scripts often make the mistake of asking for information when they need it instead of before they need it, so it will be ready when needed. The suggested approach would pre-load the DNS info and might reduce lag.
TCP sends successive data elements together, at least as part of the Nagle algorithm. I have no idea if the way DNS uses TCP can trigger Nagle.
In fact my own UDP protocol does something similar to Nagle as well. There's no good reason UDP protocols can't pick and choose what features they include. But most don't.
I (a) don't think TCP DNS routinely stuffs two requests into a single TCP segment and (b) don't believe TCP DNS is ever more performant than UDP DNS, including on mobile --- UDP gets a head start from not having a three-way handshake, and doesn't have rate limiting, which TCP does. TCP headers are also much larger than UDP headers.
Even the reliability argument doesn't make sense. Yes, TCP is "reliable". But so is UDP DNS, and in exactly the same way: if a request or response is dropped, it's retransmitted.
Nagle, for what it's worth, is an HN contributor. You could just ask him. :)
Agreed on all counts. It's a stretch to imagine TCP is better at performance.
{edit} Though performance isn't really about wire time or packet size - it's about CPU time on either end plus buffering, including router time, since that's a CPU in the path.
I wrote a basic TCP stack for a honeypot research project. It is hard and incredibly complex.
Yes and no. Most of the complication comes from extra functionality (segmentation offload, checksum offload, SACK) or from functionality which is required by the standard but not relevant for a DNS resolver (congestion control, window management, TCP timers).
If all you're doing is accepting a TCP connection, reading a small request, and writing a small response back, you can remove about 90% of the code from your TCP stack.
I don't write TCP stacks, but Juho Snellman writes one for a living, and I found the following anecdote of his on writing an interoperable TCP stack interesting.
I'm not the OP, but I think it's fair to call it complex, and I'd pick three requirements out in particular.
1. Path reachability, MTU discovery and MSS interaction
When sending outbound packets, you have to correlate incoming ICMP error messages in case they signal a problem. If the problem is that the packet is too big, you have to figure out what the MTU really is (which can take repeated attempts), so that you know what MSS to use (for TCP, or fragmentation boundary for UDP). If the path is unreachable, you have to remember that too. In both cases, you need some kind of global book-keeping so that you can do the right thing across connections. Some protocols (like active FTP) implicitly rely on MTU discovery on one connection signaling the MSS for another connection, so everything has to be path based, rather than connection based. Messy.
2. State management for error correlation
O.k., so you've figured out how to fragment an outgoing datagram and know what boundary to use, but how do you handle incoming error messages related to the fragments? Even for UDP, or other "stateless" protocols you actually do have to keep state so that you can correlate those error messages to the packets you sent. When the error message comes back, it will have the IP ID of the fragment, but nothing else is guaranteed.
This goes for (1.) too, but ICMP error messages can also be recursive and nested, and for a correct implementation you need to consider how to handle ICMP error messages that were themselves triggered by ICMP error messages. Several userspace stacks get this wrong, and can't correctly handle MTU discovery for UDP, or double-error correlation.
3. Heuristic and inconsistent caps on state
Many TCP implementations support selective acknowledgements and duplicate-ACK signalling, but what are their tolerances? Just how much data can be retransmitted or handled out of order before you have to resend the whole window? There's no way to know, and if you get it wrong you can end up stalling a TCP connection for a significant delay. Unfortunately there are no simple limits, and in some cases the volumes are related to bandwidth-delay products, necessitating some kind of integral control loop.
The problem with all of these is that they only show up "sometimes" and with particular networks or TCP stacks. I've limited these to interoperability issues - but there are other tricky complexities. For example, when building a TCP stack, do you optimize for throughput and so batch reads/writes of many packets - or do you optimize for a correct RTT estimate and do things more synchronously? It's not possible to have both (at least with today's NIC interfaces); sometimes RTT is critical (e.g. an NTP implementation, a real-time control system, or just any system that needs to rapidly recover from packet loss), sometimes throughput is more important. Definitely complex.
Getting a performant TCP is certainly hard. So, for that matter, is getting congestion control right --- TCP congestion control is devilishly hard. But you don't have to do either of those things to get an interoperable TCP!
Seems like this userland TCP stack could be the basis of these types of C10M servers. Are there existing userland TCP stacks available? I'm not familiar with any.
Absolutely, but I would say that this has the potential to be a lot more secure than a traditional (full) TCP/IP stack. Most queries are UDP (one packet), and we would expect that TCP connections should only last a few packets. TCP connections that don't match a handful of patterns would be suspicious (IMHO) and should probably just be dropped.
Of course, then you find some weird version of Windows XP that this breaks :-(
> TCP connections that don't match a handful of patterns would be suspicious (IMHO) and should probably just be dropped.
I used to believe that. It is unfortunately not true. TCP packets will come in all shapes and forms, and all must be treated equally, which is what makes TCP stacks so incredibly complex to implement.
The Linux TCP stack is quite safe and fast, especially with tight integration with NIC hardware (checksum offloading and the like). I'm a bit unsure what a custom stack in userland can provide that the standard kernel stacks don't have.
The Linux TCP stack is NOT fast. My DNS server can respond to DNS queries faster than the simplest of in-kernel echo servers (like ICMP ping or UDP port 7 echo). That's with the entire DNS overhead of parsing the DNS protocol, looking up the name in a very large database (like the .com zone), doing compression, and generating a response.
The upshot is this: going through the Linux stack, a DNS server is limited to around 2 million queries/second. Using a custom user-mode stack, it can achieve 10 million queries per second.
With the advent of DNSSEC, IPv6 and EDNS0 you're more likely to see DNS responses >512 bytes, therefore falling back to TCP (with truncate bit set). Therefore it is strongly recommended you do not drop/block tcp 53 on your middleboxes , firewalls etc.
Another interesting use case for TCP in DNS is anti-DDoS. If a botnet is abusing your DNS server to flood traffic to its target, flipping the 'TC' bit forces the request to come back over TCP, exposing the spoof.
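A minimal sketch of what flipping that bit looks like on the wire, assuming the RFC 1035 header layout (the helper name is made up):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Set the TC (truncated) flag in a raw DNS message: the flags word is
     * the 16 bits at offset 2 of the header, and TC is bit 9 (mask 0x0200). */
    static void set_tc_bit(uint8_t *dns_msg)
    {
        uint16_t flags;
        memcpy(&flags, dns_msg + 2, sizeof flags);
        flags = htons(ntohs(flags) | 0x0200);
        memcpy(dns_msg + 2, &flags, sizeof flags);
    }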
EDNS0 includes a mechanism for clients/resolvers to signal that they can handle a large/fragmented UDP response. At this point about 85% of requesters can handle UDP responses of at least 4K. For the moment, DNSSEC and EDNS0 are making falling back to TCP far less common than it used to be.
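For reference, that signal rides in an EDNS0 OPT pseudo-record appended to the additional section, with the requester's UDP payload size carried in the CLASS field (RFC 6891). A minimal sketch of emitting one (the function name is made up, and the header's ARCOUNT would also need bumping):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Write an 11-byte EDNS0 OPT RR advertising udp_payload_size (e.g. 4096)
     * into buf; returns the number of bytes written. No options in RDATA. */
    static size_t write_opt_rr(uint8_t *buf, uint16_t udp_payload_size)
    {
        size_t off = 0;
        buf[off++] = 0;                            /* empty (root) owner name */
        uint16_t type = htons(41);                 /* TYPE 41 = OPT */
        memcpy(buf + off, &type, 2); off += 2;
        uint16_t klass = htons(udp_payload_size);  /* CLASS = UDP payload size */
        memcpy(buf + off, &klass, 2); off += 2;
        memset(buf + off, 0, 4); off += 4;         /* ext-RCODE, version, flags */
        uint16_t rdlen = 0;                        /* no EDNS options */
        memcpy(buf + off, &rdlen, 2); off += 2;
        return off;                                /* 11 bytes total */
    }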
That may change, as some providers are starting to put smaller limits on their response sizes (to mitigate certain kinds of DDOS and response spoofing attacks). Of course permitting TCP 53 is required for DNS to work; as is permitting UDP fragments (which poorly configured firewalls often block too).
Sure; I'm not suggesting dropping TCP entirely, just that e.g. a 1MB request / response is not going to be legitimate, and so you can simply not implement a lot of TCP's complexity (e.g. window scaling)
Not related to this project, but Rob is the instigator of masscan (https://github.com/robertdavidgraham/masscan) as well, which claims to scan the entire IPv4 space in approx 6 minutes!!!
    fd = socket(AF_INET6, SOCK_DGRAM, 0);
    if (fd <= 0) {
This should be fd == -1. fd == 0 is valid.
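Something along these lines, as a sketch (the wrapper name is made up; how to handle the failure is, per the rest of the thread, a separate argument):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static int open_dns_socket(void)
    {
        int fd = socket(AF_INET6, SOCK_DGRAM, 0);
        if (fd == -1) {              /* only -1 signals failure; 0 is valid */
            perror("socket");
            exit(EXIT_FAILURE);      /* or propagate the error; policy varies */
        }
        return fd;
    }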
https://github.com/robertdavidgraham/robdns/blob/master/src/...
https://github.com/robertdavidgraham/robdns/blob/master/src/...
Check errors from sendto() & recvfrom(). You don't want to loop infinitely when the fd is hosed.
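A generic sketch of what that looks like (not robdns's actual loop; the buffer size, echo behavior, and retry policy are illustrative):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Receive loop that checks recvfrom()/sendto() so a dead fd breaks out
     * of the loop instead of spinning forever. */
    static void serve(int fd)
    {
        char buf[4096];
        struct sockaddr_storage peer;

        for (;;) {
            socklen_t peerlen = sizeof peer;
            ssize_t n = recvfrom(fd, buf, sizeof buf, 0,
                                 (struct sockaddr *)&peer, &peerlen);
            if (n == -1) {
                if (errno == EINTR)
                    continue;            /* transient, retry */
                perror("recvfrom");
                break;                   /* fd is hosed, stop looping */
            }
            if (sendto(fd, buf, (size_t)n, 0,
                       (struct sockaddr *)&peer, peerlen) == -1)
                perror("sendto");        /* log it and keep serving */
        }
    }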
https://github.com/robertdavidgraham/robdns/blob/master/src/...
Check return value of your mallocs.
https://github.com/robertdavidgraham/robdns/blob/master/src/...
Check return value of fd.