The zero fd is valid but will never happen in this code.
Checking the return value of malloc is a bad idea (just rig the program to explode when any malloc fails). But casting the return value of malloc is incorrect, and that code should be using calloc.
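For what it's worth, here's a minimal sketch of the two idioms being recommended (the struct name and helper are made up for illustration): in C the void * returned by malloc/calloc converts without a cast, and calloc both zeroes the memory and checks the count-times-size multiplication for overflow.

    /* Sketch only; "struct entry" and make_entries() are illustrative. */
    #include <stdlib.h>

    struct entry { int key; char *value; };

    struct entry *make_entries(size_t n)
    {
        /* Avoid: (struct entry *)malloc(n * sizeof(struct entry));
         * the cast is unneeded in C and the multiplication can overflow. */
        struct entry *e = calloc(n, sizeof(*e));  /* zeroed, overflow-checked */
        return e;  /* NULL handling is the caller's policy, per this thread */
    }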
How do you know? What if someone forks this program and closes fd 0 before exec*()? You have to abide by the POSIX standard and what the man page tells you. It may never happen now, but the code can change later and the case can hit, and by then you'd most probably have forgotten about this check.
Malloc failures don't always have to result in aborting the program. Cases can vary.
My suggestion to you is to abide by the man page and always check error conditions. Don't overthink the failure cases of stdlib & POSIX calls. You never know the OS/environment your program will run in, and the only common ground you have is the standard.
You don't know. It's true that the line of code he cited was incorrect. It's just not a very interesting example of incorrect code.
Malloc failures don't always result in aborting. The common alternative, especially in programs that have careful malloc return value checking regimes, is to occasionally cough up remote code execution.
Userland systems programmers should assume the conservative default of ending the program immediately when malloc fails.
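In practice that default is usually a one-place wrapper rather than per-call checks; a minimal sketch (the name xmalloc is just the conventional one, not anything from the code under discussion):

    #include <stdio.h>
    #include <stdlib.h>

    /* Die-on-allocation-failure wrapper: one check, in one place. */
    void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL && size != 0) {
            fprintf(stderr, "out of memory allocating %zu bytes\n", size);
            abort();   /* fail fast rather than limp on with a NULL pointer */
        }
        return p;
    }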
A similar logic guided C++ into throwing bad_alloc instead of returning NULL on allocation failures. And a survey of modern C++ code will show you that most C++ programs simply allow themselves to terminate when bad_alloc happens.
It doesn't have to be interesting. Incorrect code covers a spectrum and is not black or white. The term "edge-case" describes it pretty well.
Regarding mallocs, again you don't have to overthink error-handling. Just handle it and figure out how to deal with it if it happens.
C++ has nothrow and try/catch. Whether you catch a failed new or let it abort is up to the user. One thing for sure is that C++ terminates on an unhandled failed new instead of stomping on unallocated memory.
On the other hand, consistency is important. Some parts of the code handled the fd checks and malloc failures correctly and other parts did not.
I recommend you applaud good programming practices and criticize bad ones. Defending or arguing for bad programming practices (whether they are edge cases or not) doesn't help.
He clearly has never heard of mallopt (http://man7.org/linux/man-pages/man3/mallopt.3.html), so he doesn't know that instead of wrapping every single malloc with an if statement you can just tell malloc to abort if it ever fails to allocate memory.
In his defense, most programs don't test the return of malloc and don't set abort-on-failure, so most programmers would just assume that if you're not checking your return values, you made a mistake.
If a daemon is not capable of surviving memory exhaustion, it is not suitable for deployment.
If a programmer is not capable of writing software that can survive memory exhaustion, they are not suitable for writing daemons.
I'll not argue the point further; the HN lynch mob is already beating at my door for daring to disagree with a site darling. Had I looked at your username before replying, I probably would have forborne from violating the bubble.
What's a single example of a daemon with more than 10,000 lines of C code that will reliably "survive" memory exhaustion? By "survive" we obviously both mean that it retains its original PID. Of course, most Unix serverside code "survives" by simply aborting and allowing itself to be restarted. Since that's what I'm recommending, that strategy doesn't count.
To find one, you're going to have to catalog every malloc() in the entire program and record some kind of recovery regime --- maybe degraded performance, maybe dropped requests, maybe fallback to some kind of pool --- for every allocation.
An in-memory message queue server or a memcached variant doesn't need to crash if it can't serve a particular request that requires more memory than is available.
I am not sure why error-handling is a difficult concept for you.
Yes, it's suboptimal to have a server that receives a request with a 32-bit length field indicating 3 gigs of incoming data and then bombs out trying to malloc 3 gigs.
I'm saying: what's an example of a server that reliably, for every allocation of every piece of metadata, every strdup, every hash table entry, every connection object, &c &c, has a recovery regime so that it doesn't have to fail ever when malloc does?
One way you could find such a program would be to compile a candidate and preload a malloc that randomly (1 in every 100 calls per allocation size, for instance) returns NULL. See if the program (a) continues to run and (b) passes some simple unit test suite.
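A rough sketch of that kind of preload shim, assuming glibc (it calls the internal __libc_malloc to sidestep interposition recursion) and a flat 1-in-100 failure rate rather than the per-allocation-size bucketing suggested above:

    /* failmalloc.c -- fault-injecting malloc for the experiment above.
     * Build: gcc -shared -fPIC -o failmalloc.so failmalloc.c
     * Run:   LD_PRELOAD=./failmalloc.so ./candidate-daemon
     */
    #include <stddef.h>
    #include <stdlib.h>

    extern void *__libc_malloc(size_t size);   /* glibc's real malloc */

    void *malloc(size_t size)
    {
        if (rand() % 100 == 0)      /* roughly 1 in 100 calls fails */
            return NULL;
        return __libc_malloc(size);
    }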
My contention is:
(a) Those programs will be hard to find, because most "production ready" Unix code does not have that property, and
(b) The criteria I'm talking about matters a lot, because (1) memory exhaustion strikes at totally arbitrary points in a program's execution, not just at the points where you're prepared to handle it, and (2) attackers can pinpoint exactly the allocation they want to have fail.
As an example of an approach that doesn't seem to work well: in "memqueue", the function that creates HTTP headers in responses allocates an array of iovecs. If that malloc fails, the header creation function returns -1. The function that calls the header-creation function, http_respond, catches that error and itself returns -1. Nothing ever checks the error return from http_respond.
(Also, the loop in which memqueue reads entire requests into memory by continuously realloc()'ing a receive buffer has an integer wrap bug in it, though it's probably not triggerable. But incorrect is incorrect, right?)
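(For readers unfamiliar with the pattern, here's a generic illustration of what an integer wrap in a grow-the-buffer loop looks like and the check that prevents it; this is not memqueue's actual code, and append() is a made-up helper.)

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Grow-and-append helper; the overflow check is the part that is usually
     * missing. Without it, *total + chunk can wrap to a small value, realloc
     * shrinks the buffer, and the memcpy writes past the end of it. */
    int append(char **buf, size_t *total, const char *src, size_t chunk)
    {
        if (chunk > SIZE_MAX - *total)
            return -1;                      /* would wrap */
        char *p = realloc(*buf, *total + chunk);
        if (p == NULL)
            return -1;
        memcpy(p + *total, src, chunk);
        *buf = p;
        *total += chunk;
        return 0;
    }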
Also worth noting: under normal conditions on Linux, malloc will never fail--unless you set something like an rlimit on address space, mmap will happily hand you as much address space as you want, then OOM-kill you (or worse, someone else...) when you try to make it resident. So even if we could write malloc-failure-safe software, which we can't, it'd be almost impossible to end up in that condition.
You don't have to receive a request asking for 3 gigs of data. You could be close to the edge, with not enough room left to receive this particular request.
There are many daemons that require you to keep what you have in memory and just fail the particular action you are doing instead of throwing everything away. DB servers, for example.
Memory exhaustion is an error that should be handled like other errors.
In memqueue, http_respond logs the failed memory allocation (in http_cli_resp_hdr_create) and returns -1. There's nothing else I need to do in this case. The connection will get dropped without a response.
When reallocing, I don't see the integer wrap bug. Can you point me to the line?
What I am suggesting is to not overthink system error handling. Just handle it; aborting is one type of handling but not always what you want. Programs run in various environments and to guarantee a defined behavior we need to abide by the standard.
Then you've misunderstood me, or I've miscommunicated. My argument is that the default handling strategy should be to abort. I'm not saying that special case handling is evil. I'm saying that defaulting to manually checking malloc's return value is evil.
Also, your chunked encoding decoder seems to be using a signed strtol() routine to read an unsigned length variable. I could be misreading; I didn't look carefully.
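To illustrate the concern generically (this is not the decoder's actual code; parse_chunk_len() is made up): strtoul() is the unsigned counterpart, but even it happily parses a leading minus sign and wraps it, so a defensive parser has to reject that too.

    #include <errno.h>
    #include <stdlib.h>

    /* Parse a hex chunk-size token into a size_t; 0 on success, -1 on bad
     * input. Whitespace handling here is an assumption about the caller. */
    static int parse_chunk_len(const char *s, size_t *out)
    {
        while (*s == ' ' || *s == '\t')
            s++;                   /* skip leading whitespace ourselves... */
        if (*s == '-')
            return -1;             /* ...so this catches "-1", which strtoul
                                      would silently wrap to a huge value */
        errno = 0;
        char *end;
        unsigned long v = strtoul(s, &end, 16);
        if (end == s || errno == ERANGE)
            return -1;             /* no digits, or overflow */
        *out = (size_t)v;
        return 0;
    }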
It depends on the goal in the end. As long as it's an explicit decision and you're not relying on the environment, the behavior can be expected to be defined.
For instance, I worked on an enterprise proxy where aborting on asserts wasn't acceptable. Why? Because the customer didn't want to interrupt his users even though in our opinion the proxy state was out of whack. This created a nightmare for us because it was hard to debug. We ended up fork()-ing and aborting on the side to debug the cores.
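For the curious, a minimal sketch of that fork-and-abort-on-the-side trick (soft_assert is an illustrative name, not the proxy's actual macro): the child dumps core for post-mortem debugging while the parent keeps serving.

    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* On a failed check, fork a child that aborts (and dumps core) while the
     * parent carries on; the core can be inspected later without downtime. */
    #define soft_assert(cond)                                          \
        do {                                                           \
            if (!(cond)) {                                             \
                pid_t child_ = fork();                                 \
                if (child_ == 0)                                       \
                    abort();                  /* child dumps core */   \
                else if (child_ > 0)                                   \
                    waitpid(child_, NULL, 0); /* reap the child */     \
            }                                                          \
        } while (0)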
"The key feature is a built-in custom TCP/IP stack capable of handling millions of DNS queries-per-second per CPU core."
DNS uses UDP primarily. I suspect the author meant "UDP/TCP/IP" by "TCP/IP".
Years ago, I wrote a basic TCP stack for a honeypot research project. It is hard and incredibly complex. So this statement raises a number of concerns, and such a stack will need to be audited before being used in production.
TCP/IP is the common name for the Internet Protocol Suite which UDP is part of.
Also, since many mobile clients tend to issue requests over TCP (it being more reliable on mobile), since many responses can now be larger than 512 bytes, and since EDNS has not been widely adopted, DNS over TCP will probably overtake UDP quite quickly.
The text below is from the RFC, and no, it does not relate to zone transfers but to normal DNS queries :)
All general-purpose DNS implementations MUST support both UDP and TCP
transport.
o Authoritative server implementations MUST support TCP so that they
do not limit the size of responses to what fits in a single UDP
packet.
o Recursive server (or forwarder) implementations MUST support TCP
so that they do not prevent large responses from a TCP-capable
server from reaching its TCP-capable clients.
o Stub resolver implementations (e.g., an operating system's DNS
resolution library) MUST support TCP since to do otherwise would
limit their interoperability with their own clients and with
upstream servers.
Stub resolver implementations MAY omit support for TCP when
specifically designed for deployment in restricted environments where
truncation can never occur or where truncated DNS responses are
acceptable.
And for the most important part :P
Regarding the choice of when to use UDP or TCP, Section 6.1.3.2 of
RFC 1123 also says:
... a DNS resolver or server that is sending a non-zone-transfer
query MUST send a UDP query first.
That requirement is hereby relaxed. A resolver SHOULD send a UDP
query first, but MAY elect to send a TCP query instead if it has good
reason to expect the response would be truncated if it were sent over
UDP (with or without EDNS0) or for other operational reasons, in
particular, if it already has an open TCP connection to the server.
Good call-out. Maybe the parent is just loosely referring to larger TXT records (or maybe DNSSEC or IPv6?), etc. My guess is that even these are probably a relatively small percentage of overall DNS traffic, and that's not likely to change anytime soon.
So I can't imagine why records would all of a sudden exceed 512 bytes on average either.
UDP isn't as reliable on mobile connections, so many mobile clients issue a TCP DNS request alongside the UDP request rather than waiting out the usual 5-second timeout.
DNS records also seem to be growing, and EDNS has not been adopted very well.
Other things like Chrome's async DNS prefetch also seem to use TCP as much as UDP for some reason, especially to Google's DNS servers.
The updated RFC mandates TCP support for regular DNS, and although I can't point to a single reason (other than IPv6 records, TXT record use, and DNSSEC) for why, I have a strong feeling that mobile and browser optimizations are a good part of it.
When your browser does DNS prefetch from the DOM, it becomes much more efficient to open a single TCP connection and issue all of the DNS requests over it (and with CDNs, ads, captchas, social media, and third-party content you can easily get to 20+ distinct DNS records per page) than to issue individual async DNS queries over UDP.
This will be both faster and, more importantly, more reliable for the next step, the TCP preconnect, once the browser has resolved all the DNS records from the DOM even before loading it fully.
When your browser does DNS prefetch from the DOM, it becomes much more efficient to open a single TCP connection and issue all of the DNS requests over it (and with CDNs, ads, captchas, social media, and third-party content you can easily get to 20+ distinct DNS records per page) than to issue individual async DNS queries over UDP.
How are there any fewer RTTs with TCP DNS than there would be with UDP? I'm not seeing the efficiency here.
There is a per-packet cost to processing requests. TCP can bundle them; UDP cannot. Is it large enough to matter? I don't know. But the cost includes evaluating context of the requesting entity, which might not be meaningful for DNS queries. Maybe if they are related, some working-set caching would occur.
RTT might not improve at all, but lag might. Scripts often make the mistake of asking for information when they need it instead of before they need it, so it will be ready when needed. The suggested approach would pre-load the DNS info and might reduce lag.
TCP sends successive data elements together, at least as part of the Nagle algorithm. I have no idea if the way DNS uses TCP can trigger Nagle.
In fact my own UDP protocol does something similar to Nagle as well. There's no good reason UDP protocols can't pick and choose what features they include. But most don't.
I (a) don't think TCP DNS routinely stuffs two requests into a single TCP segment and (b) don't believe TCP DNS is ever more performant than UDP DNS, including on mobile --- UDP gets a head start from not having a three-way handshake, and doesn't have rate limiting, which TCP does. TCP headers are also much larger than UDP headers.
Even the reliability argument doesn't make sense. Yes, TCP is "reliable". But so is UDP DNS, and in exactly the same way: if a request or response is dropped, it's retransmitted.
Nagle, for what it's worth, is an HN contributor. You could just ask him. :)
Agreed on all counts. It's a stretch to imagine TCP is better at performance.
{edit} Though performance isn't really about wire time or packet size - it's about CPU time on either end plus buffering, including router time, since that's a CPU in the path.
I wrote a basic TCP stack for a honeypot research project. It is hard and incredibly complex.
Yes and no. Most of the complication comes from extra functionality (segmentation offload, checksum offload, SACK) or from functionality which is required by the standard but not relevant for a DNS resolver (congestion control, window management, TCP timers).
If all you're doing is accepting a TCP connection, reading a small request, and writing a small response back, you can remove about 90% of the code from your TCP stack.
I don't write TCP stacks, but Juho Snellman writes one for a living, and I found the following anecdote of his on writing an interoperable TCP stack interesting.
I'm not the OP, but I think it's fair to call it complex, and I'd pick three requirements out in particular.
1. Path reachability, MTU discovery and MSS interaction
When sending outbound packets, you have to correlate incoming ICMP error messages in case they signal a problem. If the problem is that the packet is too big, you have to figure out what the MTU really is (which can take repeated attempts), so that you know what MSS to use (for TCP, or fragmentation boundary for UDP). If the path is unreachable, you have to remember that too. In both cases, you need some kind of global book-keeping so that you can do the right thing across connections. Some protocols (like active FTP) implicitly rely on MTU discovery on one connection signaling the MSS for another connection, so everything has to be path based, rather than connection based. Messy.
2. State management for error correlation
O.k., so you've figured out how to fragment an outgoing datagram and know what boundary to use, but how do you handle incoming error messages related to the fragments? Even for UDP, or other "stateless" protocols you actually do have to keep state so that you can correlate those error messages to the packets you sent. When the error message comes back, it will have the IP ID of the fragment, but nothing else is guaranteed.
This goes for (1.) too, but ICMP error messages can also be recursive and nested, and for a correct implementation you need to consider how to handle ICMP error messages that were themselves triggered by ICMP error messages. Several userspace stacks get this wrong, and can't correctly handle MTU discovery for UDP, or double-error correlation.
3. Heuristic and inconsistent caps on state
Many TCP implementations support selective acknowledgements and duplicate-ACK signalling, but what are their tolerances? Just how much data can be retransmitted or handled out of order before you have to resend the whole window? There's no way to know, and if you get it wrong you can end up stalling a TCP connection for a significant delay. Unfortunately there are no simple limits, and in some cases the volumes are related to bandwidth-delay products, necessitating some kind of integral control loop.
The problem with all of these is that they only show up "sometimes" and with particular networks or TCP stacks. I've limited these to interoperability issues - but there are other tricky complexities. For example, when building a TCP stack, do you optimize for throughput and so batch reads/writes of many packets - or do you optimize for a correct RTT estimate and do things more synchronously? It's not possible to have both (at least with today's NIC interfaces); sometimes RTT is critical (e.g. an NTP implementation, a real-time control system, or just any system that needs to rapidly recover from packet loss), sometimes throughput is more important. Definitely complex.
Getting a performant TCP is certainly hard. So, for that matter, is getting congestion control right --- TCP congestion control is devilishly hard. But you don't have to do either of those things to get an interoperable TCP!
Seems like this userland TCP stack could be the basis of these types of C10M servers. Are there existing userland TCP stacks available? I'm not familiar with any.
Absolutely, but I would say that this has the potential to be a lot more secure than a traditional (full) TCP/IP stack. Most queries are UDP (one packet), and we would expect that TCP connections should only last a few packets. TCP connections that don't match a handful of patterns would be suspicious (IMHO) and should probably just be dropped.
Of course, then you find some weird version of Windows XP that this breaks :-(
> TCP connections that don't match a handful of patterns would be suspicious (IMHO) and should probably just be dropped.
I used to believe that. It is unfortunately not true. TCP packets will come in all shapes and forms, and all must be treated equally, which is what makes TCP stacks so incredibly complex to implement.
The Linux TCP stack is quite safe and fast, especially with tight integration with NIC hardware (checksum offloading and the like). I'm a bit unsure what a custom stack in userland can provide that the standard kernel stacks don't have.
The Linux TCP stack is NOT fast. My DNS server can respond to DNS queries faster than the simplest of in-kernel echo servers (like ICMP ping or UDP port 7 echo). That's with the entire DNS overhead of parsing the DNS protocol, looking up the name in a very large database (like the .com zone), doing compression, and generating a response.
The upshot is this: going through the Linux stack, a DNS server is limited to around 2 million queries/second. Using a custom user-mode stack, it can achieve 10 million queries per second.
With the advent of DNSSEC, IPv6 and EDNS0 you're more likely to see DNS responses >512 bytes, therefore falling back to TCP (with truncate bit set). Therefore it is strongly recommended you do not drop/block tcp 53 on your middleboxes , firewalls etc.
Another interesting use case for TCP in DNS is anti-DDoS. If a botnet is abusing your DNS server to flood traffic to its target, flipping the 'TC' bit forces the request to come back over TCP, exposing the spoof.
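A minimal sketch of what flipping that bit looks like on the wire, assuming the RFC 1035 header layout (the helper name is made up):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Set the TC (truncated) flag in a raw DNS message: the flags word is
     * the 16 bits at offset 2 of the header, and TC is bit 9 (mask 0x0200). */
    static void set_tc_bit(uint8_t *dns_msg)
    {
        uint16_t flags;
        memcpy(&flags, dns_msg + 2, sizeof flags);
        flags = htons(ntohs(flags) | 0x0200);
        memcpy(dns_msg + 2, &flags, sizeof flags);
    }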
EDNS0 includes a mechanism for clients/resolvers to signal that they can handle a large/fragmented UDP response. At this point about 85% of requesters can handle UDP responses of at least 4K. For the moment, DNSSEC and EDNS0 are making falling back to TCP far less common than it used to be.
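For reference, that signal rides in an EDNS0 OPT pseudo-record appended to the additional section, with the requester's UDP payload size carried in the CLASS field (RFC 6891). A minimal sketch of emitting one (the function name is made up, and the header's ARCOUNT would also need bumping):

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Write an 11-byte EDNS0 OPT RR advertising udp_payload_size (e.g. 4096)
     * into buf; returns the number of bytes written. No options in RDATA. */
    static size_t write_opt_rr(uint8_t *buf, uint16_t udp_payload_size)
    {
        size_t off = 0;
        buf[off++] = 0;                            /* empty (root) owner name */
        uint16_t type = htons(41);                 /* TYPE 41 = OPT */
        memcpy(buf + off, &type, 2); off += 2;
        uint16_t klass = htons(udp_payload_size);  /* CLASS = UDP payload size */
        memcpy(buf + off, &klass, 2); off += 2;
        memset(buf + off, 0, 4); off += 4;         /* ext-RCODE, version, flags */
        uint16_t rdlen = 0;                        /* no EDNS options */
        memcpy(buf + off, &rdlen, 2); off += 2;
        return off;                                /* 11 bytes total */
    }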
That may change, as some providers are starting to put smaller limits on their response sizes (to mitigate certain kinds of DDOS and response spoofing attacks). Of course permitting TCP 53 is required for DNS to work; as is permitting UDP fragments (which poorly configured firewalls often block too).
Sure; I'm not suggesting dropping TCP entirely, just that e.g. a 1MB request / response is not going to be legitimate, and so you can simply not implement a lot of TCP's complexity (e.g. window scaling)
Not related to this project, but Rob is the instigator of masscan (https://github.com/robertdavidgraham/masscan) as well, which claims to scan the entire IPv4 space in approx 6 minutes!!!
    fd = socket(AF_INET6, SOCK_DGRAM, 0);
    if (fd <= 0) {
This should be fd == -1. fd == 0 is valid.
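Something along these lines, as a sketch (the wrapper name is made up; how to handle the failure is, per the rest of the thread, a separate argument):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static int open_dns_socket(void)
    {
        int fd = socket(AF_INET6, SOCK_DGRAM, 0);
        if (fd == -1) {              /* only -1 signals failure; 0 is valid */
            perror("socket");
            exit(EXIT_FAILURE);      /* or propagate the error; policy varies */
        }
        return fd;
    }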
https://github.com/robertdavidgraham/robdns/blob/master/src/...
https://github.com/robertdavidgraham/robdns/blob/master/src/...
Check errors from sendto() & recvfrom(). You don't want to loop infinitely when the fd is hosed.
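A generic sketch of what that looks like (not robdns's actual loop; the buffer size, echo behavior, and retry policy are illustrative):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Receive loop that checks recvfrom()/sendto() so a dead fd breaks out
     * of the loop instead of spinning forever. */
    static void serve(int fd)
    {
        char buf[4096];
        struct sockaddr_storage peer;

        for (;;) {
            socklen_t peerlen = sizeof peer;
            ssize_t n = recvfrom(fd, buf, sizeof buf, 0,
                                 (struct sockaddr *)&peer, &peerlen);
            if (n == -1) {
                if (errno == EINTR)
                    continue;            /* transient, retry */
                perror("recvfrom");
                break;                   /* fd is hosed, stop looping */
            }
            if (sendto(fd, buf, (size_t)n, 0,
                       (struct sockaddr *)&peer, peerlen) == -1)
                perror("sendto");        /* log it and keep serving */
        }
    }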
https://github.com/robertdavidgraham/robdns/blob/master/src/...
Check return value of your mallocs.
https://github.com/robertdavidgraham/robdns/blob/master/src/...
Check return value of fd.