Bitsquatting: DNS Hijacking without exploitation (dinaburg.org)
107 points by jakub_g on Nov 18, 2012 | 35 comments


A significant number of Redis crash reports are due to memory errors (we ask users to test their memory after crashes, since real segfaults are very rare; not everybody tests, and the tests can produce false negatives, so the problem is bigger than the one we observe).

This experiment also seems to show that there is a lot of corrupted memory out there.

There is a simple fix; I wonder why it is not used:

1) Add a feature at the operating system level that, once per second, tests N memory pages picked at random. No observable performance degradation.

2) Report the problem to the user when it is found.

3) Mark the page as bad and never use it again.

So you get memory testing basically for free, users who are aware of their hardware errors, and far fewer consequences from those errors.


Guild Wars did this in user mode in the background while playing - it did background testing of the user's CPU, RAM, and GPU. Machines that failed the tests were flagged so that their crash reports got bucketed separately (saving us the time of trying to understand impossible crashes), and it popped up a message telling the user their computer was broken.

So even if the OS should be doing this for you, for long-running processes you could do it yourself in user mode. (I don't know if it's worth the effort, though.)


That's exactly my plan with Redis, and it is awesome to discover that it was used successfully in the past! But I have a problem given that I can't access memory at a lower level: when to test, and what?

Basically, I have the following possibilities:

1) Test on malloc(), with a given probability, and perhaps only up to N bytes of the allocation, for latency concerns.

2) Do the same also at free() time.

3) From time to time, allocate a new block of memory of a fixed size with malloc(), test it, and free it again.

Option 3 has the minimum overhead, but options 1+2 probably have a better chance of eventually hitting all the pages...

I don't have a broken memory module to test the different strategies with; I wonder if there is a kernel module that simulates broken memory.

Note that Redis can already test the computer memory with 'redis-server --test-memory' but of course this requires user intervention.
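
For what it's worth, a minimal sketch of what option 3 could look like (the buffer size and test patterns below are arbitrary choices, not anything Redis actually does):

    /* Sketch of option 3: allocate a fixed-size scratch buffer, write and
     * verify a few test patterns, then free it.  The volatile pointer keeps
     * the compiler from optimizing the write/read-back loops away. */
    #include <stdlib.h>

    #define SCRATCH_SIZE (64 * 1024)   /* arbitrary */

    /* Returns 0 if the probe passed (or could not run), -1 on a mismatch. */
    int quick_memory_probe(void) {
        static const unsigned char patterns[] = { 0x00, 0xff, 0x55, 0xaa };
        volatile unsigned char *buf = malloc(SCRATCH_SIZE);
        if (buf == NULL) return 0;
        for (size_t p = 0; p < sizeof(patterns); p++) {
            for (size_t i = 0; i < SCRATCH_SIZE; i++) buf[i] = patterns[p];
            for (size_t i = 0; i < SCRATCH_SIZE; i++) {
                if (buf[i] != patterns[p]) { free((void *)buf); return -1; }
            }
        }
        free((void *)buf);
        return 0;
    }

Calling something like this from the server's existing periodic timer spreads the cost out, but a freshly malloc()'d buffer tends to land on pages the allocator already owns, which is exactly why options 1+2 probably cover more of the address space over time.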


That's really smart.


Even better, albeit more expensive, would be to start using ECC memory more frequently. It seems crazy to me that we're storing billions of bits of often important data in relatively dumb storage that can't even detect if it gets corrupted.


TL;DR: every day, hundreds of requests go to the wrong URL because of memory failures in ordinary computers. Due to a hardware error, a computer can connect to e.g. microsmft.com instead of microsoft.com. The data gathered by the researcher suggests these kinds of errors also happen in web caches etc., increasing the number of affected users.

In practice, the privacy problems coming from this are rather limited (unless you send private data in the URL), since in the majority of cases you won't be sending the original domain's cookies if the name resolved to the bitsquat domain early on. Still, it's probably a security angle that isn't thought about thoroughly on a daily basis.


There are follow-up blog articles on the author's page. This one is also interesting: http://blog.dinaburg.org/2012/10/a-preview-of-bitsquatting-p...

The original link was found at h-online.


I did this experiment by bitsquatting all the domains around cloudfront.net after hearing about it at DEF CON. It works: you basically have the opportunity to replace the JavaScript of tons of sites. I simply served 404s. What was really interesting to me was the varied places where the corruption occurs; some of the requests even have the correct Host header. Now you know why the old PC was so flaky!


I've started thinking about all those banks [1] and other pages serving like/tweet buttons on the login page.

Or pages including Google Analytics. If the described behavior really takes place, then given the massive scale of deployment of Google Analytics, StatCounter, FB buttons, and jQuery includes from CDNs, you should be able to inject arbitrary JS for a non-trivial (though essentially random) set of users.

[1] http://my.opera.com/hallvors/blog/2012/05/11/social-media-ba...


It's pretty amazing to see so many bit errors make it through DNS resolution without the client machines crashing instantly; imagine if the bit errors were introduced in RAM containing code or data structures with memory pointers, instead of a domain name!

Thinking about it, I guess such crashes actually happen in much larger numbers.


Everyone who's worked with computers for any length of time has seen inexplicable, unreproducible crashes, lockups or reboots.

We're just conditioned to ignore them if they're neither frequent nor reproducible.



Hi,

Author of the article here. Was happily surprised to find it linked on the front page of HN.

I can try to answer any questions that you may have.

I'm also in the midst of writing another blog post, this time about the bit-error distribution in the DNS query type field. Spoiler: it's not uniform.



The part about the flips mostly being single-bit errors reminded me of [1]; I wonder if we're seeing the same cause in two different ways?

[1] http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-so...


HTTP 1.1 includes a header field called Host.

Technically, this statement is correct, but don't overlook the fact that while the Host header is optional in HTTP 1.0, most HTTP 1.0 clients will include it out of necessity. These days it's nearly impossible to guarantee you'll get the correct resource without a Host header.
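
For illustration, here is what such a request looks like (the hostname and path are placeholders); the Host line is the only addition a 1.0 client needs:

    /* Sketch: a minimal HTTP/1.0 request with the optional Host header
     * included.  Hostname and path are placeholders. */
    #include <stdio.h>

    int main(void) {
        const char *request =
            "GET /index.html HTTP/1.0\r\n"
            "Host: www.example.com\r\n"      /* optional in 1.0, mandatory in 1.1 */
            "\r\n";
        fputs(request, stdout);              /* a real client writes this to its socket */
        return 0;
    }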


Are there HTTP 1.0 clients still in use? I can't imagine why a maintained program wouldn't be using 1.1.


Very fascinating... if a very narrow scenario with a very low probability resulted in this, how is it that these errors are not apparent during other computer activities?


They probably are, but people are used to bluescreens and unexpected app crashes and think nothing of it.


I dunno. Something about this analysis bothers me. The basic premise is that a string the length of a domain name routinely, albeit rarely, gets corrupted by one bit, causing errant DNS lookups. Then it's reasonable to assume that a longer string is even more likely to contain corruption. But if that's so, why do I almost never see any evidence of bit corruption in my web server logs? Surely the same corruption would affect other parts of the URL, and the probability should be greater due to the length. But I can't find a single example in my logs that can't be explained by human error (typos by users or developers). If bit corruption is so overwhelmingly prevalent in hostnames, but not URLs or other identifiers, I suspect it's due to a software bug somewhere.


Presumably your web server doesn't serve quite as much traffic as fbcdn.net. The odds of such a bitflip happening are vanishingly low, so you need an incredibly large amount of traffic before you'll see such errors occurring.
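
A back-of-the-envelope calculation with made-up numbers shows the scale involved (both the per-bit flip probability and the request count below are purely hypothetical, just to illustrate why a single server's logs show nothing while a huge CDN sees a steady trickle):

    /* Back-of-envelope: probability that an n-bit hostname gets corrupted,
     * and the expected number of corrupted requests at CDN-scale traffic.
     * Both the per-bit probability and the request count are made up. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double p        = 1e-13;      /* hypothetical per-bit flip probability */
        double bits     = 12 * 8;     /* a 12-character hostname */
        double requests = 1e12;       /* hypothetical CDN traffic over months */
        double per_req  = 1.0 - pow(1.0 - p, bits);
        printf("P(corrupted hostname per request) = %.3g\n", per_req);
        printf("Expected corrupted requests       = %.3g\n", per_req * requests);
        return 0;
    }

With those numbers a single request is corrupted roughly once in 10^11 tries, which is invisible in an ordinary server's logs yet still yields a handful of hits across 10^12 requests; longer strings like full URLs just scale the per-request probability linearly.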


In my understanding of network communication and data transmission, this should be impossible. All payload data and encapsulated header data etc. are subject to checksums, hashes, variable encoding schemes on the wire, parity balancing, redundant bit insertion (Hamming codes), and so on, the result of which should always signal an error's presence. Even if the bit flip occurs in primary memory, surely the OS's memory management subsystems would detect the corruption.

So for a bit flip to go undetected and unremedied before an errant DNS lookup executes seems odd, although I could be wrong (I'm just a final-year CS student).

EDIT: Just watched the video, originally classified it as TLDW, seems plausible.


Note that no error detection code is able to detect all errors; it just lowers the probability of an error passing undetected even further. (CRCs are pretty robust in that they always detect sequences of errors with length <= N, with N depending on the particular algorithm.) With a large enough sample size, you will hit errors.

In this case it is probably memory corruption. The OS won't be able to detect such a thing unless the memory has ECC (relatively uncommon these days). It could theoretically detect it if the memory pages were checksummed and periodically verified against the checksum, but afaik no OS does so.
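
One way to see why the in-transit checks don't help: the flip happens in RAM before the packet is built, so the checksum gets computed over the already-corrupted bytes and verifies cleanly at the other end. A sketch with an RFC 1071-style 16-bit ones'-complement sum (the flipped name reuses the microsmft.com example from the summary above):

    /* The Internet checksum (RFC 1071 style) computed over a name that was
     * corrupted in memory *before* checksumming: the receiver sees a
     * perfectly consistent packet and has nothing to reject. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint16_t inet_checksum(const uint8_t *data, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i + 1 < len; i += 2)
            sum += ((uint32_t)data[i] << 8) | data[i + 1];
        if (len & 1) sum += (uint32_t)data[len - 1] << 8;
        while (sum >> 16) sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void) {
        uint8_t name[16];
        memcpy(name, "microsoft.com", 13);
        name[6] ^= 0x02;    /* single-bit flip in RAM: 'o' -> 'm', "microsmft.com" */
        printf("corrupted name: %.13s\n", name);
        printf("checksum: 0x%04x (computed after the flip, so it verifies fine)\n",
               inet_checksum(name, 13));
        return 0;
    }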


On consumer hardware, there is typically no mechanism enabled to detect errors in RAM.


I wonder if the distribution of hits to these domains follows the popularity of the underlying "correct" domain, which is what you'd expect if random bit errors were causing these hits and would help corroborate his claim.


I'm wondering if the string length or alignment could affect it, too. For example, if memory is allocated in 16-byte chunks, longer domain names might have a larger chance of having a bit flipped in the active part of the string (instead of the padding). Just speculating wildly here.


There's no padding in DNS requests (having written my own DNS decoding routines). There's also very little that can change in a DNS packet that won't cause an error---basically, the only things that can change without causing a DNS decoding error are text-related fields (say, the payload of a TXT or SPF record type), and even then, given the restrictions on character sets in DNS host names (and the crazy compression scheme used for domain names), it's actually surprising to see bit-6 errors, as that bit should cause more invalid domain names than not.

Edited because I thought bit-6 errors would flip letter case (upper to lower, lower to upper), when it's bit 5 that does that.
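
For reference, in ASCII it's bit 5 (value 0x20) that toggles case, while flipping bit 6 (0x40) usually lands on a character that isn't valid in a hostname at all:

    /* ASCII case is controlled by bit 5 (0x20); bit 6 (0x40) jumps much
     * further through the character set. */
    #include <stdio.h>

    int main(void) {
        printf("'a' ^ 0x20 = '%c'\n", 'a' ^ 0x20);   /* 'A': just a case flip */
        printf("'a' ^ 0x40 = '%c'\n", 'a' ^ 0x40);   /* '!': invalid in a hostname */
        return 0;
    }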


I wasn't thinking about padding in the DNS packets as much as I was thinking about padding in string routines, for example if you malloc a block to hold "www.example.com", and then pass that string onto whatever resolver library you use.


I'm a little skeptical of this whole thing. Obviously, one other thing about bit errors in DNS packets is that they need to not break label compression.


Do you mean the whole article, or just the parent post? Because we're probably talking about bit errors while the domain name is still just a string in application memory, not when it's being assembled to a DNS packet or in transit. So I don't see how label compression comes into play.


To test for this, probably the simplest thing to do is to also register typos that result from more complex bit-flipping patterns (which are unlikely to result from memory errors) and see whether the number of requests is comparable or not.
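
As a building block for that kind of experiment, here is a rough sketch (example.com is a placeholder) that enumerates single-bit flips of a name and keeps only the candidates made of valid hostname characters; applying the same idea twice would generate the multi-bit "control" names that memory errors are unlikely to produce:

    /* Sketch: enumerate every single-bit flip of a hostname and print the
     * candidates that still consist of valid hostname characters.  Case-only
     * flips are skipped because DNS is case-insensitive. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int valid_host_char(int c) {
        return isalnum(c) || c == '-' || c == '.';
    }

    int main(void) {
        const char *orig = "example.com";   /* placeholder */
        char buf[64];
        for (size_t i = 0; i < strlen(orig); i++) {
            for (int bit = 0; bit < 8; bit++) {
                int flipped = (unsigned char)orig[i] ^ (1 << bit);
                if (!valid_host_char(flipped)) continue;
                if (tolower(flipped) == tolower((unsigned char)orig[i])) continue;
                strcpy(buf, orig);
                buf[i] = (char)flipped;
                printf("%s\n", buf);
            }
        }
        return 0;
    }

A real experiment would also discard flips that hit the dot before the TLD, since those don't produce registrable names under the same TLD.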


This reminds me of this (slightly old) article, wherein the authors observe that about 1 in every 30,000 TCP packets fails the TCP checksum, even though the actual errors should have been caught by the link-level CRC. They go on to speculate about possible causes, which include memory corruption at the hosts.

A good read, and it seems to be an instance of the same problem.

"When The CRC and TCP Checksum Disagree" - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.7...


Are those really memory corruptions and not, say, damaged cables or bugs in network infrastructure equipment? Or, just typos in scripts (both server and client side)?


Other bit-flipping issues should usually be trapped by checksums. Typos in scripts are possible, but then you would get a massive number of requests from the same IP, which should be easy to notice.


Yet, ECC on consumer hardware ranges from exotic to non-existent.



