This is excellent. I never thought about it that way. Indeed, with syn cookies the receiving party has no idea what the original sequence number was, so it will treat _any_ delivered packet as the first.
Couple of caveats:
- you can force syn cookies on unconditionally with the tcp_syncookies=2 sysctl (see the sketch after this list)
- syn cookies are generally bad because they prevent negotiating window scaling. Window scaling is important unless you are doing something low-bandwidth like telnet :)
- you can somewhat negotiate window scaling when TCP timestamps are enabled. But enabling TCP timestamps in the general case brings little benefit and wastes 12 bytes of each packet for basically no gain.
- for bonus points, consider what happens when both syn cookies and TCP_DEFER_ACCEPT are enabled.
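As promised in the first bullet, a minimal sketch of flipping that switch from C (equivalent to running "sysctl -w net.ipv4.tcp_syncookies=2"; normally you'd just use the sysctl command):

    #include <stdio.h>

    /* 0 = syn cookies off, 1 = only when the SYN queue overflows,
     * 2 = always on. */
    int main(void) {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_syncookies", "w");
        if (!f) { perror("fopen"); return 1; }
        fputs("2\n", f);
        return fclose(f) ? 1 : 0;
    }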
> - you can somewhat negotiate window scaling when TCP timestamps are enabled.
Reading the RFC, couldn't the requested window scale be jammed into the sequence number? Section 2.3 limits the window scale shift to at most 14, which fits in 4 bits. This article suggests that the MSS category fits into 2 bits, so we take up 6 of the 32 sequence-number bits for the mandatory additional data.
The syn cookie would then be (MSScat | WSCALE | HASH(key | saddr | daddr | sport | dport | sseq | MSScat | WSCALE)), where | denotes concatenation. The hash could then be 26 bits, leaving a very small forgery probability of 2^-26.
To re-address the problem of replay attacks, the server could rotate keys rather than include a timestamp in the data component. The cookie would need to verify under one of the (small) K most recent keys. If K were 2, then any ACK younger than COOKIE_AGE would verify, ACKs up to twice COOKIE_AGE might verify, and older ACKs would be rejected. The forgery probability would marginally increase to 2^-25.
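A rough sketch of that layout in C (mac26() is a hypothetical 26-bit keyed MAC, e.g. a truncated SipHash; the caller passes the K most recent keys):

    #include <stdint.h>

    #define HASH_BITS 26
    #define HASH_MASK ((1u << HASH_BITS) - 1)

    /* Hypothetical 26-bit keyed MAC over the connection identifiers. */
    uint32_t mac26(uint64_t key, uint32_t saddr, uint32_t daddr,
                   uint16_t sport, uint16_t dport, uint32_t sseq,
                   uint32_t msscat, uint32_t wscale);

    /* Cookie layout: [2-bit MSScat][4-bit WSCALE][26-bit MAC]. */
    uint32_t make_cookie(uint64_t key, uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, uint32_t sseq,
                         uint32_t msscat, uint32_t wscale) {
        uint32_t h = mac26(key, saddr, daddr, sport, dport, sseq, msscat, wscale);
        return (msscat << 30) | (wscale << 26) | (h & HASH_MASK);
    }

    /* Verify against the K most recent keys (K = 2 above). */
    int check_cookie(const uint64_t *keys, int nkeys, uint32_t cookie,
                     uint32_t saddr, uint32_t daddr, uint16_t sport,
                     uint16_t dport, uint32_t sseq) {
        uint32_t msscat = cookie >> 30;
        uint32_t wscale = (cookie >> 26) & 0xF;
        for (int i = 0; i < nkeys; i++)
            if ((mac26(keys[i], saddr, daddr, sport, dport, sseq,
                       msscat, wscale) & HASH_MASK) == (cookie & HASH_MASK))
                return 1;  /* verified under key i */
        return 0;
    }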
But, I'm no TCP expert. I'm sure there's something misguided or wrong in the above?
FreeBSD uses a pretty similar approach to what you've described:
MSS and Window Scale are each turned into 3-bit table entries, SACK gets one bit, and one bit indicates which MAC secret was used; secrets are rotated every 15 seconds and two are kept, so the client's ACK needs to arrive within 15-30 seconds, depending on where in the rotation cycle you are; 24 bits are used for the MAC. A 15-second round trip covers the vast majority of internet connections.
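As a sketch, those fields pack into 32 bits like this (the field order here is my guess, not necessarily FreeBSD's; the real code is in sys/netinet/tcp_syncache.c):

    #include <stdint.h>

    /* 24 + 3 + 3 + 1 + 1 = 32 bits */
    struct syncookie_bits {
        uint32_t mac        : 24;  /* truncated keyed MAC */
        uint32_t mss_idx    : 3;   /* index into a table of MSS values */
        uint32_t wscale_idx : 3;   /* index into a table of window scales */
        uint32_t sack_ok    : 1;   /* SACK permitted */
        uint32_t key_parity : 1;   /* which of the two rotating secrets */
    };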
Perfect, that's exactly the kind of thing I was looking for. Although I'm a little confused by the comment's description of the birthday attack -- it doesn't immediately seem useful to me for an attacker to find separate connections that generate the same cookie.
I'm not a cryptographer, but as I understand it, the birthday attack says that if you have a valid hash for one thing, you can generate more things and get a collision in a surprisingly small number of iterations.
In this context, it's easy to get a valid hash: when the system is in syncookie mode, send a SYN from an address where you have visibility of the SYN+ACK responses.
Then, you could use that cookie (sequence number) to spoof ACK packets with other sources, and they've estimated the number of packets you need to spoof before you'll probably have generated a connection. That number is significantly smaller than when the syncache has not overflowed recently, where you'd need to have sent a SYN and to match the sequence number exactly.
> But enabling TCP timestamps in the general case brings little benefit and wastes 12 bytes of each packet for basically no gain.
I disagree; TCP timestamps are awesome. Linux enables them by default.
A quick search gives me some measurements from 2012 [1] that indicate that TCP timestamps are enabled on 83% of the top 100k web hosts.
You can afford to waste 12 bytes; the bottleneck isn't these 12 bytes but how well you get congestion control to work. And congestion control relies on getting an accurate estimate of the round-trip time.
Edit: Also, just because 83% of web hosts have it enabled does not imply that it is a good idea in general. They could all just be running the Linux defaults, and those defaults could simply be wrong.
You can measure RTT continuously, accurately, and even in the presence of packet loss just with selective acks and a little bit of extra bookkeeping in the sender.
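A sketch of that sender-side bookkeeping (structure and function names are mine):

    #include <stdbool.h>
    #include <stdint.h>

    /* One record per in-flight segment on the sender. */
    typedef struct {
        uint32_t seq;            /* starting sequence number */
        double   sent_at;        /* time of first transmission */
        bool     retransmitted;  /* set when the segment is resent */
    } seg_record;

    /* Called when an ACK or SACK block newly covers segment s.
     * Returns an RTT sample, or a negative value when the sample must
     * be discarded (Karn's rule: an ACK for a retransmitted segment is
     * ambiguous, since we can't tell which transmission it matches). */
    double rtt_sample(const seg_record *s, double now) {
        if (s->retransmitted)
            return -1.0;
        return now - s->sent_at;
    }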
It's hard to overstate how expensive TCP timestamps are. The thing is that they bloat every single packet, including control packets. 2% of the world's bandwidth is being wasted on this.
The only reason for anyone to implement TCP timestamps today is that iOS clients have horrible receive window scaling if timestamps are disabled. (Well, that was the only reason a few years ago when I was still in the game of keeping up with the quirks of different TCP stacks.)
I would guess that Apple could be convinced to fix this, if someone has the right contact. The iOS TCP stack shows a lot of care, generally: they do path MTU probing well, and they've deployed MP-TCP (requires apps to enable it though), among other things I can't remember. Fixing performance if timestamps are disabled seems like something they'd do.
Small correction: for good congestion control you don't need the RTT, you really need the forward transmission time. RTT is easier to get, but it is very noisy and late; by the time you get it, it's obsolete. The way to get good, clean flow control is to measure the forward transmission time and have the receiver keep a predictor on the sender updated. Increases in the forward transmission time signal changes in aggregate queue length. Your measurement updates are always about the past, so you need to apply control theory to use them correctly.
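A receiver-side sketch of that idea, assuming the sender stamps each packet with its local send time; the unknown clock offset between the hosts cancels out because only changes in the one-way delay matter:

    /* Tracks the smallest (clock offset + path delay) seen so far;
     * growth above that baseline signals queues building up on the
     * forward path.  The receiver feeds the result back to the sender. */
    typedef struct {
        double base_delay;
        int    primed;
    } owd_estimator;

    double owd_queueing_delay(owd_estimator *e, double send_ts, double recv_ts) {
        double d = recv_ts - send_ts;  /* includes the clock offset */
        if (!e->primed || d < e->base_delay) {
            e->base_delay = d;
            e->primed = 1;
        }
        return d - e->base_delay;      /* estimated queueing delay */
    }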
> Also, just because 83% of web hosts having it enabled does not imply that it is a good idea to do so in general.
Kudos on having the integrity to point this out even when it superficially weakens your argument.
Also, a more relevant percentage would be: if you have 64 KB packets (which you want, to get the best throughput), 12 bytes is a 0.018% overhead, less than one five-thousandth of your packet.
Edit: and apparently it's 0.8% with the default MSS of 1460B. Ugh.
0.8% is still an underestimate. The average size of a TCP packet in real networks is about 500 bytes, due to the ACKs. And timestamps need to be present in the ACKs too.
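Putting the three per-packet overhead figures from this subthread side by side:

    12 / 65536 ≈ 0.018%   (64 KB segments)
    12 / 1460  ≈ 0.8%     (full-size segments at the default MSS)
    12 / 500   ≈ 2.4%     (~500 B average packet once ACKs are counted)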
I love write-ups on these types of issues. It really is helpful when experts share the process they used to understand and debug a problem. Thank you, OP.
As suggested by majke, CloudFlare's blog is another great source of material regarding protocol wrangling at scale.
If anyone else has similar blogs, bookmarks, etc please do share :)
'initial [sequence number]' is a term of art, not an adjective that can be substituted; it's very commonly called the ISN. Odd that the article never used the acronym.
> syn cookies are generally bad because they prevent negotiating window scaling.
this is wrong thinking. they degrade performance under "attack", yes; the alternative is instant death. "attacks" need not be from the big bad internet either. my experience is that, in the last 10 years, most synflood activations have come from buggy or poorly throttled internal clients.
> my experience is that, in the last 10 years, most synflood activations have come from buggy or poorly throttled internal clients.
From running a large messaging service I agree; most of the perceived attacks (synflood or otherwise) were actually coming from our own clients. But there was a periodic stream of strictly abusive traffic coming from who knows where; mostly UDP reflection, but a SYN flood every once in a while.
FreeBSD had a more fun SYN cookie bug in 2015 (fixed here [1], and I think this is the diff where it was introduced [2]; determining which releases it touched is left as an exercise for the reader): the sender's initial sequence number had been left out of the cookie.
I guess you would have a similar lack of dogs, but also, if a connection was opened and closed quickly, a retransmitted packet from the client would satisfy the SYN cookie calculation, and the server would re-open the connection, but at its original sequence number.
The details are a bit hazy, but the client would get an ACK with a SEQ behind where it had ACKed, and would send an ACK probe with its latest values. The server would see an ACK ahead of where it had sent, and send an ACK probe. If the hosts had low enough round-trip times, the number of ACK probes sent could be tremendous. For those unfamiliar with FreeBSD, the localhost interface on FreeBSD runs full TCP, and under high load can drop packets and retransmit. We ran into this on localhost first, but then later across the internet with external clients.
This write-up is from over two years ago (Feb 2018). I was curious if the issue has since been fixed. Turns out the functions for generating and checking the SYN cookie values have not changed since, so I guess it's safe to assume the bug has not been fixed.
This particular problem is solvable by having (client_sequence_number+1) be part of the HMAC (thus checking it as well), and by storing the MSS in the MSB part of the generated server sequence number rather than in the LSB.
Then every packet other than the first data packet will be discarded as invalid, and eventually the client-side retransmits will ensure that everything works properly.
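A sketch of that variant (mac28() is a hypothetical 28-bit keyed MAC): (client_isn + 1) is authenticated, and the MSS index sits in the top four bits, away from the low bits that sequence-number arithmetic touches:

    #include <stdint.h>

    /* Hypothetical 28-bit keyed MAC over the 4-tuple plus (client ISN + 1),
     * so a shifted client sequence number no longer verifies. */
    uint32_t mac28(uint64_t key, uint32_t saddr, uint32_t daddr,
                   uint16_t sport, uint16_t dport,
                   uint32_t client_isn_plus_1, uint32_t mss_idx);

    uint32_t server_isn(uint64_t key, uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport,
                        uint32_t client_isn, uint32_t mss_idx /* 4 bits */) {
        uint32_t h = mac28(key, saddr, daddr, sport, dport,
                           client_isn + 1, mss_idx);
        return (mss_idx << 28) | (h & 0x0FFFFFFFu);  /* MSS in the MSBs */
    }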
The problem that is impossible to solve, however, is the lost third ACK (acknowledging the SYN-ACK) from the client, if the client doesn't send any data to the server upon connecting. That's sufficiently rare in today's protocols, though.
Another problem that the above approach creates afresh is that it assumes that retransmitted client SYNs will have the same ISN, which isn't always the case in practice with e.g. some load balancers (which also try hard not to keep state). And that behavior is kind of a gray zone in the TCP spec, IIRC...
Edit: (I wonder if the last paragraph above is the real reason or I missed something else)
Edit2: oh, thanks to Majromax's mention of DJB's write-up: the above has the problem of not complying with “sequence numbers increasing slowly”, and the write-up indeed brings up a real-world scenario where that approach was an issue: the rcp/rlogin protocols, which reused a very narrow range of source ports, making 5-tuple reuse common.
> Another problem that the above approach creates afresh is that it assumes that retransmitted client SYNs will have the same ISN, which isn't always the case in practice with e.g. some load balancers (which also try hard not to keep state). And that behavior is kind of a gray zone in the TCP spec, IIRC...
I don't think this is a big problem with SYN cookies. If you get a SYN with initial sequence X, you send an appropriate SYN+ACK, and if you get a retransmitted SYN (because the other end didn't get your SYN+ACK), you send a new SYN+ACK appropriate for that one. If you then get an ACK for either, you would form a full connection, which should work fine.
I would have to review the RFCs; they might say that if you had room in your syncache to hold the data, you should send a RST to the second SYN or the first SYN, because the states are conflicting; but since you don't have the information, you don't have the information.
Anyway, unless the client end is really messed up, it shouldn't send both an ACK to the first SYN (because it received your SYN+ACK) and a new SYN (because it didn't receive your SYN+ACK). I acknowledge that there are plenty of really messed-up TCP stacks on the internet, though :)
> Anyway, unless the client end is really messed up, it shouldn't send both an ACK to the first SYN
This is a race condition; hypothetical sequence of events:
1. send SYN-0, wait for reply or timeout
2. timer interrupt fires
3. timeout to resend the SYN (as SYN-1) is ready; start running that
4. packet interrupt fires (interrupts the resend)
5. got SYN+ACK-0, construct and send ACK-0
6. iret
7. finish constructing SYN-1 and send it
8. iret again
This is clearly a bug, but it could easily work >99.99% of the time (especially if the timeout is high enough that normal RTTs never hit it, which is probably how the person setting the timeout would try to set it).
I noticed some issues a while ago with TCP SYN cookies breaking the DLM cluster setup of 64 hosts. (63 hosts all trying to join the same cluster generate enough TCP traffic that the kernel thinks it is a flood and starts sending SYN cookies, but then the DLM join of some hosts doesn't actually complete.)
This can be worked around by increasing the backlog (one of the mitigations listed in this article).
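For example, on the application side (the number is illustrative; on Linux the effective backlog is also capped by net.core.somaxconn, and the SYN queue itself by net.ipv4.tcp_max_syn_backlog, so those sysctls may need raising too):

    #include <sys/socket.h>

    /* Ask for a deeper accept queue than the typical default of 128. */
    int deepen_backlog(int listen_fd) {
        return listen(listen_fd, 4096);
    }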
Networking is fascinating. I first introduced myself to the subject by reading Michal Zalewski's Silence on the Wire [1]. I really recommend it to anyone who wants to do the same.
It's interesting to look at this from the perspective of HMAC codes: the host uses (several bits of) the sequence number as an authentication code over other aspects of the connection. The client IP and ports are elsewhere in the packet, so the sequence number also needs to carry data for the timestamp and maximum segment size.
From that security-based perspective, this bug seems to belong to a common category of data escaping the hash -- here, the (sequence number + MSS category) sum has hash collisions.
This bug really just falls in with "You didn't encrypt and authenticate your data, so anything could have happened".
Sure, this time it's a software design bug in the endpoints, but next time it might be a cosmic ray, or an evil middleman, or a buggy proxy. If data isn't encrypted and authenticated, then you shouldn't care what form it arrives in.
No, it's not that easy. It may also mean your transmission live-locks because your receiving state machine is waiting for more data that never arrives.
Yes, it needs a failure detection mechanism, because next time it might be a cosmic ray. But encryption and authentication alone doesn't help necessarily.
Nothing in this post says that the data wasn't authenticated; in fact, the symptom they saw (the client is kicked off because the server doesn't understand the message) is exactly what would happen if the data was authenticated.
The data's authenticated, but the authentication is accidentally broken. SYN(seq_num) and SYN(seq_num+3) can both generate the same cookie using a different (but still valid) maximum segment size.
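Schematically (a simplification, not the exact Linux expression), the MSS index and the client sequence number are combined additively, so they can trade off against each other:

    #include <stdint.h>

    /* h1, h2 stand for the keyed hashes over the connection 4-tuple. */
    uint32_t cookie(uint32_t h1, uint32_t h2, uint32_t seq, uint32_t mss_idx) {
        return h1 + seq + ((h2 + mss_idx) & 0xFFFFFF);
    }
    /* Ignoring carry out of the low 24 bits:
     *   cookie(h1, h2, seq, 3) == cookie(h1, h2, seq + 3, 0)  */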
To lump this into an existing category of bugs: syn cookies are a kind of HMAC, only the implementation is custom and nonstandard. It isn't a surprise that a bespoke HMAC leaks, but to the credit of the kernel developers, the initial 1996 specification pre-dates the common understanding. (But to its demerit, it looks like the DJB spec (http://cr.yp.to/syncookies.html) would not have had this issue, since the MSS was encoded in the top bits of the cookie and not the bottom bits.)
A malicious client could make sure that you get valid but different data, without anyone noticing something was up. Maybe fool a logging/firewalling middlebox.
This bug falls in with "reasons connection attempts to your server may be mysteriously failing" - this is a bug in connection setup, when you'll be setting up your secure connection.
More about SYN packet handling in Linux: https://blog.cloudflare.com/syn-packet-handling-in-the-wild/