Revoking certain certificates on March 4 (letsencrypt.org)
364 points by teddyh on March 3, 2020 | 155 comments


For context in terms of what caused this, here's the PR which I assume fixed the bug in question: https://github.com/letsencrypt/boulder/pull/4690

It looks like a nasty and subtle pass-by-reference of a for-range local variable, although I'm having trouble figuring out where the reference is stored: https://github.com/letsencrypt/boulder/blob/542cb6d2e06e756a...

I've spent plenty of time hunting down similar bizarre bugs in Go code as well, where the called function ~implicitly~ takes a pointer to the iteration variable and stores it somewhere. Each iteration of the for loop updates the stack-local in-place, and later reads of the stored reference will not read the original value. It's hard to spot from the actual call site :/

EDIT: This was an explicitly taken `&v` reference, but the same thing can also happen implicitly, if you call a `func (x *T) ...` method on the variable.
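To make the pattern concrete, here's a minimal sketch (my own illustration, not Boulder's actual code) of both the bug and the usual fix, under Go's pre-1.22 loop-variable semantics (i.e. what Go did at the time):

    package main

    import "fmt"

    type authz struct{ name string }

    func main() {
        models := []authz{{"a.example"}, {"b.example"}}

        var buggy []*string
        for _, v := range models {
            buggy = append(buggy, &v.name) // &v aliases the single loop variable
        }
        fmt.Println(*buggy[0], *buggy[1]) // prints "b.example b.example"

        var fixed []*string
        for _, v := range models {
            v := v // shadow with a per-iteration copy so &v.name stays stable
            fixed = append(fixed, &v.name)
        }
        fmt.Println(*fixed[0], *fixed[1]) // prints "a.example b.example"
    }

The `v := v` line looks silly, but it shadows the loop variable with a per-iteration copy, so the stored pointers stay distinct.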


LE just posted their own (excellent!) incident report of this on the mozilla bugtracker, including the discovery timeline, analysis of the bug, and follow-up steps: https://bugzilla.mozilla.org/show_bug.cgi?id=1619047#c1

The original bug report, which was initially diagnosed as only affecting the error messages, not the actual CAA re-checking: https://community.letsencrypt.org/t/rechecking-caa-fails-wit...

Brief discussion on revocation exemption requests: https://bugzilla.mozilla.org/show_bug.cgi?id=1619179

Tomorrow will tell if granting a revocation exemption might have been a good idea in hindsight.


People say Rust’s borrowing rules are only useful in multi-threaded environments. This is exactly the kind of issue they are supposed to solve.


I'm not sure I understand; it's a business logic bug, it would have happened in any language.


How do you figure this is a business logic bug? It looks like a pretty clear-cut implementation bug to me. Rust would 100% have caught this bug, and in fact I'm pretty sure it would have caught the bug at least two different ways:

1. The reference outlives the original value.

2. You can't have multiple mutable references at the same time.


> 2. You can't have multiple mutable references at the same time.

If I understood the issue correctly, only one of the references would be a mutable one, so the way Rust could have caught the bug would instead be the related rule: "you can't have an immutable reference and a mutable reference at the same time".


Except that the reference in question would have caused a lifetime error in Rust, which would have required the developers to explicitly acknowledge the choice they were making, likely by changing a bunch of types.

Yes, you could still do it in Rust, but any reviewer of the code would ask "why in the world are you doing it this way?", because it would be forced into a complex cross-call monstrosity.


[flagged]


An example from less than a day ago: https://news.ycombinator.com/item?id=22466354


I'm not super confident I understand the bug, but it looks like sequential access to the reference. If I'm not mistaken, a mutable borrow in rust would end up with the same bug.


I have not dug into the details enough to say whether this is a bug Rust would prevent or not; I am only responding to the claim that people do not sometimes suggest that Rust's complexity only matters in the multi-threaded case.


Gotcha.


If I understand correctly, each `authzPB` collected in the iteration stores references to fields of an `authzModel`. Before the patch, these were identical, referring to the fields of the loop variable v. On each iteration of the loop, v is reassigned, and all those stored references point to the new value.

Rust does give a compilation error for that.


That makes sense.


I think you're right, too.


Do you take issue with Rust?


What's particularly unfortunate about this is the comment just above the call:

    // Make a copy of k because it will be reassigned with each loop.
But v is reassigned with each loop too.

The real question is why there's so much pass-by-reference in the first place. `k` looks to be a domain name string -- it's almost certainly faster to copy it than to dereference it everywhere.


> The real question is why there's so much pass-by-reference in the first place. `k` looks to be a domain name string -- it's almost certainly faster to copy it than to dereference it everywhere.

I don't program in rust, so my knowledge here is limited to what these words mean in C/C++ - however shouldn't making a copy still require dereferencing the copy?


This is actually in Go, but the issue is the same.

Suppose you have a struct like this:

    struct foo {
        struct bar elem;
    } s;
If you know the address of `s`, you just calculate the address of `elem` from it and read the contents; a single memory read, all the data together cache-wise. Suppose on the other hand you have a struct like this:

    struct foo {
        struct bar *elemptr;
    } s;
If you know the address of `s`, you have to first read `elemptr`, and only then read the pointed-to `struct bar`. That's an extra memory fetch, and probably from a different part of memory than `s` itself. Copying on modern processors is very fast, and the resulting copy will be "hot" in your cache. So the conventional wisdom I've heard is that unless `struct bar` is quite large (I've heard people say hundreds of bytes), it's probably faster to just copy the whole structure around than to copy the pointer to it around and dereference it.

Caveat: I haven't run the numbers myself, but I've heard it from several independent sources; including, for instance, Apple's book on Swift.
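If you want to run the numbers yourself, here's a rough sketch using Go's built-in benchmarks (save it as e.g. layout_test.go; the struct and its 32-byte size are arbitrary choices of mine). Caveat: a toy loop like this keeps everything cache-hot, so it understates the pointer-chasing penalty described above:

    package layout

    import "testing"

    type payload struct{ a, b, c, d int64 } // 32 bytes

    var sink payload

    func BenchmarkCopyValue(b *testing.B) {
        v := payload{1, 2, 3, 4}
        for i := 0; i < b.N; i++ {
            sink = v // copy the struct directly
        }
    }

    func BenchmarkCopyViaPointer(b *testing.B) {
        p := &payload{1, 2, 3, 4}
        for i := 0; i < b.N; i++ {
            sink = *p // load the pointer's target, then copy
        }
    }

Run it with `go test -bench .`.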


Won't the pointer also be hot in cache in this case? I only ask because it seems to me like excessive data copying (and cache eviction) is a major source of slowness in modern programs. People are churning their cache to pieces by copying the world for every function call.

It's fine as long as your entire program fits neatly in cache, but once you exceed the cache size performance goes to hell because you force loads of misses of slightly-older data by constantly copying your working data.


> Won't the pointer also be hot in cache in this case?

Depends on the situation; the parent was somewhat vague.

As one example, suppose you've previously created a `struct foo`, but you haven't accessed it in a long time; then fetching any of its fields will probably be a cache miss. But if it contains a pointer, fetching data from that pointer will probably be a second cache miss – which can't even be started until the first one completes, because the CPU doesn't know what the pointer is. If `struct foo` contains the data directly, there's a good chance it all fits in a single cache line (typically 64 bytes). But even if it's split across, say, two cache lines, the CPU knows which cache lines need to be accessed, so it can send off the second request without waiting for the first one to complete.

That said:

- Strings are variable length, so they usually can't be stored in place anyway, although some languages/frameworks have "small string optimization", meaning that sufficiently short strings are stored in the space that would otherwise be used for the pointer and length.

- In general, Swift is hardly a good role model when it comes to performance. :p


Most importantly, programmers of the 2020s shouldn't be making these decisions.

Whether you choose to pass by reference or to copy should just indicate your desired semantics. The underlying compiler can decide whether it's really worth making a copy or not.

Sadly, most compilers won't elide unnecessary copy operations, even when doing so would improve performance.


Is it just me, or is the PR and its associated linking really lacking? The PR doesn't have a description, and neither it nor the commits link back to the original communication (or vice versa).


For even more context, this seems to have been on a Friday night (assuming US West coast) with production down: https://letsencrypt.status.io/pages/incident/55957a99e800baa...

I'll cut the LE team some slack on this one :) the PR does have tests


Sure. But it's now Tuesday. They should have gone back and edited in links to all the relevant documentation.


Emailing users and giving them only 24 hours before their certs are revoked seems very unreasonable. Say you are down and out with a stomach bug or on holiday for a day or two.

My understanding is that the 90 day lifetime is largely because revocation can be thwarted. Thus the practical difference between 24 hours and one week is meaningful for server admins, but inconsequential if someone is staging an attack.


The Baseline Requirements for publicly-trusted CAs (section 4.9.1.1) require timely revocation of mis-issued certificates - either 24 hours or 5 days depending on the reason. I'm not entirely certain which is applicable here, but I'd assume Let's Encrypt's hands are tied in this case.


That is a very useful bit of info. I guess if the mis-issuance happened on Friday evening PT, then five days is March 4th.


The misissuances have happened over the last several months (since at least December 2019), but it does seem that it was _discovered_ on Friday.


I'm glad they decided on the 24 hours, unlike CAs like Comodo which really shouldn't still be a CA after all their fuckups.


Having too short a reaction period can be equally devastating; customers need ample time to react to the issue so they don't panic and deploy something buggy without testing beforehand.


Not thwarted exactly, but the problem is that the question "Is this certificate still good?" has three possible answers:

1. "Yes, it's still good"

2. "No, it's revoked"

3. "There was a network problem so I'm not sure"

Of course bad guys who know you'd get answer 2 can most likely ensure you get answer 3 instead. So the only safe thing to do is treat 2 and 3 the same: if we're not sure this certificate is fine, then it's not fine. But in practice answer 3 is common anyway; for some users it may happen essentially all the time. So browser vendors don't like to treat 2 and 3 the same, even though that's the only safe option, and that soft-fail behavior thwarts the effectiveness of revocation.

There's definitely further opportunity for improved tooling here. Perhaps this incident will drive it (Let's Encrypt's sheer volume can help in this way).


OCSP is a request/response protocol intended to answer certificate validity questions. It works as you describe, and failures cannot be treated as errors. An attacker who stole a certificate can use it even after revocation by blocking access to the relevant OCSP responder.

https://tools.ietf.org/html/rfc2560

OCSP stapling is a mechanism by which a TLS server can make OCSP requests ahead of time and serve the response in-band. TLS clients get a certificate signed by the CA as usual, as well as a recent OCSP response signed by the CA attesting to its continued validity. OCSP stapling allows TLS clients like browsers to know a certificate's revocation status without having to make an extra request, but it changes nothing for an attacker who stole a certificate, since they can simply not staple a response.

https://tools.ietf.org/html/rfc6066#section-8

OCSP Must Staple is an option that can be included in a certificate stating "I promise to use OCSP stapling". An attacker who stole a "must staple" certificate can either staple an OCSP response showing the certificate is revoked, or omit the OCSP response entirely, which the TLS client will treat as a hard error; either way the connection is rejected.

https://tools.ietf.org/html/rfc7633

In short, RFC 7633 makes certificate revocation work. Web browsers and web servers support this today. If you use Let's Encrypt's `certbot`, pass it `--must-staple`.
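If you want to verify that a certificate you hold actually carries that marker, here's a small Go sketch (the extension is id-pe-tlsfeature from RFC 7633; "cert.pem" is a placeholder path):

    package main

    import (
        "crypto/x509"
        "encoding/asn1"
        "encoding/pem"
        "fmt"
        "io/ioutil"
        "log"
    )

    // id-pe-tlsfeature (RFC 7633); its presence with status_request
    // means "must staple".
    var tlsFeatureOID = asn1.ObjectIdentifier{1, 3, 6, 1, 5, 5, 7, 1, 24}

    func main() {
        pemBytes, err := ioutil.ReadFile("cert.pem")
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(pemBytes)
        if block == nil {
            log.Fatal("no PEM block found")
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }
        for _, ext := range cert.Extensions {
            if ext.Id.Equal(tlsFeatureOID) {
                fmt.Println("must-staple (TLS Feature) extension present")
                return
            }
        }
        fmt.Println("no must-staple extension")
    }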


If you have Must Staple but don't have monitoring in place to detect that your OCSP responses are growing stale before they expire (or worse, you use Apache HTTPD which will happily replace a GOOD OCSP response with a newer BAD one) then you'd still be screwed here when Let's Encrypt revokes certificates.

You need at least effective monitoring and a good OCSP stapling implementation (IIS is supposedly pretty good at this) or else stapling is sadly going to make life worse for you not better.


So in the case that revocation now works, why is there a continued push to shorten certificate lifetimes?


Multiple reasons:

1. Firefox remains the only mainstream browser to support OCSP Must Staple.

2. OCSP Must Staple does not cover all threat models: if an attacker gains the ability to temporarily issue certificates for the victim's domain (rather than obtaining the private key of an existing certificate), they can request a certificate without the OCSP Must Staple extension. A more effective method would be something like the Expect-Staple header[1] (in enforce mode).

3. It allows the ecosystem to move significantly faster. In a world where all certificates expire after 3 months, phasing out insecure hash algorithms (in certificates) would no longer take many years.

4. It encourages regular key rotation (even if it's not enforced)

[1]: https://scotthelme.co.uk/designing-a-new-security-header-exp...


Items 3 and 4 seem like weak arguments. We are still dealing with operating systems from 3+ years ago, so moving below a 1 year certificate length wouldn't buy much agility in terms of new algorithms.


Hash algorithms may not have been the best examples as they require client support.

A better example would be something like Certificate Transparency. Currently, browsers may require Certificate Transparency for certificates issued after a certain date. A malicious or compromised CA may work around this by backdating certificates. This would be less of an issue with shorter certificate lifetimes.


Well it is very inconvenient, but isn't this the strength of certificates?

When something is wrong you can revoke them immediately.

Why leave a potential vulnerability open for more than 24 hours?


Because revoking them will cause interruption to legitimate users, but doesn’t stop an attack.

I’m just starting to think LE is more aimed at large organizations than people running smaller configurations. Which is fine, thankfully we still have traditional CAs. I just hope we don’t devolve into a monoculture of ACME-only SSL.


LE is actually following the rules outlined by the Baseline Requirements. "Traditional CAs" have a tendency to just ignore them when convenient.

For example, Sectigo has misissued nearly every certificate since 2002, including ~11 million unexpired ones (as of December), and decided to just ignore their duty to revoke misissued certificates [1].

Should the rules be changed? Maybe. However, when you're giving an immense responsibility to CAs then public trust is paramount. Ignoring agreed upon rules whenever you find it convenient does not inspire much confidence.

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1593776


ACME-only isn't the problem, it's a Let's Encrypt mono-culture I'm concerned about.

We could do with another LE-style service (or two) operated independently (both organisationally and geopolitically).


Yeah I'd love to see one or more additional free ACME issuers that are largely functionally-equivalent to LE, but in a different jurisdiction and under different management, with separate infrastructure, etc.

One of the less-obvious reasons: for "serious" usage where you're also stapling OCSP responses, there's a dependency on the cert vendor's OCSP service. You can cache the OCSP outputs to get through short windows of unavailability, but if the vendor's OCSP goes offline for days or suffers some serious incident, it pays to have multiple vendors on-hand. There was such an incident with GlobalSign back in October 2016 (who's otherwise a pretty decent vendor!), so it is a legitimate concern.

For "serious" use-cases, you basically need redundant live certs from redundant vendors, and not having a second LE-like option means one of those is still a legacy CA for now...


There's buypass.no / buypass.com ... they are a Norwegian CA that also implements ACME. I have only used them for some testing certificates so far, which have not been deployed in the wild, but their server works, the certs are valid in all browsers, and they do up to 6-month valid certs IIRC.

link: https://community.buypass.com/


I'm using it in production with no trouble at all - I have one place where it's only possible to add SSL certs through a GUI so longer validity is a dealbreaker.


Blame other CAs for resting on their laurels and allowing LE to steal their marketshare.


Oh I'm not shedding any tears for legacy-style CAs. But just because the previous situation was bad doesn't mean that a LE monoculture won't also be bad (for varying definitions of "bad").


My point there was more that automation in the cert space could lead to traditional CAs leaving the space, in which case small operators like myself (a handful of minor servers) would be forced down the automation route, which isn't necessarily a net positive.


Absence of automation is why the CA death penalty is applied so late: without it, the consequent disruption is too great.


LetsEncrypt has made the strongest headway in large organizations with thousands of domains like Shopify, Heroku, website builders, etc. as it hits a really sweet spot of usability (controlling the host lets them approve issuance), cost (free) and control (they can trigger mass refreshes).


I think it is more aimed at technically competent users, regardless of organisation size. It is not, as it stands, suitable for direct use by non-technical people who can nevertheless follow step-by-step instructions to purchase and install a certificate from the traditional CAs. Similar 'hold my hand' tooling isn't there yet for LE. Nothing about the protocol itself mandates such short validity periods though I presume?

Nevertheless technical people bemoan average users clustering towards centralised web-hosts but forget the reality that hosting a website from your own desktop or a VPS is far from trivial even in 2020!


>Nothing about the protocol itself mandates such short validity periods though I presume?

Actually, revocation is broken. Which is a large part of why LE uses 90 days.


The problem is LE had five days between reporting the bug and revoking as per Baseline Requirements, but they decided to wait until the last day to send out E-mails. This is not only inconsiderate to customers but also downright reckless to sensitive deployments involving LE.


So lets see, the deal is that I get free certificates on any and all of my domains. I get an easy way to install and update my certificates that works with my nginx services. I can move the service to a new address and instantly get a new certificate. The certificates are universally trusted. I get notified by email when there is a problem along with a way to detect and fix the problem available to me.

I'm speechless. I used to pay real money to get certs without half the service I get now for free.

Thanks letsencrypt.


>I get an easy way to install and update my certificates

You're lucky, because it took me forever to devise a renewal system that met my requirements: being able to renew certificates for domains used across multiple machines, both physical and virtual, many of which share IPv4 behind NAT and thus cannot sanely use HTTP-01 renewal. Fine, use DNS-01, but: having to deal with my strict DNSSEC setup where I sign zones at home on a trusted device, keeping my keys off live servers, ensuring keys don't end up on untrusted hardware, and wanting my TLS private keys only on the servers they're actually used on - all that suddenly makes my deployment more complex.

What I'd rather do is, since my nameserver runs a cronjob to pull signed zones from my home device (unauthenticated, because the signed zonefiles are public information), incorporate that into my DNS-01 ACME verification too. But all the existing tools (certbot, dehydrated, acme.sh, et al) seem not to support this style of setup well, where I can batch my DNS challenges, wait for my nameserver to pick them up, and then verify the challenges en masse an hour or a day later (LE challenges last for seven days, so this is an acceptable delay for me).

But I'm stuck with the CNAME/alternate-NS approach, where I have to run yet another network-facing service from home for the duration of renewals, because I simply don't have the time or patience to sift through ACME's specification and implement a better tool suited to my needs.

>I get notified by email when there is a problem

LE has a mailing list for critical announcements in the interest of current customers and interested prospectors? Do tell, because all I know about is their blog and the Discourse forum, neither of which are solely for critical announcements.


> LE has a mailing list for critical announcements in the interest of current customers and interested prospectors? Do tell, because all I know about is their blog and the Discourse forum, neither of which are solely for critical announcements.

I think they plan to improve their communication after this mishap.

But you can use an RSS reader to subscribe to the incidents category (search for "incidents.rss" on this page), which is very low traffic. It's not “action required” level only, but with only two or three posts a year it may be suitable.


Yeah, I subscribed to that feed after seeing it posted here. It's good that they have something, but I'd much rather be opted in to "important updates regarding my account" like literally every other service does for me at this point.


> We confirmed the bug at 2020-02-29 03:08 UTC, and halted issuance at 03:10. We deployed a fix at 05:22 UTC and then re-enabled issuance.

Wow. That is a truly impressive way to handle a security bug and I know it’s not the first time Let’s Encrypt has responded extremely quickly.

I would love to hear how their engineering practices make this possible.


And I'm curious about the work culture part. They're a small organisation in terms of workforce, but somehow managed to respond within minutes on a Saturday. How does that work? Do they have shifts, or are the workers so devoid of private life that they respond in the early morning on a weekend?


Speaking for myself here.

The workforce is spread across the States. When you're drawn to the mission like I am, late nights here and there don't matter at all. I communicated with my wife, and I'm sure others informed their significant others what was going on and why Friday, Saturday, Sunday, Monday, and Tuesday would be thrown out of whack. Members of the team put in many more hours than I did, and that is truly impressive. It takes all of us with our different specialties to make an accurate and effective response.

Some of the things we do are internal post mortems, where we find ways to prevent the issue from happening again by improving alerting/monitoring, writing a runbook, fixing code, or fixing misconceptions about a part of the entire system. We do weekly readings of various RFCs, the Baseline Requirements, and other CAs' CP and CPS documents to again better understand our system and Web PKI as a whole. This is an understatement, but we heavily rely on automation. From the moment the call was made to stop issuance, an SRE was ready to run the code that disables the issuance pipeline.

The biggest takeaway is that communication and leadership makes all the difference.

I have to go, there's work to be done.


For me at least, 24/7 incident response is completely acceptable in a properly compensated role, so long as it's accompanied by a culture that says preventing such incidents in the first place is Job #1.

That is, I'm OK with being woken at 0200 to try to understand and if appropriate fix or recover from a disaster only so long as if I'd suspected this might happen the people expecting me to be awake at 0200 would have given me the resource (money, people, whatever) to fix it. If I feel like I don't have that support, I'll only start looking at your disaster during my working day.

My impression is that ISRG pays a lot of attention to preventing disasters, so if I worked for ISRG (not very practical since they're based on the US West Coast and I live in England) I'd be comfortable taking a call in the middle of the night to fix things.


Operations based on the US West Coast could definitely benefit from a few people on the other side of the world, to achieve 24/7 coverage while keeping a good work-life balance.


Local nerds can be nocturnal too. Letting people pick their preferred shift is just as important as accommodating other kinds of physiological diversity.


Can confirm.


You'd normally want the 24/7 people to be part of the day-to-day operations, otherwise they will quickly stop being up to date and fall out of the knowledge loop, so selecting a reasonable timezone set is not trivial.


Yeah, my assumption was that the team in another TZ would also work together on the same thing. Yes, it has some challenges, but there are a few upsides besides the on-call coverage (e.g. an increase in talent pool).


Yeah, as long as it doesn't happen often. I'm technically always on call but we haven't had an on call incident in close to a year.

Basically I keep a phone and laptop on me at all times.

This is in comparison to a friend that works somewhere that always has daily on call incidents that are not actually problems 95% of the time. That would piss me off even if I weren't always on call.


It's not my job to decide what counts as a "problem", but if the boss wants me to handle the CTO drunkenly leaving her phone in a taxi as an "emergency" to react to at 0200 then as described above I want the resources to handle that in advance when I predict it'll obviously keep happening. If my boss doesn't want to divert people or money on preventing such bullshit problems, then I don't want to get woken when they happen.

At huge scale there just will be incidents every day, if you're SRE for Google then every day is a bunch of problems - but at that scale those are routine incidents and you can afford the follow-the-sun team (actually I think Google SRE found it was better not to have true follow-the-sun but just two groups) to handle them, leaving only still rare "San Francisco just fell into the Bay" level incidents to actually wake people up.


Head of Let's Encrypt here.

We have an on-call rotation and a system for getting others notified and online quickly when necessary. We make sure not to bring too many people online so that some people are fresh and can rotate in later if the incident lasts longer.

It's not often that staff have to put in time at night or on weekends, and when it happens we work hard to make sure the problem doesn't happen again.


I assume they probably have 24h support, but also consider time zones: 3 AM Saturday in UTC is Friday evening in San Francisco:

https://duckduckgo.com/?q=03%3A00+utc+to+pst&ia=answer


Out-of-hours response to a critical problem is standard. You can achieve it in various ways, but they all boil down to people who know what they're doing having a professional ethic. Typically it isn't possible to have shifts of engineers with deep understanding of the code on call, so ultimately you need to wake someone up. So remember to keep an up-to-date note of key staff's home phone numbers and home addresses.


The blog post only says they halted issuance within minutes of confirming the bug - not that they confirmed the bug within minutes of receiving a bug report.


It takes time to correctly determine if a bug report is real and to determine the possible scope of the bug.


In California, it was Friday evening still.


They don't have many incidents or outages. As a result, it's much easier to respond to the incidents that do occur.


It also only affects you if you are issuing the certificate for more than one domain name if I'm reading it right.

What's supposed to happen:

  For each fqdn in the request
    if challenge succeeds (eg dns-01)
      check whether caa record exists
      if (it doesn't) or (it does and allows issue)
        issue certificate
In the step on "check whether caa record exists", instead of using the domain name that is being issued in this loop, it uses the first one it found (or one of them, it's unclear which one). So theoretically, if you wanted a cert for:

  domain1.example.com, domain2.example.com
and you had a CAA record for domain1 that allowed letsencrypt but then a different CAA record was added between the CAA check on domain1 and the CAA check on domain2 (which wouldn't happen because of the bug) you could get a cert for domain2 that the CAA record said not to issue.
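In Go-ish terms, the intended per-name flow looks something like this sketch (all helper names here are hypothetical stand-ins, not Boulder's API):

    package issuance

    import "errors"

    var (
        errChallengeFailed = errors.New("challenge failed")
        errCAAForbids      = errors.New("CAA forbids issuance")
    )

    func checkAndIssue(requestedNames []string) error {
        for _, fqdn := range requestedNames {
            if !challengePassed(fqdn) { // e.g. dns-01
                return errChallengeFailed
            }
            // The bug: per the loop-variable aliasing discussed upthread, this
            // recheck effectively ran against one repeated name rather than
            // the current fqdn.
            if !caaAllowsIssuance(fqdn) {
                return errCAAForbids
            }
        }
        issueCertificate(requestedNames)
        return nil
    }

    // Hypothetical stand-ins for the real checks.
    func challengePassed(fqdn string) bool   { return true }
    func caaAllowsIssuance(fqdn string) bool { return true }
    func issueCertificate(names []string)    {}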


Kinda frustrating having certs revoked when there have never been CAA records for any names involved. I know they can't know that to be true historically, but I wish they could do some additional filtering.


Here's some quick&dirty stats from the list of revoked certificates: https://gist.github.com/SpComb/6338facd12e020ec4fe561ca91f32...

There's 3M "missing CAA checking results" in total, of which 2M are dated from 2020 and 1M from last month. FWIW the only certs of mine affected were old certs from 2019-12 which had since already been renewed in Feb, and the renewed certs are not affected?

The largest account has 445k certs revoked, and the most revoked certs from last month (most likely to still be in active use?) is 43k for a single account. I hope your rate-limits are in order if you're going to start reissuing all of those before midnight :/

BTW account number 131 at the top of the file seems to mostly be akamaiedge.net sites :)


There is a list of all affected certificates posted under https://letsencrypt.org/caaproblem/ - and it looks like they are also leaking the account IDs in that list, so now you can map different domains/certificates to the account that got them issued.


Yeah, it does seem like it'd have been sensible not to list the account ID in this file. It's convenient if you know your account ID and want to pull out just your certs, but for most people this associates all their certificates together.

If you own both https://www.happy-rainbow-nursery.example/ and https://hardcore.bdsm-videos.example/ you probably go to some lengths to avoid visitors realising the connection. Nothing you're doing is illegal or even unethical - but it's obviously going to cause uncomfortable conversations so why not avoid that altogether. Let's Encrypt aren't doing you a favour if tomorrow a mom at nursery says now she knows why you sound so much like Masked Mistress Martha...


Yea, this is really bad. I've done some searching of the data. Sometimes it doesn't matter. It looks like whoever is currently running gab.com is probably a big consulting company with like 100 other clients, so there's no big relation there. But if you run a small personal blog and use the same e-mail address for maybe more controversial sites that are hosted on different IPs, now you could get doxed.

I'm guessing customer IDs are associated with e-mail addresses? This seems like a good case for using different e-mails for every cert. There are open source tools like anonaddy.com you can host yourself or buy from them (they have a decent free tier).

I feel like this list seriously needs to be pulled. There is some serious lack of oversight here.


> I'm guessing customer IDs are associated with e-mail addresses?

They are (on Let's Encrypt's end), if an email address was provided.

It's a 1:n relation, the same email may be used for any number of ACME accounts. Roughly speaking, for most clients, the ACME account maps to a specific ACME client on a specific host. If you run three servers with separate ACME clients, you're probably using three ACME accounts (even if you're using the same email and issuing certificates for the same domain).

Large or custom implementations may reuse the same ACME account across many servers and domains. (Issuance would typically be centralized and operated as a separate system in these scenarios.)


There are several links in this thread, but the following page allows you to enter your hostname and check online (no SSH, terminal access, etc). It's from the Let's Encrypt team (linked in the blog post).

https://unboundtest.com/caaproblem.html


Alas, only works for things on port 443 which is a bit of a problem for most of my certificates...


If anybody needs a bulk check:

    for domain in $(cat domains.txt); do printf "$domain :" && curl -XPOST -d "fqdn=$domain" https://unboundtest.com/caaproblem/checkhost; done


>CAA

What is a CAA? Let's Encrypt: please don't use initialisms in customer-facing blog posts without giving the FULL name at first use. It makes things more learnable and googleable.



CAA is a term of art for a standard encoded in DNS. It stands for "Certification Authority Authorization", but most people who know what a CAA record is probably do not recognize it written as words. (I know I would need to read it several times to know that's what they meant, and I do this for a living.)


Expanding other DNS record names isn't very helpful either. I know what a Canonical Name for something is, but CNAME seems clearer; Mail Exchanger definitely isn't a helpful way to think about what MX records are for...

PTR and TXT aren't initials; they're just short for "pointer" and "text", neither of which is much help divining what they're actually used for. And presumably AAAA doesn't actually stand for anything at all (?) other than being four times bigger than the A record.


4 A's for ipv6 of course, and 1 A for ipv4. How could that possibly be confusing?

Had I been given a vote we would be using AAAA and AAAAAA records.


It's a free service. I understand your frustration, but CAA is a well known DNS record type. If you read the description of the problem, it explains it.


I've been doing development and devops for years and have never heard of a CAA. Then again, I use http-01 and not dns validation, so that's probably why. Before LetsEncrypt I'd buy certs and it involved a lot of copying/pasting CSRs into web interfaces.


To be clear, CAA is relevant even if you're using http-01. CAs need to check whether the CAA records of a given domain allow/forbid issuance in addition to any of the methods used to demonstrate domain ownership to the CA.


Certification Authority Authorization.

It's a fairly common initialism in CA/TLS world (heh). The DNS record is also named "CAA".


Confused me too. They even already have a page they could have linked to. https://letsencrypt.org/docs/caa/


This bug only affects you if you got domain validation (eg by dns-01) but didn't immediately issue a certificate.

Let's Encrypt considers a domain-ownership validation good for 30 days, so the bug allows you to issue a certificate within that window, even if you added a CAA record after validation that says "don't allow issuance by Let's Encrypt" or "only allow issuance by MyCA.example.com".

But if you have everything automated, you're checking for renewal and issuing every day, probably validating as part of that, so you're unlikely to encounter the bug - unless you validate in one step and then, sometime 8h+ later, issue a certificate.
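As a sketch of that rule (the 8-hour and 30-day figures are the ones discussed in this thread; the helper is hypothetical):

    package acme

    import "time"

    // A reused validation is good for up to 30 days, but a CAA check only
    // counts for 8 hours, so older validations need their CAA rechecked.
    func needsCAARecheck(validatedAt, now time.Time) bool {
        return now.Sub(validatedAt) > 8*time.Hour
    }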


I don't think this is true - I don't recall using domain validation. I think it's more related to multi domain certs.


Thank you. I was just about to run tests on all my clients' certs, but as I've never done that, it seems it's not useful.

(Will still have a look, but less stressfully.)


Yes, I use an automated tool to issue/renew them, and all of mine that I checked were listed as OK.


Will a nightly certbot invocation replace these revoked certificates without manual intervention?


Apparently you have to manually use --force-renewal for certbot to regenerate new certificates (I just did it just in case, even though I'm 99% sure that I'm not concerned).

I assume that by default certbot only checks the expiration date of local certificates against the system clock; it doesn't ping any external resources, so it can't be aware that the certificate might have been revoked even though it hasn't expired.

I agree that it would be nice if there were such an option, although I assume it would increase the server load significantly if certbot connected to Let's Encrypt's servers at every invocation, so maybe that's why they didn't do it.


> I assume that by default certbot only checks the expiration date of local certificates against the system clock, it doesn't ping any external resources so it can't be aware that the certificate might have been revoked even though it hasn't expired.

I think the actual issue here is that the certificates have not been revoked yet. We know that they will be revoked, which is why we have to run with --force-renewal, but there is no process for certbot to know that a certificate, although not revoked, will soon become revoked. I would expect certbot to automatically renew the next time it's run post-revocation.


Actually, I think it would not be a significant issue, at least for a range of situations. A static file with timestamp ranges could be served resource-efficiently to signal clients in case of an issue.


Run `certbot renew --force-renewal`. That's what it says in the email they sent me. But if you didn't get an email, then your domains should not be affected.


I didn't get the email, and my domains were affected. (Two of them are in the list, one of them current.)

I do get "you didn't renew your certificate" messages on a semi regular basis (domains that have passed out of my control) so I know they have my details.


My cert is affected, and I didn't get an email.

Check your domain using the linked online tool.


Prior to today's 1.3.0 release, no; in versions of Certbot ≥ 1.3.0, yes.


I don't believe so.

I hope they add support for that soon.


It was included in the Certbot 1.3.0 release today.

https://community.letsencrypt.org/t/certbot-1-3-0-release/11...

(only certbot-auto users are likely to get this release immediately)


Do you have analytics that give you some idea what fraction of subscribers use different versions of Certbot and other ACME clients? The API is HTTP, so presumably all clients could provide an HTTP user agent, although obviously some will show up as generic library defaults.

Was this feature something already planned (albeit presumably not for release today) or was it entirely inspired by this problem? I confess if you'd asked me "What extra features does Certbot need?" I would not have listed "Check OCSP to trigger renewal" though in hindsight it's a good idea.


> Do you have analytics that give you some idea what fraction of subscribers use different versions of Certbot and other ACME clients?

Those do exist, though I don't personally have access to them. When things are calmer, you might ask bmw or jsha on the Let's Encrypt forum for some more information.

> Was this feature something already planned (albeit presumably not for release today) or was it entirely inspired by this problem?

I wrote (incomplete) code for this feature several years ago -- specifically inspired by the idea that certificates might sometimes be revoked unexpectedly -- but I don't think anyone planned to continue working on it until this problem came around.


Thankfully it's fairly painless to see if you're affected:

    curl -XPOST -d 'fqdn=example.com' https://unboundtest.com/caaproblem/checkhost
Replace example.com with your Fully Qualified Domain Name.


That can only test port 443, and not, say, SSTP on port 8384, SOAP/HTTPS on port 3443, or other random ports for various internal HTTPS-layered applications.


It can only test hosts that it has access to. Otherwise you can download the file and check your serial number against the list.


It can't test hosts that aren't on 443 though, regardless of access to the relevant port, e.g. (obvs. elided mydomain here.)

    $ curl -XPOST -d 'fqdn=pop3.mydomain:995' https://unboundtest.com/caaproblem/checkhost
    invalid name pop3.mydomain:995
    404 page not found
    
    $ openssl s_client -connect pop3.mydomain:995 -showcerts </dev/null 2>/dev/null | openssl x509 -text -noout | grep -A 1 Serial\ Number | tr -d :
        Serial Number
            0325f31485b9c0f393e27b00e4678e881e3c


All my recent certificates are affected. I think it is the same case for the majority of us.


The most likely cause will be that you issued similar certificates a few days (more than eight hours) earlier with the same Let's Encrypt account.

Suppose on Wednesday you get a cert for example.com and www.example.com, and then on Thursday you realise you also need images.example.com - you use the same ACME account (if you run Certbot this will happen by default if you use the same machine and user account, it silently makes you a free account if you don't have one already) and so Let's Encrypt can see that this account showed proof-of-control for two of these names on Wednesday, so only fresh proof-of-control of images.example.com is needed. Unfortunately this bug means Let's Encrypt forgot to re-check CAA for the old names, and so there's a risk they technically were no longer authorised to issue for these names and shouldn't have given you the Thursday certificate.

Rather than try to argue about whether it's appropriate to disregard this check, Let's Encrypt decided to revoke all ~3 million affected certificates. That's maybe 2-3 days worth of normal issuance in the last 90 days, so lots but hardly "the majority".


I use DNS validation to obtain one certificate with both domain.tld and *.domain.tld wildcard. My certificates seem not to be affected. My certs last auto-renewed on 2020-02-20.


If like me you have several hundred certificates to check, please do something like this:

    cd somewhere-nice
    wget https://d4twhgtvn0ff5.cloudfront.net/caa-rechecking-incident...
    gunzip caa-rechecking-incident-affected-serials.txt.gz
    mkdir -p serials

    for i in $(cat domains); do
        (openssl s_client -connect $i:443 -showcerts < /dev/null 2> /dev/null \
            | openssl x509 -text -noout \
            | grep -A 1 Serial\ Number | tr -d : | tail -n1) | tee serials/$i
    done

    cat serials/* | tr -d " " | sort | uniq > serials.collated

    grep $( cat serials.collated | head -c-1 | tr "\n" "|" | sed -e 's/|/\\|/g' ) caa-rechecking-incident-affected-serials.txt

It will take a moment and then it may tell you that letsencrypt misspoke when they said they sent emails to everyone whose contact details they have.


I thought I was in the minority there! We have 45 certificates (of many more) that were affected, and our account id was listed, and it has an email contact associated. I got no email whatsoever, but I'm glad I had the foresight to check anyway.


I just noticed I got an email at 1949 UTC. I guess they're still sending them out. Presumably some people will receive their emails after the revocation.


I spoke to someone from the team, they’ve got another 10% to go (presumably much less now). I finally got mine as well, and they’re still coordinating to figure out the timeline to revoke. Presumably they’ll wait for the emails first.


I suppose this is one way to answer my question.

I asked (on m.d.s.policy) on the 29th how many issuances were affected; Jacob replied saying they intended to spend that day figuring out the answer, but then there was nothing further from him. The incident doesn't seem drastic enough to prompt urgent answers, so I intended to revisit later this week if I heard nothing further.

Now we have a complete list of affected certificates instead (the answer to my original question is about 3 million).

I was sort of hoping the answer was going to be like five thousand or something manageable. Alas. In hindsight I guess this was to be expected.


Well, from my understanding, they have no idea of whether or not a given certificate was misissued as long as it meets the baseline criteria for triggering the bug (which seems to be just "issued after X date with more than 1 domain").

So while the number of certificates that should not have been issued due to a blocking CAA record is likely small (or possibly even 0), they have to revoke every cert that could have triggered the bug, as they have no way to travel back in time and find out what the CAA records they didn't check would have been.


The criteria also require that the challenge answers used to validate control were old (more than eight hours old).

If all proof-of-controls were fresh the CAA checks are also fresh for those proof-of-controls so there's no bug. That's why the big list is "only" about three million certificates.

Suppose you own example.com, example.org and example.net and all you do is every 60 days or so you spin up Certbot once to get a certificate for six names, the three domains and the associated www FQDNs - that won't trigger this bug because each time your old proof of control have expired and new fresh ones will be used, triggering fresh CAA checks.

You're right though that it's likely the number of truly mis-issued certificates may be zero because the most common way to have a CAA record deliberately changed to forbid Let's Encrypt after having successfully done a proof-of-control (the scenario that would trigger their bug) is researchers looking for bugs in CAA checking, and of course such researchers would have reported this to Let's Encrypt triggering exactly the same incident but probably at a more friendly time like a Monday morning.


Yeah my domain was affected. Renewed, that would have sucked if I hadn't seen this! Also, I didn't get an email and I'm pretty sure my certs were generated with my email!


FWIW, I believe any Caddy sites will not be affected by this, since Caddy does not manage multi-SAN certificates. Even if it did, Caddy will immediately replace a certificate when it sees a Revoked OCSP response. So, if you're using Caddy, there's probably nothing you need to do. But if you are a Caddy user and are impacted, let me know.


Does anyone know of a generic way to detect that letsencrypt will revoke a certificates soon?

The goal would be to have our automation automatically rotate the certificates when similar issues occur in the future.


To my knowledge, there's no such mechanism in any of the relevant protocols (i.e. ACME and OCSP).
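What you can do is poll OCSP for your own certificates and reissue as soon as the status flips to revoked - essentially what the new Certbot release mentioned elsewhere in this thread does. A rough sketch using golang.org/x/crypto/ocsp:

    package revcheck

    import (
        "bytes"
        "crypto/x509"
        "io/ioutil"
        "net/http"

        "golang.org/x/crypto/ocsp"
    )

    // isRevoked asks the cert's OCSP responder (taken from the AIA
    // extension) whether the certificate has been revoked.
    func isRevoked(cert, issuer *x509.Certificate) (bool, error) {
        reqDER, err := ocsp.CreateRequest(cert, issuer, nil)
        if err != nil {
            return false, err
        }
        httpResp, err := http.Post(cert.OCSPServer[0],
            "application/ocsp-request", bytes.NewReader(reqDER))
        if err != nil {
            return false, err
        }
        defer httpResp.Body.Close()
        body, err := ioutil.ReadAll(httpResp.Body)
        if err != nil {
            return false, err
        }
        resp, err := ocsp.ParseResponseForCert(body, cert, issuer)
        if err != nil {
            return false, err
        }
        return resp.Status == ocsp.Revoked, nil
    }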


That thread enlightened me to a great trick to force cert renewal even if it's been done too recently: add a second (sub)domain and make a new cert with both.


Here's a bash one-liner to check all the domains you have:

    for domain in $(cat list-of-domains.txt); do curl -s -X POST -F "fqdn=$domain" https://unboundtest.com/caaproblem/checkhost; done | sed '/is OK./d'


Oh Let's Encrypt - what a chaotic Tuesday this was for me - I was informed yesterday at 19:00 (UTC+7) and it was a fun evening - many befriended companies and their clients were not informed at all.

Funny to find out the otherwise awesome traefik has no force-renewal option other than messing around with acme.json until it renews.


PSA: The unboundtest checker doesn't seem to work if you have certificates issued for both ECC and RSA keys. For some of mine, it passes the check with status "OK" and shows the serial number of the certificate for the ECC key. The certificate that is going to be revoked is not shown.


If you have more than one certificate in use, regardless of what flavour, they only see one and assess that. Maybe the checker should emphasise that. For small users they probably only have one certificate in use, so this avoids some problems.

The issue would probably also affect people who have geographically separate certificates e.g. if you have two servers in different regions and decided rather than make things more complicated for key distribution you'll just have them each get their own certificates for the same name - that's totally fine with Let's Encrypt (it doesn't scale, but if you had 500 servers not 2 you'd probably redesign everything) but obviously this test only sees one of those servers and won't check the other certificate.

There's no way to know, given that two (or more) valid certificates exist for a name, and seeing one of them, whether the others are still actively used anywhere.

It would obviously be pretty easy to build a web form where you can type in an FQDN and get told if any certificates matching that name will be revoked, but then you get false positives where it says yes, this certificate for some.name.example will be revoked, you rush to replace your certificate for some.name.example but maybe actually the one that will be revoked is from 20 December 2019, and you already got a newer one which was unaffected in February.


I wish they'd issue short-term scoped CAs under the same criteria as they currently use for wildcards.

No significant load on their infrastructure, and you'd not have to break the "private keys don't move over _any_ network" rule.


https://datatracker.ietf.org/doc/draft-ietf-tls-subcerts/

Delegated Credentials is the proposed mechanism to do what you actually want to achieve here. The certificate issuance is mostly the same except there's an OID 1.3.6.1.4.1.44363.44 and the digitalSignature KU is present meaning this certificate is intended to be used with Delegated Credentials and thus can sign things.

Although tightly constrained subCAs (which is roughly what you're describing) might be a way to achieve your goal, it's seen as disproportionately complicated and risky, so hence this proposed much more narrowly defined new feature for TLS.



Is there an RSS feed that you can subscribe to for alerts like this? I can't see one.


https://community.letsencrypt.org/c/incidents.rss perhaps?

(Pro tip: you can append .rss to many pages on discourse to get an RSS feed)


Is there anything to suggest whether this relates to Certbot provisioning, API provisioning, or both? We've checked a whole bunch of API v1-provisioned certs against their tool and nothing has been listed so far.


FWIW I think the reason we're unaffected (as far as we can tell so far) is because we're not re-issuing certs within a short time period. The bug on their end was to do with checking CAA records: if you re-issued a multi-domain cert within a short period of time after the initial provision, it wouldn't re-check the CAA records. This meant that subsequent CAA changes wouldn't be checked, and theoretically a cert could be re-issued despite a CAA record having been added to prevent this. As I'm reading it, if you didn't re-issue within this timeframe then your cert can be assumed to be correct, as the original CAA check wasn't a problem.


Only 13 out of the ~3,500 certificates I manage required renewal.


3,500? wow. What are you managing if I may ask?


A CMS platform with customer provided domains.


I haven't gotten an email (I think), and I don't see anything about it on their web site or on their blog.

Is wading through Discourse threads now the new minimum requirement for using services?


No, you can get your information from Hacker News.

(Do check your domain even if you didn't get an email, since they have not delivered emails to everyone who is affected.)


They have known for over 30 days and they didn't notify me until about 10 hours ago that several of my domains were affected.

That's not okay. That's serious negligence.

EDIT: My bad. They have known about this for maybe a week, and they did a blog post 4 days ago. I assumed they had known for 30 days based on the date in the title of the post and in the URL: https://community.letsencrypt.org/t/2020-02-29-caa-recheckin...


I'm confused about what you mean; 2020-02-29 was 4 days ago, not 30.

They didn't know the extent of the bug until the 29th of February: https://bugzilla.mozilla.org/show_bug.cgi?id=1619047#c1


Yes, I just got their mail. Four certs out of six, all issued at the same time, were affected. The other two were not.


Looks like LE will be adding to the billion certs they've issued!

https://news.ycombinator.com/item?id=22434466


Can we get a “(some)” for the title?


Certificates are such a huge maintenance problem. So many mines waiting to blow up. We really need something better.


So it starts


So what starts?


random bugs


So, nothing new



