
They are making it sound like they did everything right and it was an issue with a third-party library. If we listed all the libraries our code depends on, it would be in the thousands. I can't comprehend how a CDN does not have any canary or staging setup, so that a single update could make everything go haywire in seconds. I think it is standard practice in any decent-sized company to have staging/canary and rollbacks.


That's not the impression I got. Yeah, their takeaway was to stop using BinaryPack, which I disagree with. However, it sounded to me like they very much understood that they made the biggest error in putting all of their eggs in one basket.

Your system WILL go down eventually. The question is: how will you recover from it?


Right, this was our biggest failure (not the only one of course, but we are here to improve). Relying on our own systems to maintain our own systems.

We are dropping BinaryPack mainly because we're a small team, and it wasn't really a big benefit anyway, so spending more time than necessary to try and salvage that makes no sense. This was more of a hot-fix since we don't want the same thing repeating in a week.


That makes sense then with the additional context.

I don't know the details of your operation, but keeping your ability to update your systems separate from your systems is something I'd strongly encourage.


I came to post that, yeah. I work on a sensitive system where people can lose millions from a few minutes of downtime, and we are a bit anal about week-long pilots where half of prod is in a permanent canary stage.

But it also feels like they used their own infra to set up their stuff, so if their infra was dead they couldn't roll back, which sounds like a case of people being a bit too optimistic.

We had catastrophes too, notably from poison pills in a record stream we can't alter, but this update cascade crash sounds avoidable.

It's always easy to judge anyway; it always happens to you eventually :D


This. While failure, human or not, is unavoidable in the long term, from their writeup they do not seem to have procedures to avoid this particular mode of failure.



