> The temperature spike should have been the first alert. The throttling should have been the second alert. The high error count should have been third.
This is completely backwards. What you are describing is cause-based alerting, which is strongly discouraged by SRE. SREs prefer symptom-based alerting (i.e. are users seeing errors?) because you only get alerted when you know there is a problem that affects your business. If you only have cause-based alerts, you will get false alarms and un-actionable pages all of the time.
Imagine being an SRE who manages the GFEs in this situation. Now imagine getting a page telling you that the temperature in a rack where your job is running is hot. What are you supposed to do? Is the issue affecting users? If so, how many? Is it anomalous? Now go poke around the system and try to figure out what is wrong. What would be the first thing you check? Probably the error rate/ratio, the thing you actually care about. If that is your workflow, why not just alert on your error rate in the first place and figure out the rest as you go along?
"The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.
I follow Google's symptom-based alerting philosophy in general, but will make an exception when there's a chance to catch something getting dangerously close to an unambiguous hard-failure limit (e.g. 90% quota utilization).
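To make the "trajectory to breach in N minutes" idea concrete, here is a minimal illustrative sketch of that kind of predictive check (in production you would more likely use something like Prometheus's `predict_linear` over a recording rule; the sample interval and 90-degree limit below are assumptions, not anyone's real numbers):

```python
def minutes_to_breach(samples, limit, interval_min=1.0):
    """Least-squares linear fit over recent temperature samples;
    returns the estimated minutes until `limit` is breached, or
    None if the trend is flat or cooling."""
    n = len(samples)
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var  # degrees per minute
    if slope <= 0:
        return None  # not heating up: nothing to predict
    if samples[-1] >= limit:
        return 0.0   # already over the limit
    return (limit - samples[-1]) / slope

# Rising ~2 degrees/min toward a hypothetical 90-degree limit:
eta = minutes_to_breach([70, 72, 74, 76, 78], limit=90)  # -> 6.0
```

Whether a 6-minute ETA should page a human or file a ticket is exactly the disagreement in this thread; the math itself is actionable either way.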
> "The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.
Do you really think that the folks in the Google datacenters are not monitoring rack temperature? Implementing symptom-based alerting does not mean you should not be monitoring other system metrics. Their monitoring system probably filed a P1 or P2 ticket for someone to go take a look at it at some point. But should a person be paged at 2 A.M. to repair this? Absolutely not.
> "The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.
For who?
For an application, the job will just get rescheduled onto a different machine, and you're N+2 or whatever so no one will notice the restart.
For the datacenter: since (for most jobs) you can assume that no one will care, you don't necessarily monitor for single machines overheating, but for rack- or row-level issues. If a machine keeps overheating, eventually symptom-based alerting ("nothing runs on this machine for more than an hour without it restarting") kicks in and tells someone to look at the machine.
Hardware-based alerts shouldn't go to the SREs, but rather to the people running the datacenter, who are in charge of making sure that broken hardware is replaced.
Yes, and I am sure that they were monitoring this, but is a single rack having issues worth paging someone? Do you think they should immediately send someone out to investigate when a temperature anomaly is detected? Like I said elsewhere, this was likely added to a queue of lower-priority bugs.
Yeah, that makes sense. I fixated on the "SRE" part of your post, rather than the "page in the middle of the night" part. I'm working on being less pedantic, so thanks for reminding me that I've still got work to do. <3
> What you are describing is cause-based alerting, which is strongly discouraged by SRE. SREs prefer symptom-based alerting (i.e. are users seeing errors?) because you only get alerted when you know there is a problem that affects your business.
The truth, of course, is somewhere in the middle. Symptom-based alerting is extremely tricky to get right in the first place. What is "within budget" for a service can be a total outage for a customer. Real example from the past: we paged GCS with "error rate within this bucket is high, snapshot restore timeouts", and the reply was "it is within our SLO".
Another situation is when problems in background jobs cause massive outages later. Again, real examples from the past: a zone running out of quota and, in another instance, a multi-million-dollar bill due to GC not being monitored correctly.
> The truth, of course, is somewhere in the middle.
I agree. As I have stated in other comments, cause-based signals should not page; they should file a ticket, or be displayed alongside a symptom-based alert if there is a correlation. You should be monitoring as much as possible so you can observe your service in real time.
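The page-vs-ticket routing policy above can be sketched in a few lines. This is purely illustrative (the signal names and the `symptom_based` flag are hypothetical, not any real monitoring system's schema):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    symptom_based: bool  # does this measure what users actually see?

def route(signal: Signal) -> str:
    """Routing policy from the comment above: symptom-based signals
    (e.g. SLO error-budget burn) page a human; cause-based signals
    (temperature, throttling, ...) file a ticket for business hours."""
    return "page" if signal.symptom_based else "ticket"

route(Signal("error_ratio_slo_burn", symptom_based=True))    # -> "page"
route(Signal("rack_temperature_high", symptom_based=False))  # -> "ticket"
```

The cause-based signals still get recorded and displayed; the only thing the policy changes is whether they wake someone up.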
> Symptom-based alerting is extremely tricky to get right in the first place.
Depends on the service. If you are operating an HTTP API, a simple SLI would just be the ratio of 500s to total requests. It's not perfect, but it is much better than alerting on CPU percentage or RAM usage for a particular machine. Things can get more complex if your API is k8s-style or you are operating a data plane that is serving customer traffic.
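A minimal sketch of that error-ratio SLI check, assuming counts aggregated over some window (the 0.1% threshold and the low-traffic guard are made-up illustrative numbers, not anyone's real SLO):

```python
def should_page(total_requests: int, errors_5xx: int,
                slo_error_ratio: float = 0.001,
                min_requests: int = 100) -> bool:
    """Symptom-based check: page only when the observed 5xx ratio
    over the window exceeds the SLO threshold. Skip windows with
    too little traffic to judge, to avoid noisy pages."""
    if total_requests < min_requests:
        return False
    return errors_5xx / total_requests > slo_error_ratio

should_page(10_000, 5)   # 0.05% error ratio -> False
should_page(10_000, 50)  # 0.5% error ratio  -> True
```

Real systems usually layer burn-rate windows on top of this (fast burn pages, slow burn tickets), but the core signal is the same ratio.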
> What is a “within budget” for a service, can be total outage for a customer.
Now you are getting into SLO territory, which is similar to, but separate from, symptom-based alerting. Yes, defining good SLOs is very hard and varies greatly by the type of service you are running.
> another situation is when problems in background jobs will cause massive outages later.
How do you know which signals to alert on? If you knew which causes/signals to alert on, then wouldn't you design your background jobs not to do the things that would cause an outage? Hindsight is always 20/20. If you are worried that a cause-based signal from your backend _could_ be telling you that there _might_ be a problem in the future but your customers are not seeing errors, it is not business critical and you can just file a ticket and someone can look into it later.
A machine breaking is something you should respond to before it affects users. You don't have to react to high temps, but thermal throttling is just as serious as a dying hard drive.
Imagine if you didn't replace drives in your RAID until it hurt the user!
The SRE can ask the data centre worker to take a look - or better yet the data centre worker would be notified directly and the SRE wouldn’t need to be in the loop.
But the point is that warnings are often not good predictors of problems. There's a good argument that "warning" on anything at all is an anti-pattern, because you're just attracting valuable attention towards something that may or may not be anything to worry about.
Where warnings are useful is when something is actually now going wrong - they provide valuable context that can help an engineer figure out what the issue might be, which is exactly how it worked in this case.