> The temperature spike should have been the first alert. The throttling should have been the second alert. The high error count should have been third.
This is completely backwards. What you are describing is cause-based alerting, which is strongly discouraged by SRE. SREs prefer symptom-based alerting (i.e. are users seeing errors?) because you only get alerted when you know there is a problem that affects your business. If you only have cause-based alerts, you will get false alarms and un-actionable pages all of the time.
Imagine being an SRE who manages the GFEs in this situation. Now imagine getting a page telling you that the temperature in a rack where your job is running is hot. What are you supposed to do? Is the issue affecting users? If so, how many? Is it anomalous? Now go poke around the system and try to figure out what is wrong. What would be the first thing you check? Probably the error rate/ratio, the thing you actually care about. If that is your workflow, why not just alert on your error rate in the first place and figure out the rest as you go along?
"The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.
I follow Google's symptom-based alerting philosophy in general, but will make an exception when there's a chance to catch something getting dangerously close to an unambiguous hard-failure limit (e.g. 90% quota utilization).
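To make the "trajectory to breach in N minutes" idea concrete, here is a minimal illustrative sketch of that kind of predictive check (in production you would more likely use something like Prometheus's `predict_linear` over a recording rule; the sample interval and 90-degree limit below are assumptions, not anyone's real numbers):

```python
def minutes_to_breach(samples, limit, interval_min=1.0):
    """Least-squares linear fit over recent temperature samples;
    returns the estimated minutes until `limit` is breached, or
    None if the trend is flat or cooling."""
    n = len(samples)
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var  # degrees per minute
    if slope <= 0:
        return None  # not heating up: nothing to predict
    if samples[-1] >= limit:
        return 0.0   # already over the limit
    return (limit - samples[-1]) / slope

# Rising ~2 degrees/min toward a hypothetical 90-degree limit:
eta = minutes_to_breach([70, 72, 74, 76, 78], limit=90)  # -> 6.0
```

Whether a 6-minute ETA should page a human or file a ticket is exactly the disagreement in this thread; the math itself is actionable either way.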
> "The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.
Do you really think that the folks in the Google datacenters are not monitoring rack temperature? Implementing symptom-based alerting does not mean you should not be monitoring other system metrics. Their monitoring system probably filed a P1 or P2 ticket for someone to go take a look at it at some point. But should a person be paged at 2 A.M. to repair this? Absolutely not.
> "The temperature is on a trajectory to breach the machine's operating limit in N minutes" seems pretty actionable.
For who?
For an application, the job will just get rescheduled onto a different machine, and you're N+2 or whatever so no one will notice the restart.
For the datacenter: since (for most jobs) you can assume that no one will care, you don't necessarily monitor for single machines overheating, but for rack- or row-level issues. If a machine keeps overheating, eventually symptom-based alerting ("nothing runs on this machine for more than an hour without it restarting") kicks in and tells someone to look at the machine.
Hardware-based alerts shouldn't go to the SREs, but rather to the people running the datacenter, who are in charge of making sure that broken hardware is replaced.
Yes, and I am sure that they were monitoring this, but is a single rack having issues worth paging someone? Do you think they should immediately send someone out to investigate when a temperature anomaly is detected? Like I said elsewhere, this was likely added to a queue of lower-priority bugs.
Yeah, that makes sense. I fixated on the "SRE" part of your post, rather than the "page in the middle of the night" part. I'm working on being less pedantic, so thanks for reminding me that I've still got work to do. <3
> What you are describing is cause-based alerting, which is strongly discouraged by SRE. SREs prefer symptom-based alerting (i.e. are users seeing errors?) because you only get alerted when you know there is a problem that affects your business.
The truth, of course, is somewhere in the middle. Symptom-based alerting is extremely tricky to get right in the first place. What is "within budget" for a service can be a total outage for a customer. Real example from the past: we paged GCS with "error rate within this bucket is high, snapshot restore timeouts", and the reply was "it is within our SLO".
Another situation is when problems in background jobs cause massive outages later. Again, real examples from the past: a zone running out of quota and, in another instance, a multi-million-dollar bill due to GC not being monitored correctly.
> The truth, of course, is somewhere in the middle.
I agree. As I have stated in other comments, cause-based signals should not page; they should file a ticket, or be displayed alongside a symptom-based alert if there is a correlation. You should be monitoring as much as possible so you can observe your service in real time.
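The page-vs-ticket routing policy above can be sketched in a few lines. This is purely illustrative (the signal names and the `symptom_based` flag are hypothetical, not any real monitoring system's schema):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    symptom_based: bool  # does this measure what users actually see?

def route(signal: Signal) -> str:
    """Routing policy from the comment above: symptom-based signals
    (e.g. SLO error-budget burn) page a human; cause-based signals
    (temperature, throttling, ...) file a ticket for business hours."""
    return "page" if signal.symptom_based else "ticket"

route(Signal("error_ratio_slo_burn", symptom_based=True))    # -> "page"
route(Signal("rack_temperature_high", symptom_based=False))  # -> "ticket"
```

The cause-based signals still get recorded and displayed; the only thing the policy changes is whether they wake someone up.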
> Symptom-based alerting is extremely tricky to get right in the first place.
Depends on the service. If you are operating an HTTP API, a simple SLI would just be the ratio of 500s to total requests. It's not perfect, but it is much better than alerting on CPU percentage or RAM usage for a particular machine. Things can get more complex if your API is k8s-style or you are operating a data plane that is serving customer traffic.
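A minimal sketch of that error-ratio SLI check, assuming counts aggregated over some window (the 0.1% threshold and the low-traffic guard are made-up illustrative numbers, not anyone's real SLO):

```python
def should_page(total_requests: int, errors_5xx: int,
                slo_error_ratio: float = 0.001,
                min_requests: int = 100) -> bool:
    """Symptom-based check: page only when the observed 5xx ratio
    over the window exceeds the SLO threshold. Skip windows with
    too little traffic to judge, to avoid noisy pages."""
    if total_requests < min_requests:
        return False
    return errors_5xx / total_requests > slo_error_ratio

should_page(10_000, 5)   # 0.05% error ratio -> False
should_page(10_000, 50)  # 0.5% error ratio  -> True
```

Real systems usually layer burn-rate windows on top of this (fast burn pages, slow burn tickets), but the core signal is the same ratio.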
> What is a “within budget” for a service, can be total outage for a customer.
Now you are getting into SLO territory, which is similar to, but separate from, symptom-based alerting. Yes, defining good SLOs is very hard and varies greatly by the type of service you are running.
> another situation is when problems in background jobs will cause massive outages later.
How do you know which signals to alert on? If you knew which causes/signals to alert on, then wouldn't you design your background jobs not to do the things that would cause an outage? Hindsight is always 20/20. If you are worried that a cause-based signal from your backend _could_ be telling you that there _might_ be a problem in the future but your customers are not seeing errors, it is not business critical and you can just file a ticket and someone can look into it later.
A machine breaking is something you should respond to before it affects users. You don't have to react to high temps, but thermal throttling is just as serious as a dying hard drive.
Imagine if you didn't replace drives in your RAID until it hurt the user!
The SRE can ask the data centre worker to take a look - or better yet the data centre worker would be notified directly and the SRE wouldn’t need to be in the loop.
But the point is that warnings are often not good predictors of problems. There's a good argument that "warning" on anything at all is an anti-pattern, because you're just attracting valuable attention towards something that may or may not be anything to worry about.
Where warnings are useful is when something is actually now going wrong - they provide valuable context that can help an engineer figure out what the issue might be, which is exactly how it worked in this case.