We run a pretty complicated SaaS system.

All these tools have their limitations (and we have all of them: we use Prometheus, we have tracing, we have logs, your entire stack of everything ;) ). There is a limit to how much you can tell about what's going on inside a black box based on those. Sometimes they'll answer the question you're interested in, and sometimes they won't. As pointed out elsewhere in the comments, tracing every single interaction in your system doesn't work/scale, and often the one failure you care about is not going to leave a trace. Similarly with metrics: at some point, just measuring everything with the right labels becomes too expensive. More than once I've been looking for a specific metric to help troubleshoot something and we didn't have it (despite having a ton of metrics for everything). Alerting on metrics can be very tricky because you may not have good context. Some requests are slow simply because they're big, while others are fast, so finding rules that tell you when the system isn't behaving is extremely difficult. Usually it's the users/customers that are going to tell you that.
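To put a rough number on the "right labels becomes too expensive" part: in Prometheus every distinct combination of label values is its own time series, so the worst case for a single metric is the product of the number of values each label can take. Here's a back-of-the-envelope sketch in Python. The label names and counts are made up for illustration, not taken from our system, and the result is an upper bound, since not every combination shows up in practice.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric:
    the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets (illustrative numbers only).
base = {"endpoint": 50, "method": 4, "status": 6}

# The label you usually want when troubleshooting: which customer was it?
with_customer = {**base, "customer_id": 2000}

print(series_count(base))           # 1200
print(series_count(with_customer))  # 2400000
```

One extra label that would actually answer the troubleshooting question multiplies the series count by 2000 in this example, which is roughly why the metric you want is often the one you couldn't afford to keep.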

Adding metrics, tracing, alerts, dashboards, etc. takes time and effort. This needs to be weighed against time spent on other things that can improve the quality of the product, like design and testing, or really understanding what the requirements are and how the system behaves. Just because Google or Meta set that balance somewhere doesn't mean you need to set it in the same place. Likely your system is significantly smaller and less complex.

This is not a new problem. Logging and other methods of observability have been with us since the beginning of time, and they've always needed to be approached with balance. There's some logging that adds value, and there's a point past which it becomes counter-productive. When things break, more often than not the logging just gives you a starting point for debugging, not the answer.

My personal philosophy is: invest in quality early on and you will reduce your operational costs. Simpler and more reliable software needs less monitoring, and conversely, no amount of monitoring is going to turn poor-quality software into reliable software. There are many domains where the software just has to be right (let's say in your car or airplane) and you can't rely on someone monitoring the software to go and fix things if they go wrong. That said, it's always about the balance. You shouldn't care about the fashion of the day or what Google does. You need to decide where the balance is for your product, the one that optimizes things over its lifetime within the given constraints. Every project is going to have a different balance.