The Modern Observability Problem (failingfast.io)
109 points by kiyanwang on Nov 20, 2022 | 53 comments


People are seeing observability as a separate subject from product making. I see them as two sides of the same "does this thing work" coin. To a product person, "does it work" means "are people using the tool to solve their problem?" To an engineer, it means "is the tool solving the problem correctly?" The former is answered by various analytics tools while the latter is answered by QA and observability tools.

IMO this is doubling effort. Given all the logging, metrics, and traces, can we not tell how well the product is performing? Given that we know how a feature is used, all the clicks and API calls, can we not guarantee that future changes won't break a user journey? I'm currently prototyping a tool to capture my thinking on product and software development. Having OTel makes my idea so much easier to realise.


One major issue to reconcile is the granularity of data and retention. Engineers incentivized to fix problems need extremely high granularity, which is expensive to store long-term. Product managers typically care more about aggregations that they want to track over a long period of time. In theory you could source both concerns from the same place, but it's challenging, since you'd need to process, store, and query that data in different places. This doesn't make it impossible, and I think people should try it - but it may be too hard for a lot of organizations.
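
As a rough illustration of sourcing both from the same place - a minimal sketch in pandas, with made-up column names: keep the raw per-request rows in a short window for engineers, and roll them up into the long-retention daily aggregates the product side looks at.

    import pandas as pd

    # Raw, high-granularity events: one row per request, expensive to keep long-term.
    raw = pd.DataFrame({
        "ts": pd.to_datetime(["2022-11-20 10:00:01", "2022-11-20 10:00:02",
                              "2022-11-21 09:30:00"]),
        "latency_ms": [120, 480, 95],
        "status": [200, 500, 200],
    }).set_index("ts")

    # Long-retention rollup for the product view: one row per day.
    daily = raw.groupby(raw.index.date).agg(
        requests=("latency_ms", "count"),
        p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)),
        error_rate=("status", lambda s: (s >= 500).mean()),
    )
    print(daily)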


I've served both roles, PM and engineer. Having high-level metrics is great when things are going smoothly. You will need to drill down and zoom in on the user journey when things don't work as expected, e.g. increased drop-off, 404s, 400s. Another aspect is causality: given a result, what caused it? Was it a deployment, a marketing campaign, or just seasonality? A product person should want the ability to zoom in on the data. Right now the only option is through communication.

As pointed out by a sister comment, developers need to know how their code is impacting the business. I would even argue developers are inherently product people. You are always building for someone, be it another developer, internal users at your company, or the general public. You have to know how people are using your tool. Hence, as a developer, you should want to be able to zoom out from whatever your observability platform is currently offering.

Guess what: developers think product managers are stupid because all they know is either looking at data and measuring the wrong thing, or chasing status updates. Product managers think developers are nerds who only know how to punch a keyboard and won't talk to users. They are both right and wrong. But we create the chasm ourselves. There are alternative ways of building things.


> People are seeing observability as a separate subject from product making. I see them as two sides of the same "does this thing work" coin.

I have noticed the same thing with performance and UX. Most SaaS vendors only care about client/frontend UX and neglect the backend aspect of it. After a few years of intense development, they end up with good-looking software that is also mindbogglingly slow and, in some extreme cases, unusable.


Mixing up these two concerns is a bad idea. Your product manager now has to convince engineers to implement the metrics they want. Data that is irrelevant to engineers now needs to be dragged along through the system so that events are useful both to engineers who care about data and latency and to product managers who care about individual user behaviour.

So while there is indeed overlap, the two types of metrics involve different people, different teams, different concerns, different data, different privacy concerns, and different reliability/uptime concerns.


All of your arguments are great arguments for treating observability as a core product feature and tying it to the other core product features.

Engineers need to be serving the org's goals, and one of the best ways to do that is to give them the incentive and tooling to think about the same things the product owner is. Have them build instrumentation for business metrics. Those are the things you should be alerting on, and to the extent you alert on technical pieces, it should be in the service of those business metrics; don't alert on latency because it's an arbitrary goal, alert on latency because you can see how it affects user experience.


Structure follows strategy. There are various reasons product managers exist, but the role doesn't exist unless the team grows to a certain scale. All founders serve as product managers at some point in their startup journey because they need to know whether the thing they built works. There are companies that want to spend their energy on building things that work, and skipping a layer of communication is one way of doing it.

That's also how compilers sometimes speed up a program, by inline expansion or loop unrolling. It doesn't work all the time, but it's an option.

Right now there seems to be an orthodoxy in product development.


Observability is the other side of statistics.


We run a pretty complicated SaaS system.

All these tools have their limitations (and we have all of them: we use Prometheus, we have tracing, we have logs - your entire stack of everything ;) ). There is a limit to your ability to tell what's going on inside a black box based on those; sometimes they'll answer the question you're interested in, and sometimes they will not.

As pointed out elsewhere in the comments, tracing every single interaction in your system doesn't work/scale, and often the one failure you care about is not going to leave a trace. Similarly with metrics: at some point just measuring everything with the right labels becomes too expensive. More than once I've gone looking for a specific metric to help troubleshoot something and found we don't have it (despite having a ton of metrics for everything).

Alerting on metrics can be very tricky because you may not have good context - some requests might be slow because they're big, some might be fast - and finding rules that tell you when the system isn't behaving is extremely difficult. Usually it's the users/customers that are going to tell you that.

Adding metrics, tracing, alerts, dashboards etc. etc. takes time/effort. This needs to be weighed against time spent on other things that can improve the quality of the product. Like design, testing, etc. Really understanding what the requirements are and how the system behaves. Just because Google or Meta set that balance somewhere doesn't mean you need to. Likely your system is significantly smaller and less complex.

This is not a new problem; logging and other methods of observability have been with us since the beginning of time, and it's always been something that needs to be approached with balance. There's some logging that adds value and there's a point where it becomes counter-productive. When things break, more often than not the logging just gives you a starting point for debugging, not the answer.

My personal philosophy is to invest in quality early on and you will reduce your operational costs. Simpler and more reliable software needs less monitoring, and conversely no amount of monitoring is going to turn poor-quality software into reliable software. There are many domains where the software just has to be right (say, in your car or airplane) and you can't rely on someone monitoring the software to go and fix things if they go wrong... That said, it's always about the balance. You shouldn't care about the fashion of the day or what Google does. You need to decide where the balance is for your product, so that it optimizes things over its lifetime with the given constraints. Every project is going to have a different balance.


If you are thinking of adopting OpenTelemetry, you should check out Odigos: https://github.com/keyval-dev/odigos (I’m the author). This tool handles instrumentation for any application (even Go) and also manages the collector pipeline.


Increasingly, the network is a failure boundary people take for granted.

Micro/3rd-party services exacerbate this problem. You may see latency increase for a particular call, but what tells you why it's increasing? What's measuring all of these tech choices? How do you know your 3rd-party API is serving you traffic reliably?


I've become a proponent of OTel tracing in recent months, having used it successfully to diagnose some performance issues in multi-language, multi-service systems. I've found it also useful in single-process scenarios where heavy use of "async" prevails. Async-ish things (Kotlin coroutines and Scala futures in this particular case) make it hard to reason about the linear behavior of code using traditional debugging tools, I find. Disclosure: I've also made a couple of very small contributions to the project.


I’m curious, do you have more details?


OpenTelemetry is moribund and a technological dead-end. Godspeed to any org that builds on it.


I've seen it hyped up like crazy, what's so bad about it and what is the superior alternative?


Instrumentation of a given system is by definition always "larger" than the nominal traffic which flows through that system. That is, given some request R, the possible observable metadata of that request R is essentially infinite. So when you're designing an observability system, the whole ball game is about capturing and constraining the cardinality of that metadata. Observability as a field is _entirely_ about reducing the volume of telemetry data such that it can be usefully captured, aggregated, and queried by operators. That's why we have metrics, traces, and logs as distinct things. They're all optimizations of the same underlying information, each optimized for different consumer use cases.

Concretely: if user U1 does an HTTP GET to /spotify/track/123, that's perhaps 10KB of production traffic, but easily 10MB of telemetry traffic. In effect that telemetry can be modeled as a huge key/val map of metadata, but you can't do it that way and remain efficient, you have to optimize for observability use cases. You have to increment a cardinality-bound set of metric counters for the request outcomes, and maybe emit some best-effort trace data for the request ID, and so on — _as separate things_!
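
To make the "separate things" point concrete in code - a minimal sketch with prometheus_client and a structured log line; route and field names are purely illustrative:

    import json, logging, time
    from prometheus_client import Counter, Histogram

    # Cardinality-bound aggregates: a handful of label values, cheap to keep forever.
    REQUESTS = Counter("http_requests_total", "Requests by route and outcome",
                       ["route", "status"])
    LATENCY = Histogram("http_request_seconds", "Request latency", ["route"])

    log = logging.getLogger("requests")

    def handle_get_track(request_id: str, track_id: str) -> None:
        start = time.monotonic()
        status = "200"  # pretend we served the request
        # The label is the route template, not the full path with the track ID in it.
        REQUESTS.labels(route="/spotify/track", status=status).inc()
        LATENCY.labels(route="/spotify/track").observe(time.monotonic() - start)
        # Best-effort per-request metadata: sampled, short-retention, never aggregated.
        log.info(json.dumps({"request_id": request_id, "track_id": track_id,
                             "status": status}))

The counters stay queryable forever at near-zero cost; the per-request detail is a separate, disposable stream.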

The engineering costs dominate the design. But OpenTelemetry says this isn't the case. OpenTelemetry says that whatever requirements are on FX00 company CTO feature checklists are valid a priori, and commits to doing whatever is necessary to satisfy them. That's because OpenTelemetry is evaluated not on any technical merits, but on the adoption rate of the CNCF stack among those FX00 companies.

OpenTelemetry is explicitly and exclusively a thing meant to tick off an "observability" checklist item on the checklist of a FX00 CTO's due diligence form. That's it. That's the only goal. Nobody with a choice should be using it. Read the source code, it's abysmal.

Alternatives? Write code that leverages each pillar of observability directly. There's no short-cut. That's the whole point.


> Alternatives? Write code that leverages each pillar of observability directly. There's no short-cut. That's the whole point.

I think I am not able to follow you correctly. Is your entire point that auto-instrumentation is too much and one should default to manual instrumentation instead?


You can't delegate "instrumentation" as a monolithic concern to a single authority or package or library like OpenTelemetry and expect to get anything but noise on the other side. And yeah, basically any vendor claiming to do "auto-instrumentation" is selling snake-oil. Expensive snake-oil, too!

Observability is not something that a vendor can provide without meaningful and deep integration in your infrastructure. You have to do some amount of work, and IMO few vendors deliver value beyond what a single engineer can produce with a basic internal Prometheus infrastructure + short-term log aggregation.


But OpenTelemetry never claimed to be an auto-instrumentation-only library; you can very well just manually instrument your application with the metrics you want and export them to your self-managed Prometheus + log infrastructure. In fact, OTel makes it easier if in the future you want to move to a better self-managed TSDB + log infra, because it most likely has an exporter ready, with zero effort needed to re-instrument your metrics or logs.
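
Something like the following, for example - a rough sketch of manual OTel metric instrumentation in Python feeding an existing Prometheus (module paths and metric names from memory, so treat them as approximate; the reader could later be swapped for an OTLP exporter without re-touching the instrumentation):

    from prometheus_client import start_http_server
    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.exporter.prometheus import PrometheusMetricReader

    # Expose a /metrics endpoint for the self-managed Prometheus to scrape.
    start_http_server(8000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

    meter = metrics.get_meter("checkout")
    orders = meter.create_counter("orders_completed", description="Completed checkouts")

    def complete_order(plan: str) -> None:
        # Manual instrumentation: we choose exactly what to record and which attributes to attach.
        orders.add(1, {"plan": plan})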


You can definitely define an abstract concept of telemetry which generalizes over use cases, and which can be expressed as a single schema that can be collected generically. The problem is that this definition is too general to be practically useful.

The whole ball game for observability systems is optimization for specific consumption use cases. That _must_ occur at the point of origin, it _cannot_ be deferred. OTel says that it's possible to define a general-purpose exporter for arbitrary telemetry data, and that specialization and optimization of that generalized data can be done later, downstream. This is simply not true.


Could you point me to docs, architecture notes or code where they explicitly say auto-instrumentation is the ONLY approach to using OTel? AFAIK both auto-instrumentation and manual instrumentation are equally supported - https://opentelemetry.io/docs/concepts/instrumenting/


A few of the newer vendors have evangelized the "capture everything, ignore costs" mantra, leading people to places where angels fear to tread.


> So when you're designing an observability system, the whole ball game is about capturing and constraining the cardinality of that metadata.

If your world is only time series databases that struggle with cardinality, then sure. Fortunately, there's a lot more tools out there, some of which do just fine with high cardinality data.

I don't really agree much with your entire comment. I think you're looking at observability through the lens of 2010s-era tools, and falling deeply into the trap of thinking that this is the only way to do things.


My core claim is that telemetry data for a system is always more than the production data for that system. Yes? No?

Assuming yes, everything else I'm claiming is noncontroversial.


I don't think anyone opposes that, but I don't think that's your "core claim". Your core claim is that this is a huge problem, one that makes OpenTelemetry "a technological dead-end", and that "the whole ball game is about capturing and constraining the cardinality of that metadata".

That is the part it would be nice to have you justify some more. After all, the same could be said of logs, or of internal traffic when using microservices/DBMS (request cardinality/traffic will be multiplied).


I count about a dozen claims, so I'm not really sure where to begin. But I disagree with your comment here too.


(shrug) OK. I'm a domain expert here, but that doesn't mean my claims are infallible, of course. Your call. Good luck.


Most or all of the observability domain experts I'm familiar with either stopped talking about "pillars" several years ago, or have been actively speaking against the "3 pillars" framing for several years.

Most of your takes here sound like they're from somewhere around 2016-2018.


Oh no. Now what should I use? I can't believe something that's actively in development is considered dead these days. Surely there's a better alternative with as good or better community support, right? Right...?


The person you're replying to seems to have an ideological bent against OTel. I'll say that "moribund" is about as unrealistic a term as it gets for OTel, given that it's now the #2 CNCF project by activity and still growing.

That said, OTel is far from being easy to adopt still, and despite a lot of us trying hard to change that, it's got a long way to go. If you're using one of the "major" languages (Java, .NET, JS, Python) then it's pretty easy to set up automatic instrumentation and a Collector that you can tee off to your preferred backend analysis tool. But if you need more context from your apps, manual instrumentation outside of tracing is pretty hit-or-miss, and you need to build up a vocabulary around concepts (Resources, attributes, baggage, context, spans, etc.) that isn't easy unless you've got the time to sink into it. It's extremely powerful and has the building blocks to let you capture just about any data you need, pluggable processors and exporters, etc. -- but these building blocks are very unevenly composed into easy-to-use, turnkey-ish components that a lot of people ultimately want.
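
For a sense of that vocabulary in practice, here's a rough manual-tracing sketch in Python - service, span, and attribute names are invented, and a console exporter stands in for a Collector:

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Resource = who is emitting telemetry; the exporter here just prints spans,
    # but would normally point at a Collector that tees off to your backend.
    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)

    def charge_card(order_id: str) -> None:
        # Span = one timed operation; attributes = the per-request context you care about.
        with tracer.start_as_current_span("charge-card") as span:
            span.set_attribute("order.id", order_id)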


The problem isn't a lack of community, it's that the project is solving a problem that isn't relevant to anyone except Fortune-X00 companies.


Yeah I'm gonna call bullshit on that one. I work daily with everyone from startups and nonprofits to big banks. They use OTel because it solves problems for them. They don't all use it the same way, but they use it.


So, public sector here. We use Honeycomb with OTel and it's most definitely solving our problems. It's incredible being able to see traces up and down our whole stack and pinpoint where an issue is.


What RPS do you serve, net? Everything works to trivial scale, e.g. O(10k) RPS and below. Do you do more than that per server?


You sound like you're trying to have it both ways: "OTel solves a problem that is irrelevant unless you're massive" but also "well of COURSE it can work if you're not massive, how massive are you?"

There are tons of startups and other not-fortune-x00 orgs benefitting from what OTel provides. Your claim that OTel is irrelevant outside fortune-x00 cos is very clearly not true.


I don't agree that OTel delivers value to not-fortune-x00 orgs. It's bloatware and whatever benefits it delivers are more cheaply achieved with other solutions.


Well, we're certainly not Fortune 500 scale, so it works for us.


Too much telemetry is more of a problem than not enough, in my recent experience. I am 100% sure the line I need to find out what happened is there in Kibana, but every extra filter term needed to cut a trillion lines of log output down to a specific time and sequence adds a risk of filtering out exactly what I want.


What you're describing is why tracing (it's not just for distributed systems!) and tail sampling are employed in practice over just tons of log lines. Structuring the data you generate and sampling it is how to approach this. If you sample 100% of errors (or some other meaningful signal) and 1% or so of the rest - just so you have a good baseline to compare against - and attach metadata for a tool to reweight counts, you can more or less have your cake and eat it too.
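
The decision itself is tiny - a sketch in Python of the policy described above, independent of any particular collector's API (the span fields are made up):

    import random

    def keep_trace(spans: list[dict], baseline_rate: float = 0.01) -> tuple[bool, float]:
        """Tail-sampling decision, made after the whole trace has been buffered.

        Returns (keep, sample_rate); the rate travels with the trace as metadata
        so a query tool can reweight counts (one kept baseline trace ~ 100 real ones).
        """
        if any(s.get("status") == "error" for s in spans):
            return True, 1.0            # keep 100% of error traces
        if random.random() < baseline_rate:
            return True, baseline_rate  # keep ~1% of healthy traces as a baseline
        return False, 0.0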


In general, Erlang/Elixir solved this challenge in an interesting way: with peer state awareness and channels (functionally equivalent to most micro-services use cases). It is commonly the back-end secret in a lot of low-latency game back-ends.

If you are stuck on a polyglot-loving project, then RabbitMQ or Kafka can bolt on most functions with the standard AMQP services. Erlang/Elixir is weird, but is a single kind of weird... which even has built-in profiling tools without external dependencies.

Best of luck, =)


Erlang asserts two simplifying assumptions about distributed computation:

1. Actors can define incoming message queues with unbounded capacity

2. Actors can always crash in response to any given error condition

These assumptions produce a coherent system model, which is unfortunately incompatible with any physical system that would implement it -- unbounded queues are a fiction -- and which is effectively non-deterministic and unpredictable -- crash-only software makes it impossible to assert any guarantees on a callstack.

Erlang represents a sub-optimal local maximum. It's no panacea.


"It's no panacea." Agreed, but it does offer a unified model that abstracts away the repetitive error-prone duplicated code. The upfront training costs are offset by locking out vendor proprietary extortion-ware.

Admittedly, it probably isn't worth the effort if you have fewer than 48k concurrent users. But the distributed consensus implementation still needs to be added to many projects (RabbitMQ essentially packages these features for you).

From my perspective, it comes down to two choices:

1. Clown show Federation and Shovel for an extra dose of chaos

2. Monoculture Clustering with the OTP, and partitioning by region

Then again, I am not smart... so YMMV =)


Erlang asserts that all errors can be grouped together and managed as if they are of the same category, i.e. that any error is crashworthy. This model of errors isn't wrong, but it is naïve, and it makes it impossible to design a system whose entities are resilient to what are, factually, normal runtime faults.

Erlang and the OTP are interesting and worth studying and may be a good design choice under certain constraints. But in no way does the crash-only model of the Erlang OTP represent an optimal architectural model in general. Technology has advanced since 1960.


I believe you’re oversimplifying.

The approach that seems most common with serious Erlang systems is to identify the common errors and handle them without crashing.

Crashing on less common, or impossible to reproduce, errors allows you to gradually improve your error handling over time if it’s worthwhile, or just allow it to continue to crash into a saner state.

See the “bohrbug” vs “heisenbug” discussion: https://ferd.ca/the-zen-of-erlang.html


"Technology has advanced since 1960"... I would argue not significantly, and in some metrics it is now less reliable. When you see GPU lock up and burn down a power supply, the concept of hardware failure modes will change. LOL

Cheers ;)


Good luck to you, then.


I have no experience with Erlang, but wouldn't ABENDing one or both of the sending / receiving actors if a queue spills over be entirely consistent with both the model described and the finite nature of real systems?

I'm also curious about this:

> ...crash-only software makes it impossible to assert any guarantees on a callstack.

What sort of guarantees are we talking about here?


I would recommend: "Designing Elixir Systems with OTP: Write Highly Scalable, Self-Healing Software with Layers" (James Edward Gray II and Bruce A. Tate).

Erlang, and its simplified Elixir wrapper language, can allow an OTP child process to continually crash while the rest of the system remains operational for other users.

Some tend to get irritated by the idea of graceful fail-back code. Again, that's something a dead-letter queue in RabbitMQ handles rather trivially.

Have a wonderful day. =)


I've generally seen three approaches to this:

  - use cloud offerings, because it's easy to integrate with them and they're one of the better options out there; this isn't viable in some contexts, or when you don't have money allocated for it
  - set up the full Elastic Stack or Sentry, or something enterprise like that, and have your stack be composed of multiple interconnected pieces of software that need constant maintenance or even people dedicated to managing them, as well as a not-insignificant amount of resources
  - go for a lightweight offering, like JavaMelody for Java applications, or one of the simpler fully featured stacks, like Apache Skywalking, and try to make do with their more limited feature sets and possibly more limited documentation

For me, Apache Skywalking feels "good enough", although definitely not perfect: https://skywalking.apache.org/

The Docker Compose stack for it doesn't look as complicated as Sentry's; it's basically an almost monolithic piece of software, like Zabbix, and it works okay. The UI is reasonably sane to navigate and there are agents you can connect for most of the popular languages out there.

That said, the UI sometimes feels a bit janky, the documentation isn't exactly ideal and the community could definitely be bigger (niche language support). Also, ElasticSearch as the data store feels too resource intensive, I wonder if I could move to MySQL/MariaDB/PostgreSQL for smaller amounts of data.

Then again, if I could make monitoring and observability someone else's problem, I'd prosper more, so it depends on your circumstances.


What I like about ETL tools like Dagster and Prefect is that you get observability “for free”. You can set the granularity by deciding what is a task/job/flow/op and how they’re grouped together. And then in one UI you get logs, metrics, a waterfall view with timed executions, all kinds of useful information.
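
To illustrate, a tiny sketch in Prefect 2.x style (function names made up): whatever you decorate as a task or flow becomes a unit the UI tracks, with state, timings, and logs per run.

    from prefect import flow, task

    @task(retries=2)           # each task run gets its own state, timing, and logs
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def transform(rows: list[int]) -> list[int]:
        return [r * 10 for r in rows]

    @flow                      # the flow groups the task runs into one observable run
    def etl() -> list[int]:
        return transform(extract())

    if __name__ == "__main__":
        etl()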

It’s so useful that sometimes I’m tempted to reach for it in non-ETL contexts. My problem is that these tools generally don’t mesh well with real-time streaming requirements.


Adding telemetry plus yet another service to fix the complexity of having a lot of services sounds a bit like de-escalating a conflict by slapping someone in the face.

It always depends on the specific use case of course, but maybe it's also worth investigating whether reducing the overall complexity of the system + the number of microservices could be a solution.


Well, you are gonna shut down this budding cottage industry of Observability Consultants :)


heh, seriously.



