Efficiency trades off against resiliency (nelhage.com)
124 points by tim_sw on April 16, 2023 | hide | past | favorite | 63 comments


While I generally agree that the pattern this article describes is real (there's some degree of tradeoff between robustness and efficiency), I've frequently seen engineers fall back on that logic rather than actually thinking about the specific problem they're facing, and spending five minutes trying to come up with a creative solution.

For example, talking about the CPU utilization of a web service, your service can perform a mix of time-critical work (serving queries) and background tasks (indexing, etc). If a server is running at 100% CPU but 30% of that is spent performing background work, the amount of slack available for a sudden surge in demand is 30%, and I suspect it's almost never the case that a large service sees an unexpected 30% load increase in less time than it takes to boot another machine, so such a system could be both efficient and robust with few downsides.

Implementing that system isn't as easy as just throwing more Kubernetes nodes at your problem, especially given that the ecosystem of tooling isn't designed to make it easy. For example, it would be really nice if load balancers used realtime performance metrics to balance traffic at a millisecond level.

Perhaps the real lesson is that "many of the easiest ways to achieve robustness involve sacrificing large amounts of performance", but I reject the idea that we should use that as an excuse to accept terrible performance, with all the monetary and environmental impacts it brings.


> If a server is running at 100% CPU but 30% of that is spent performing background work, the amount of slack available for a sudden surge in demand is 30%,

In this context, the first half of that sentence is usually interpreted as something like "every 10 days, the computer is capable of doing 10^8 tasks, and there arrive 7×10^7 time-critical tasks and 3×10^7 background tasks."

As you can see, if the background tasks eventually need to get done, demanding more of this system doesn't work because it would not be able to finish what it's supposed to. 100 % utilisation leaves no slack, no matter what kind of tasks they are.

Your proposal only works if

- Utilisation is less than 100 % when looking at a longer time frame (then we can do fewer background tasks during high load but catch up on the backlog when load is lower);

- You are able to spin up new workers in response to increased load (this is the same as utilisation being lower than 100 %); or

- The background tasks aren't actually demand at all, but just things that are nice to do opportunistically. (And then again, utilisation is lower than 100 % even if it doesn't seem this way.)


I was assuming that it's option 2 (you can spin up new workers within 10 days). If you're using cloud compute this is almost always true.

If you're building an on-prem cluster, I'm assuming you either spin up cloud workers for the extra load, have extra servers that are fully powered off and can be booted in a few minutes, or just physically order and install new hardware. Amazon can ship you a computer in 2 days, so it's not inconceivable to design an on-prem infrastructure to allow significant scaling with short notice.

Option 3 is sorta always true as well, since there's always compute load at your company that's lower priority, and can be shed in an emergency (for example, your CI system).

In any case, I suspect the common pattern is that most companies grow smoothly enough that they can predict demand in future weeks to within a couple percent, so you can just run at 98% CPU instead of 100%.


The idea that you can run an on-premise cluster and just supplement it with cloud workers is a nice idea in theory. But the reality is that this never happens.

Because the two main reasons that a company would be running an on-premise cluster are (a) security and (b) cost. Both of these conflict with running a hybrid model, because it is extremely complex and expensive to do so.


“Never” is way too strong.

This technique is called “cloud bursting” and is done often enough to have docs on Azure: https://azure.microsoft.com/en-us/resources/cloud-computing-...

More generally, hybrid on/off prem (or multi-cloud) is a top level feature of GCP’s Anthos.

You’re not wrong about your two main reasons, and it’s true that it’s hard to wire up. But you definitely can wire up hybrid workloads in some cases.

The obvious one would be stateless batch compute like image/video processing; “I prefer to re-encode my videos on prem but can burst to the cloud if I get a traffic spike”. This might end up being cheaper than either of all-on-prem with overcapacity, or all off-prem with auto scaling.

I agree it’s not likely to be useful for something where security is your reason for going on-prem.


Good points.

For security concerns, you would probably be selective about which services and workloads you move to cloud compute, possibly preferring to offload internal services like CI before customer data.

Regarding cost, are you implying that there's large monetary cost to preserving the ability to spin up cloud workers in an emergency, or just that once you do it's expensive per day to keep them running? Presumably if you're running an on-prem cluster for cost reasons, you would work to scale it up quickly after an unexpected sustained demand spike, and the cloud supplement would be temporary. If you're getting "unexpected" spikes every week, you need to work harder at forecasting.

In any case, I suspect that today people running on-prem clusters provision them with the intent of being able to handle load growth while keeping weekly average utilization at <60% during the spike. Investing in building infrastructure that keeps working smoothly when weekly average load is 90-100% means you can provision a smaller cluster either way.


Perhaps today this is true.

In 2007 I was working for a webhoster / nascent IaaS provider, and we had workloads doing exactly that.

A specific example would be Celtic FC, who had a baseline of dedicated servers but would scale into our VMs during events, e.g. UEFA Cup games.


Reaching into on-prem hardware from the cloud is a substantial engineering effort, and one that can land you in the news if you do it wrong.


Yep. For example, I use control theory to keep my services just single percents below the maximum throughput achievable on the server. Then I use other tricks (like batch processing) to make the application MORE efficient as traffic increases.

The end effect is I can just back off 1-5% off the maximum throughput and keep the service there running happily.

I would like to use this occasion to point out that all the discussion about unused CPU is at this time completely pointless.

Most services I have seen waste ORDERS of magnitude by being inefficient. Rather than focusing on trying to saturate the CPU and other resources it is almost always better to just make your application more efficient. That last 30% should be a cherry on top.


Batching (and sorting and merging) are things our predecessors in the 1950s and 1960s (and before that, in the card era) had to do to run anything at all. These days they are things that we may do to make sluggish systems snappy.


How do you use control theory?


You know how many systems have "performance" configuration? I use a controller that monitors the state of the system and changes those parameters in real time to keep the system within the desired state as its environment changes.

As a very simplified example, imagine a backend service that is being called by external customers and does not control how those customers call it. I can add a delay to each response, and I can have even something as simple as a PID controller regulate CPU usage by changing that delay. A larger delay will usually cause the clients to slow down their requests (a request usually being the result of a previous request completing). This is a simple and naive example, but it is more or less what I do.

(Of course, in reality, it is much better to just have a backpressure mechanism and whenever possible you should use one rather than try to work around HTTP inadequacy. But you can't always do it, especially if you have a public API.)
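The delay-based regulation idea described above can be sketched in a few lines. This is an illustrative Python sketch, not the commenter's actual system: the 80% CPU setpoint, the gains, and the `response_delay` clamping are all assumptions made up for the example.

```python
class PIDController:
    """Classic PID controller: output = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement, dt):
        error = measurement - self.setpoint   # positive when over target
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Regulate CPU toward 80% by adding delay to responses: more CPU -> more delay,
# which slows clients down, since their next request follows the last response.
controller = PIDController(kp=0.05, ki=0.01, kd=0.0, setpoint=0.80)

def response_delay(current_cpu, dt=1.0):
    # Clamp so we never apply a "negative" delay when below the setpoint.
    return max(0.0, controller.update(current_cpu, dt))
```

In a real service the measured CPU would come from the host, and the controller would need anti-windup and tuning; the point is only that a tiny feedback loop can replace a hand-tuned static rate limit.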

I also typically have lots of other controllers. For example something that regulates memory usage by limiting transactions in flight or something that regulates latency as seen by priority clients or database replication rate/delay, or error rates or a bunch of other parameters.

I also routinely take care of babysitting downstream systems like databases or other APIs. I may have a regulator that will automatically start backing off certain types of traffic in response to increasing error rates or latencies in a downstream system. All this because those downstream systems are usually shit and not designed to deal with overload, and it is easier for me to deal with this proactively than do what everybody else does -- keep bugging those people to fix their issues when they evidently don't know how.


This is extremely interesting!

I have been trying to move away from dumb rate limiting to a more holistic approach that allows us to make smarter decisions with traffic. Your overview made me intrigued.

Do you have any references you like to use? I am looking at the Wikipedia page, but it's so removed from practical aspects.


I don't. There simply isn't any tooling or literature to speak of. I have some experience using control engineering in my electronics projects and that's how I came up with the idea to use it for backend systems. I have researched and developed everything myself. I have used "Modern Control Engineering" by Katsuhiko Ogata, but really, mostly I just learned from the Internet.

My initial motivation was to remove configuration. I have found, historically, that giving people options to configure very complex software more often than not results in problems, especially after the original developers leave. More often than not the new people will not understand the implications of or interactions between various settings, and this will just cause problems. So my aim became to remove any options from the software and make sure it can perform autonomously and recover from a wide range of, possibly unknown, situations. Which is exactly what control engineering is about, if you think about it!


If one day you'll write a blog post / article about what you're doing, it'd be interesting to read :- )

(What if you start collecting email addresses to people who want to read such an article? And if one day you write one, then you can email them? — My email is in my profile, if you'd like to add it to such a list)


Dunno... most people I meet seem to be put off by my software development ideas. I stick to them because they seem to work very well even if it initially creates a lot of friction between me, the team and the management.

Where to start... I think test-driven development and unit testing are not giving the promised value and instead waste time and make software more difficult to refactor; I find functional end-to-end testing much more effective and cost-effective. I think code reviews are bad because they don't deliver on their promised value, and that individual craftsmanship (people's ability to deliver on their own) and pair programming are better. I think microservices are the wrong approach for 99.9% of projects, and I have fixed a bunch of projects by rolling the software into monoliths.

I believe bugs can only be truly reduced by taking responsibility for writing correct code in the first place; anything afterwards is expensive and ineffective (you can only remove bugs that manifest themselves, everything else stays). I don't compile/run my code multiple times a day -- I write it all in one go, sometimes for weeks, then run it. If it works, it means I know what I am doing, and if it doesn't, it is a failure of my process. Where most devs just fix the bug and restart the app, I will start an investigation into why my process failed and how I need to fix it -- NTSB-style.

I believe that nobody understands what Agile is and that the way it is applied is damaging to the software industry. I don't believe in linear development progress -- I design my apps top-down and at the same time program them bottom-up, until top-down and bottom-up meet. I structure my development process around rewriting the software -- I write the first version and then refactor/rewrite to remove any unnecessary complexity until I am happy with it. There is no working software for a long time, and then suddenly it is complete. And when it is complete, there are no more testing stages or bugs to fix -- it is truly complete.

So you see, I am probably too alien a developer to give advice to the general population of developers.

And when I do talk about my ideas, it usually ends in flame wars or drowns after being downvoted to hell, because people tend to downvote anything and everything that does not confirm their existing worldviews.


Beware of I/O, memory usage and cache eviction.

A low-priority background job can issue I/O requests that result in high I/O latencies for everything else (by e.g. making a lot of random seeks).

This is very easy to get when using spinning rust.


If cache eviction and I/O usage are limiting, perhaps a better way to design this system is as a cluster management system that can very quickly change the number of cores allocated to foreground vs background tasks (within less than your latency SLO). That means each core is only doing one kind of work for many milliseconds in a row, so CPU caches remain hot, and I/O can be sorted into separate low- and high-priority queues. I/O blocking is much less of a concern with NVMe drives anyway, since they're intrinsically so parallel.

Memory usage is still a concern, since you do need to have the code and in-memory data for both workloads loaded on some number of your servers in order to rapidly change what tasks they're doing. Physical machines have so much memory today (compared to the code size or working set of most programs) that I suspect this isn't actually a blocker in practice, and it's not necessarily required for every machine in the cluster to have both workloads resident as long as some of them do.

This definitely feels like diving into an alternate universe of cluster management, and I'm not sure how easily Linux gives you tools to do things like this.


A long time ago, I was surprised to discover that a mainframe always runs at 100% CPU usage. Its OS makes sure that multiple VMs, each of which has a long list of real-time and batch jobs, neatly prioritized, keep it well fed. An interactive user needs an answer now, but a clean-up job can literally take weeks to run. When I looked amazed, the operators told me, 'This thing is so expensive -- it cannot be wasted.'


> it would be really nice if load balancers used realtime performance metrics to balance traffic at a millisecond level

With Kubernetes this is trivial. Horizontal Pod Autoscaling integrates with Prometheus, so it can spin up a new instance of your service based on whatever custom metrics you like.

And it is proven to work without requiring some "creative solution" that will be buggier, less secure and inevitably less maintained than something that is industry standard.


I don't think you understood what I was suggesting here; I'm referring to the load-balancing system, not autoscaling. Imagine a cluster with two nodes that gets three requests. A round-robin load balancer will distribute the first and third to node 1, and the second to node 2. However, if request 1 requires a lot of work for some reason (either it's an intrinsically complicated request, or it triggers GC or something), node 1 ends up heavily loaded while node 2 is underutilized. A smarter load balancer could notice that node 1 is still busy and redirect request 3 to the node that has capacity available at the instant it comes in.

Given large enough numbers of small requests, it should all average out in the long run, but that requires over provisioning by enough that servers can still handle their "fair share" of requests even while running GC or otherwise dealing with unusually difficult requests. Realtime smarter load balancing should require less overhead.
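The two-node scenario above is essentially least-outstanding-requests balancing. A minimal sketch, assuming the balancer can track in-flight requests per node (node names and the bookkeeping are illustrative, not any particular load balancer's API):

```python
class LeastLoadedBalancer:
    """Route each request to the node with the fewest in-flight requests,
    instead of blindly rotating round robin."""
    def __init__(self, nodes):
        self.in_flight = {node: 0 for node in nodes}

    def pick(self):
        # Choose the node with the fewest outstanding requests right now.
        node = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[node] += 1
        return node

    def done(self, node):
        self.in_flight[node] -= 1

lb = LeastLoadedBalancer(["node1", "node2"])
r1 = lb.pick()   # node1 (both idle; ties go to the first node)
r2 = lb.pick()   # node2
lb.done(r2)      # request 2 finished quickly; request 1 is still busy (GC, etc.)
r3 = lb.pick()   # node2 again -- where round robin would have sent it to node1
```

Real implementations (e.g. "least connections" modes in common load balancers) layer health checks and latency signals on top, but the core decision is this `min` over live state rather than a fixed rotation.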


Prometheus+Kubernetes needs in the order of 30-60s to scale up, not a good match with GP's "millisecond level" ask.

(scrape interval is 15s, pod creation requires pulling images)


> (scrape interval is 15s, pod creation requires pulling images)

You can reduce the scrape interval, and depending on the cluster provider the image pulling should be extremely fast once cached, but indeed it would be on the order of many seconds rather than milliseconds.


On the one hand, this is somewhat true, but on the other hand, there are lots of efficiencies you can get without trading resiliency, and the author sets up a few false dichotomies here.

The author mentions JSON vs struct serialization as an example, but flatbuffers and protobufs (or any binary protocol format that builds in an ID and a version number) give you the same resiliency benefits mentioned here. You don't have to go all the way to raw structs to gain efficiency over using JSON. Only the last tiny bit of efficiency (the gap between a struct and a flatbuffer) actually comes at a meaningful resiliency cost.
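The "ID and a version number" point can be made concrete with a toy record format. This is an illustrative Python sketch, not flatbuffers' or protobuf's actual wire format; the header layout and field names are made up for the example:

```python
import struct

# Fixed header: type_id (uint16), schema version (uint16), payload length (uint32).
HEADER = struct.Struct("<HHI")

def encode(type_id, version, payload: bytes) -> bytes:
    return HEADER.pack(type_id, version, len(payload)) + payload

def decode(buf: bytes):
    type_id, version, length = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:HEADER.size + length]
    # A reader that sees an unknown type_id or a newer version can skip the
    # record (it knows the length) instead of misinterpreting raw bytes --
    # the resiliency benefit being discussed, at near-raw-struct cost.
    return type_id, version, payload

msg = encode(type_id=7, version=2, payload=b"hello")
```

Raw structs lose exactly this: without the self-describing header, an old reader has no way to detect that the layout changed underneath it.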

The same goes for single-threaded vs multi-threaded services and single-machine vs horizontal scaling. With some thought, you can do all of these things and create highly-scalable systems very efficiently - not quite as efficiently as a single thread, but a lot more efficiently than the standard "web backend" thing of creating 1000 microservices using 10 different databases and 50 caching layers for that single task. That efficiency is available at no cost to resiliency.

The common thread of "web" solutions like microservices and JSON is that they help you save developer time. That is orthogonal to the efficiency vs resiliency tradeoff.


Serialization/deserialization is still quite costly even with protobufs/flatbuffers. This is not the last tiny bit: in some applications, I've seen it take more than 15% of total CPU usage.


In some applications, this is true. I think 15% is pretty extreme, though, unless what you're doing is something like parsing/creating data feeds (where the serialization is the point). In those cases, it's probably a good idea to have your own format. Still, JSON would be a lot worse for these cases than protobufs or flatbuffers.

Also, are you sure you aren't compressing them if you're using 15% CPU?


Yeah, you almost got it :) It's a simple backing store for a few jobs that create data feeds.

Agreed that JSON would be much worse. We have a hard requirement that all services must be Java based, and protobuf outperforms Gson and Jackson by miles.

I have been trying to find the cheapest serialization/deserialization I could find for cache purposes. So far, the best option is a Guava/Caffeine cache, because we skip the serialization completely in this scenario, but so much more costly than having a good external cache.


If you are willing to do some DIY-ing, using finance-like encodings may be for you: See "Simple Binary Encoding" (https://www.fixtrading.org/standards/sbe/). The encoding scheme basically just uses structs with version numbering and IDs. There may be some Java code out there for SBE if you want to pull it off-the-shelf.


Fantastic! Thank you, this is really useful!


This comes up a lot, do you have any benchmarks for protobuf deserialization that demonstrate that it is faster than json for the same structs? Protobuf seems much slower to decode due to the inability to know where an integer ends before one is done decoding it. Benchmarks I could find indicated protobuf decoding is maybe ~5x slower than json, but obviously this varies based on the payload and I'm not sure how much larger json encodings of the same data tend to be.
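The dependency being described comes from protobuf's varint encoding (LEB128-style: 7 data bits per byte, high bit set means "more bytes follow"). A scalar decoder sketch in Python, for illustration only, makes the serial structure visible -- you only learn where the integer ends by inspecting each byte in turn:

```python
def decode_varint(buf, pos=0):
    """Decode one protobuf-style varint starting at buf[pos].
    Returns (value, position after the last byte consumed)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not (byte & 0x80):              # continuation bit clear: done
            return result, pos
        shift += 7

# 300 = 0b1_0010_1100 encodes as two bytes: 0xAC 0x02
value, next_pos = decode_varint(bytes([0xAC, 0x02]))
```

Each iteration's `pos` depends on the previous byte's continuation bit, which is the loop-carried dependency the comment alludes to (though, as a reply below notes, SIMD tricks exist for both varints and JSON tokenizing).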


How would you know where an integer in JSON ends without looking ahead? You won't even know if it is an integer or not. It seems like a more expensive operation than protobuf.


You use about 2 SIMD instructions to compare 16 or 32 or 64 characters against whitespace, commas, and closing brackets, and one of these will be the character after the end of the integer literal. There's no dependency between decoding the integer and tokenizing or decoding the next thing.


You can also use SIMD to parse varints (used by protobuf format).


I would question if the serialisation code speed actually matters.

RAM and CPU are much much faster than an SSD so almost all (98%) of the time will be spent writing to disk.

To speed it up you have to reduce the file size so that it can be written to disk faster. I think the fastest way would be a gzipped JSON file.

You can do all of the converting to JSON and compression while you're waiting for the hard disk to finish writing.


> RAM and CPU are much much faster than an SSD so almost all (98%) of the time will be spent writing to disk.

I wouldn't be so sure about that. Current SSDs can do gigabytes per second of read and write. Sure, this is an order of magnitude slower than theoretical sequential memory throughput, but it might not be slower if you need to do some data processing on the CPU or if you access memory in an inefficient way. There are many ways you can screw it up so that processing becomes the bottleneck.

For example, if you need to compress data, you might already be unable to saturate the SSD (most compression algorithms don't go faster than 1 GB/s, with the possible exception of LZ4).

If you're not careful with heap allocations and you happen to create a lot of tiny objects when deserializing, or you do a lot of pointer chasing when serializing, you may well end up very far from theoretical maximum performance. I've seen systems maxing out at 10 MB/s just due to inefficient serde. Using plain-text formats makes it somewhat worse.


I think this comment assumes very slow disks compared to the disks that are available.


100% is an odd choice for a utilization target.

Just about every service I've ever monitored has had some fairly clear inflection points where higher utilization starts to affect performance. It could be things like more time spent on garbage collection, or just unlucky collisions with async work that make things take longer.


Basic queueing theory says 100% load is not possible with any sort of variance in the arrival rate or service rate.

Realistically, 70% is a good starting point for a max continuous load -- then tune it. You might end up a bit lower or higher.
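The queueing-theory intuition can be put in numbers. For an M/M/1 queue, mean time in the system is W = 1/(mu - lambda), which blows up as utilization rho = lambda/mu approaches 1. A small sketch (the 100 req/s service rate is an arbitrary assumption for illustration):

```python
def mean_time_in_system(lam, mu):
    """M/M/1 mean time in system: W = 1 / (mu - lam)."""
    assert lam < mu, "at rho >= 1 the queue grows without bound"
    return 1.0 / (mu - lam)

mu = 100.0  # server handles 100 req/s on average
for rho in (0.5, 0.7, 0.9, 0.99):
    w = mean_time_in_system(rho * mu, mu)
    print(f"rho={rho:.2f}: mean time in system = {w * 1000:.0f} ms")
# rho=0.50: 20 ms, rho=0.70: 33 ms, rho=0.90: 100 ms, rho=0.99: 1000 ms
```

Going from 70% to 99% utilization costs a 30x increase in mean latency here, which is why 100% is not a usable target for anything with variable arrivals.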


Some of these examples are trivially equivalent to queues and queueing theory tells us that 100% utilisation results in unbounded queue lengths, and is therefore bad. The interesting thing would be to try to see if these other situations that on the surface sort of smell similar (serialisation and distribution systems, for instance) are actually isomorphic to the extent that the same proofs apply.


Now apply this same reasoning to, say, hospital capacity.

Some things need to optimize for the ability to absorb peak load, not steady-state operating costs.


Even hospitals make that trade-off. Every hundred years there are at least a few events that overload steady-state hospitals.

The important part is not building the capacity in, it's remaining flexible and adapting to what happens.


>The important part is not building the capacity in, it's remaining flexible and adapting to what happens.

At some point, flexibility requires extra capacity. You can't improvise when you're spending 100% of your resources on operations.


How is peak load defined in a hospital? (I assume you mean patients. I apologize if that is incorrect.)


> How is peak load defined in a hospital?

Same as everywhere else - largest demand on resources. For a hospital, the resources are staffing and equipment (rather than cpu/storage/ram). An MRI machine is cheapest when you have as many scans as possible for a given level of staffing - reducing the time per scan from 6mins to 5mins means that you can get 20 extra patients scanned per 8 hour shift (with one machine).

In support of the initial commenter's point ("Some things need to optimize for the ability to absorb peak load, not steady-state operating costs"), for hospitals, you need look no further than covid - horrendous health outcomes for routine things, because the capacity was soaked up dealing with respiratory virus. And if you disagree with that (you're wrong), look no further than a strong flu season - staffing capacity that is totally fine for a random august is nowhere near adequate. Or look at Saturday night in any Emerg - there's a huge peak in emergency medical situations from drunk people doing stupid things.


For a public health system you run the stats, using the $ cost of a life (or equivalent for a well life vs. sick life). Sounds sinister but ultimately they have so many $ and need to decide how to optimise that money.


>optimise that money

Perhaps the problem is defining a hospitals utility function as “optimizing money” rather than “optimizing a patient outcome”.

I agree that money is definitely a constraint, but I’m not sure it’s what should be optimized.


You do that for a non-public health system too, but you also figure in margins.


There will be different definitions for related concepts, but I reckon the parent comment is referring to "surge capacity".


>developers or operators can use that slack to step in and handle unexpected load or resolve underlying issues before they become catastrophic or externally visible.

They can use 'slack' to pay off some technical debt that tends to build up when developers are always pushed to devote 100% of their time and effort to new features. There is often no time to go back and fix things that can bite you when the load reaches a certain level.


Yes I code, a few million+ SLOC in geophysics | imaging | GIS | et al, but more and more I consult on policy with companies and local to Federal government in Australia on the back of having an engineering | applied math background.

Agriculture and farmers might seem inefficient, but they add national resiliency.

Pre-COVID meat packing in the USofA may have been hyper-efficient on paper, but it broke hard under stress: too many concentrated assembly lines that put too many people too close together in too few places in total.

Day to day one of my favorite redundancies is the triple bowline for multiple anchors - under load if one anchor fails the knot rebalances load to other anchors and slowly contracts the "dead" loop with no sudden jerk.

[TEXT]: https://www.survivalworld.com/knots/triple-bowline/

[VIDEO]: https://www.youtube.com/watch?v=O81l4ss4Dqk


Many of these examples focus on the wrong kind of efficiency.

There's efficiency from the production perspective -- how large a fraction of time are my resources busy adding value?

Then there's efficiency from a consumption perspective -- how large a fraction of time am I receiving value?

The first kind of efficiency is inflexible, brittle, and associated with generated demand, clearly. What about the second kind? In my experience, properly optimising for consumption-side efficiency leads to more resilience, but I'm willing to be wrong here.


The article starts out with the premise absolutely backwards: efficiency is the exact opposite of constant 100% CPU utilization.

The more efficient the program is, the _less_ CPU resources it will consume to achieve its task.

Especially when that task is serving HTTP requests. Unless you are always serving one million concurrent connections with one server, you should never see chronic 100% CPU utilization.


The 100% CPU utilization target is an odd one because, although it's theoretically more "efficient" to use a system at full capacity, performance tends to suffer and practically makes it unusable (others here have pointed to queueing theory).

We are building a high load, low latency inference system, and we noticed that latency is correlated to CPU usage. Latency grows slowly linearly with CPU usage, up to a point where it starts to grow exponentially. We had a bug where a host was pegged at 100% CPU, and our tail latency suffered greatly, since any request that hit that host was basically toast.

Funnily enough OP could have used examples outside computing where efficiency directly trades off with resiliency. For example, in farming large monocultures are the most efficient, but a single disease can wipe entire plantations. In manufacturing, keeping stock is seen as an inefficiency, but problems in the supply chain can get you in hot water (as COVID made painfully obvious).


The only interesting challenge is non-trivially parallelizable multicore applications that need to share memory across threads.

And the only choice you need to make is arrays of 64-byte structures: C or Java.

In other words, do you need to be able to sleep after a day of launching new features?

The way I tend to do these things is to prototype with Java, and once the concept/protocol is finalized and fossilized, rewrite it in C, at least on the client.

Java is the best language for server platforms, with classloader hot-deployment for sub-second global turnaround and a no-crashes VM+GC.

About services being infinitely accessible: companies that have non-trivial solutions are now moving away from this with queues.

Limit the number of concurrent users and when saturated customers have to wait.

Open-source/source-available is the best way to scale, though. If people are prepared to pay, others will run your service.

You need to allow others to make money, though. I recommend revenue-scaled monthly licence fees recurring on something like Gumroad.

Finally, I'm pretty sure the new 4nm chips will prove to be more fragile than the 14nm stack; only a lot of 100% CPU time will tell.


Maximally efficient is minimally robust.


Catastrophic failure is pretty bad for efficiency. Over any serious time horizon, being maximally efficient means finding the optimal level of robustness, given the likelihoods and consequences of possible failures and the costs involved in preventing or mitigating them.


> Catastrophic failure is pretty bad for efficiency.

For the system, but not necessarily for any individual actor in that system.

There is no inherent force that automatically keeps those incentives aligned.

Everybody knows that the supply chains are brittle. Covid, earthquakes, war, etc. all can disrupt them. Yet, do you see anybody holding inventory? Do you see any diversification occurring (very minimally--and mostly to shift from cheap Chinese labor to cheaper Vietnamese/Indonesian labor)? Do you see anybody buying personal protective equipment domestically in the US (nope, everybody went back to buying cheap Chinese crap)?


The problem is the incentives of the decision makers often don’t align with that time horizon.

Consider a CEO who is rewarded by quarterly outcomes rather than how healthy the company will be in two or three decades. Or a politician who proposes a short-term policy that looks like a short term win but will undermine constituents after they are long out of office.


This also applies to supply chains and economies in the whole.


Indeed, when I saw the headline, I thought this was going to be commenting on the various supply chain crashes triggered by the pandemic's sudden changes, after decades of cleverly squeezing slack out of shipping, ports, warehouses, and logistics coordination.


And biology and ...


However, the efficiency-vs-resiliency comparison as applied to small teams is something I haven't thought about before.

Interesting point about declaring a system wasteful and inefficient without considering what that 'waste' is buying.


And at this point, using an autoscaling solution provided by a big cloud service is ideal.



