As a maintainer: some issues take longer to triage than others. Especially if they are not CRITICAL, and there's a huge holiday season in the midst of it. :)

I know I have been involved in a couple which took time to agree on a "best" solution and to find people to tackle them.


There's a perception that this is true, and like all perceptions it is based in reality, but it is not the reality itself.

We have plenty of data that shows that people DO get promoted at all levels based on OSS work.


Define "killing kubernetes"? It's still pretty successful and the adoption hasn't slowed in any way I can measure. I promise you that some site you used TODAY is running, at least part of it, on Kubernetes.

Google employees regularly get promoted, at all levels, based on their OSS work - Kubernetes and other projects. We have dozens of people who work on Kubernetes, in one area or another, at varying degrees of depth. Is that "killing" the project?

Of course it is never ENOUGH. I'd happily consume hundreds more people. :)


This was, literally, one of the arguments for building and releasing Kubernetes. The rise of Hadoop made it much harder to justify MapReduce being different.

If we just talked about Borg, but didn't ship code, someone else might have set the agenda, rather than Kube.


(Building on my own tweets)

Autopilot and Fargate are VERY different solutions to similar problem statements.

Autopilot puts compatibility & transparency at the front. It IS GKE. It is integrated in all the same ways GKE is integrated. There's no black box between you and Kubernetes, but you are absolved of the need to manage nodes, which most people REALLY don't want to care about.

Like it or not, nodes are part of the k8s API in many ways. Rather than swim upstream against that, I think Autopilot strikes a very good balance.


Since this started by citing me, I feel somewhat obligated to defend my guidance.

I stand by it.

In an ideal world where apps are totally regular and load is equally balanced and every request is equally expensive and libraries don't spawn threads, sure. Maybe it's fine to use limits. My experience, on the other hand, says that most apps are NOT regular, load-balancers sometimes don't balance, and the real costs of queries are often unpredictable.

This is not to say that everyone should set their requests to `1m` and cross their fingers.

If you want to do it scientifically:

Benchmark your app under a load that represents the high end of reality. If you are preparing for BFCM (Black Friday / Cyber Monday), triple that.

For these benchmarks, set CPU request = limit.
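
To make that concrete, here's a minimal sketch of what a benchmark pod might look like (the name, image, and numbers are all hypothetical; start wherever your gut says and adjust):

```yaml
# Benchmarking: request == limit pins the container to a fixed CPU
# allocation, so runs are comparable to each other.
apiVersion: v1
kind: Pod
metadata:
  name: bench-app               # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:latest   # hypothetical image
    resources:
      requests:
        cpu: "2"                # vary this between runs...
      limits:
        cpu: "2"                # ...keeping the limit equal to it
```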

Measure the critical indicators. Vary the CPU request (and limit) up or down until the indicators are where you want them (e.g. p95 latency < 100ms).

If you provision too much CPU you will waste it. Maybe nobody cares about p95 @50ms vs @100ms. If you provision too little CPU, you won't meet your SLO under load.

Now you can ask: How much do I trust that benchmark? The truth is that accurate benchmarking is DAMN hard. However hard you think it is, it's way harder than that. Even within Google we only have a few apps that we REALLY trust the benchmarks on.

This is where I say to remove (or boost) the CPU limit. It's not going to change the scheduling or feasibility. If you don't use it, it doesn't cost you anything. If you DO use it, it was either idle or you stole it from someone else who was borrowing it anyway.
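
As a sketch (same hypothetical app as above), the production spec keeps the request you landed on and simply omits the CPU limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prod-app                # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:latest   # hypothetical image
    resources:
      requests:
        cpu: "2"                # the value your benchmark settled on
      # no cpu limit: under a spike, the container can borrow idle CPU
```

The request alone drives scheduling and your guaranteed share; dropping the limit only changes what happens when spare cycles exist.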

When you take that unexpected spike - some query-of-doom or handling more load than expected or ... whatever - one of two things happens. Either you have extra CPU you can use, or you don't. When you set CPU limits you remove one of those options.

As for HPA and VPA - sure, great, use them. We use them a LOT inside Google. But they don't act instantly - certainly not on the timescale of seconds. Why do you want a "brick-wall" at the end of your runway?
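
For reference, a minimal HPA sketch (autoscaling/v2, targeting a hypothetical Deployment named `app`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                  # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # percent of the CPU *request*
```

Note that the utilization target is a percentage of the request, which is one more reason the request has to be honest. And scale-ups still take tens of seconds at best; burst headroom is what covers that gap.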

What's the flip-side of this? Well, if you are wildly off in your request, or if you don't re-run your benchmarks periodically, you can come to depend on the "extra". One day that extra won't be there, and your SLOs will be demolished.

Lastly, if you are REALLY sophisticated, you can collect stats and build a model of how much CPU is "idle" at any given time, on average. That's paid-for and not-used. You can statistically over-commit your machines by lowering requests, packing a bit more work onto the node, and relying on your stats to maintain your SLO. This works best when your various workloads are very uncorrelated :)

TL;DR burstable CPU is a safety net. It has risks and requires some discipline to use properly, but for most users (even at Google) it is better than the alternative. But don't take it for granted!


A free (zonal) cluster per account, regardless of size, should cover a lot of this, no?


Great point! Didn't catch that one. I'll edit my comment.


Those empty clusters that you get for free cost Google money. Perhaps it never should have been free, because that skewed incentives towards models like this.


Unfortunately, even if they switch to dynamically started clusters, the latency of spinning up a new cluster is much higher than the latency of adding a bunch of preemptible nodes to an existing node pool :/


Google are (were) not the only ones offering this free control plane model, though. My DigitalOcean DOk8s managed clusters tend toward instability when used with node pools that are too small. (I don't know why that is, but it seems like a good way to make sure I pay attention to the workloads and also spend at least $20/mo for each cluster I run with them.)

It will be interesting in any case to see if DigitalOcean and Azure are going to follow suit! I'd be very surprised if they do (but I've also been wrong before, recently too).


The term is "loss leader." GKE provides the manager node, and cluster management so that we don't have to. And in exchange you sell more compute, storage, network, and app services. This is some ex-Oracle, "what can we do to meet growth objectives," "how can we tax the people who we own" thinking. They're customers, not assets Tim. Your cloud portability play should be the last project to jerk them around on.


Keep in mind that cluster management was a paid feature in the original GKE. GCP only stopped billing for cluster management when EKS released free cluster management.


When did EKS release free cluster management?



Honestly, I understand the hard work it takes to manage all the clusters, but this was a total bait and switch, and it hurts Google Cloud's reputation with everyone. Telling us to DIY because we cannot pay $71 just sounds like something someone who works at Google would say - and you do work at Google.

The sentiment among my clients was that Google Cloud was a great choice because of the security and expertise behind GKE. It was also free!

Meanwhile, in the back of my head I've always had this fear, because of your reputation, that you do not keep your promises and that you do not care about your users. Because of this fear, we have tried to steer every infrastructure decision away from Google-managed services, even when using one would have been easier short-term.

For the product I'm working on, we decided to use Kubernetes just in case you baited and switched us, given the reputation you have. In terms of monitoring, we really wanted to use Stackdriver, but now we're 100% using fluent-bit + prometheus + loki + grafana. It's the only way to protect ourselves from your reputation, which is now becoming a reality.

So yeah, this is pretty sad and a bad decision. You should have priced GKE at $70/month to begin with, and we would have been fine with it. Now we're (actually) looking at EKS, since Amazon doesn't seem to have this reputation and you've spooked us. We never would have thought about using any other provider until today.


I understand the emotional response here, but I don't think it's rational. GKE has to work as a business, or else the whole thing is in trouble.

I think GKE provides tons of value, but people tend to under-estimate it. In order to keep providing that value, we need to make sure it is sustainable.

I'm really, truly sad that you perceive it as bait-and-switch, but I disagree with that characterization. If you want to move off GKE, I'll go out of my way to help you, but I urge you to take a big-picture look at the TCO.


To be fair, it's unusual for a product at this scale to go from free to paid. It's also unusual for it to happen to a product which already went from paid to free once before.

I don't agree with the parent that it's a bait-and-switch, but I also don't think what's happening is an emotional response. For many people and companies, free clusters have been a feature of Google Cloud. Making them a paid feature completely changes the dynamic.

It's an unexpected announcement that will further sour sentiment about Google as a company. It's really hard to build trust in this industry, and it's really easy to lose it. Google has this thing about announcing changes that blow up negatively on HN, and it could learn from this.

(For the record, I'm a big fan of Kubernetes, and I like GKE a lot.)


This kind of mentality is why Google is struggling. You forget that your customers are human and make emotion-driven decisions. This price increase proves that you are not making sustainable long-term decisions, and that you are willing to dump the cost of that mistake on your customers.

We already don't trust Google to provide long-term, stable, reliable infrastructure and each time something like this happens, we become more convinced that Google isn't trustworthy.


I think part of the optics issue is that your peers seem to be offering similar services for free while being sustainable.


EKS has always had a fee.

AKS, well, I don't have any insight into their business, but I have my suspicions.


