My take at making AWS EC2 cheaper by automating SPOT instances with AutoScaling

vgt · on April 21, 2016

I'm going to plug Google Cloud's Preemptible VMs as a simpler alternative to Spot Instances:

- Preemptible VMs are sold at a fixed 70% off discount, removing pricing volatility entirely

- Google Cloud's Compute Engine has far fewer VM types, thus making it much easier to construct the resources you want (exception being GPUs, you don't need a "network optimized" instance to get fast network, nor you need a "storage optimized" instance to get fast/large storage - these things are modular on GCE).

(disc: work on Google Cloud)

gcr · on April 21, 2016

I'm still waiting for any serious competitor to come along with GPU instances. Google Cloud doesn't have them, Azure still hasn't made them available despite announcing them over a year ago, and other cloud providers make you call a salesperson to get a quote on a service that's much more expensive than what you could run yourself with a trip down to Best Buy.

Amazon is still eating everyone's lunch on the GPU front.

vgt · on April 21, 2016

Good point on IaaS front.

One major use case for GPUs is ML. Google Cloud externalizes its ML through serverless APIs. At that point what matters is your ability to derive value from ML, rather than having exposure to building blocks.

lmeyerov · on April 21, 2016

That's an anti-competitive position for an IaaS to take. "We already compete with your kind of software here, so we won't sell you the hardware"?

I work on some analytics software that includes GPU ML algorithms & non-ML GPU algorithms, neither of which Google makes. Our peers do a lot of visual computing in AWS. Really weird thing to hear from a Google rep.

boulos · on April 21, 2016

Please don't take it that way, Leo! How about:

"Yes, we know! Sorry! If you just need some ML things and can use the Cloud ML or Cloud Vision services we've got something to tide you over.

If not, we're always excited to get direct 'I want to do X' feedback that we can translate into 'Customers are demanding Y, and willing to pay Z'".

Please don't take the comment to mean we're trying to box people out of this space. We're not. We do think (most? many?) people don't want to actually roll their own, but we love everyone.

Disclosure: I work on Compute Engine.

vgt · on April 21, 2016

what he said :)

merb · on April 21, 2016

Actually it would be great if your lower instances would have more memory. For low cost projects (small companies or society club) it's really odd to pay the server on something like ovh.de (more memory cheaper, nearly same performance) and the storage for GCE or AWS. I mean yeah I just realized that GCE could be cheap, still somehow the Micro instance could've had a little bit more memory. And small is already out of budget. I mean something between Small and Micro would be great, like 1.2 GB Memory with a price of 6.6 USD (which would be 7 USD with a 10 GB persistent disk).

boulos · on April 21, 2016

Does this mean your target budget is $10/month? (I'm just asking for clarification, and to understand how you think about it).

Second, what are you looking to run? A Java-based web app? Does App Engine's free tier not serve you better?

Finallty, and it's perhaps poor form to point this out, but we (and AWS) will end up blowing out your budget once you include networking while OVH/Hetzner/Digital Ocean won't. Compute Engine isn't a VPS, we're giving you a small slice of a machine but it's connected to a crazy awesome network that we bill for by the byte. When you compare that to a bundled, heavily-overcommitted across tons of customers VPS networking offering the dollars don't work out.

Disclosure: I work on Compute Engine and care a lot about our pricing.

merb · on April 21, 2016

Yes the target budget is max 10€/month (so a little bit more than $USD) It's a Java8 Play Framework application currently the App Engine doesn't support that and I had problems setting up the new Engine since my project uses sbt and I didn't had "enough" time to invest. actually the application won't need that much network. it needs 'nearly infinitiv' storage (non durable) and sometimes use preemtive vms to convert them (this could go over budget, but thats ok).

it's just that I need at least one vm that serves the web, this will upload the images to the object storage then it will fire premeetive vms and converts them, this will put the images in the real object storage and calls the always on vm to insert something in the database (i.e. image path) actually the one vm needs to contain a small database (we are talking about mb so memory isn't an issue) and a java app. so as said micro is too small and small is too big for that always on vm.

my current planned setup is ovh instance -> gcloud object storage -> preemtive vm (if there is one) -> gcloud object storage - ovh instance

btw. I made some calculations and with the 3.4 € from ovh and somtimes preemtive vms and something like 30g/10g object storage (which hopefully will raise, but than again no problem on the costs) will cost us less than 7€ + Domain (which is billed per year) btw. the preemtive vms runs something like 48 hours per month (mostly less since the most images will started between november < - > february)

Edit: btw i wanted to use the datastore (database) aswell, but I couldn't activate it in a project without a app engine.

_oe2s · on April 21, 2016

Hopefully this comment doesn't sound too stupid but is it at all possible (or likely to be implemented) to get a worse network on GCE and pay less for it?

I ask because I don't need a CDN level network like GCE provides but I do need a ton of compute and a ton of bandwidth for batch data processing on data hosted at another provider (and not a candidate for GCS). A pre-emptible network offering would open up the way for me to use compute resources, which is surely a good thing.

boulos · on April 21, 2016

Not stupid at all! A cheaper, lesser networking offering is a common request, so you're not alone.

Can you describe more about the "data hosted at another provider"? Would one of our interconnect options (https://cloud.google.com/interconnect/) help? If it's at AWS would a pair of interconnect options help (DirectConnect to Equinix in VA, then Carrier Interconnect to us).

_oe2s · on April 21, 2016

Just to say it upfront, I'm nothing like big enough for a lot of the options you have available.

> Can you describe more about the "data hosted at another provider"?

It's all at OVH and Hetzner. They offer very cheap storage and bandwidth.

> interconnect

I'm not big enough for those options and regardless, all of your peering/interconnect options are CDN priced for inter-region and only slightly better for intra-region. They just don't make sense.

Say I want to store 100TB of data and pump it through some kind of processing pipeline, outputting the same amount of data. At OVH I pay ~$0.0075/GB on storage coming to ~$1.5k/month (since the processing results in double storage). Say my processing is light on memory and can run at 1MB/s/CPU and I want to be finished in a day. I need 1150 CPU cores for 24h and at OVH that would cost me ~$1300 and bandwidth is free and unlimited. At GCE I can run on pre-emptible instances and pay ~$284 for the compute but I have to pay for 100TB of premium egress bandwidth I don't need, which adds 100 * 1000 * 0.04 (generously assuming I can get peering/interconnect) which is $4000. GCE is absolutely unusable for jobs like this in its current state.

Sure I could move data to GCS but in that case my $1.5k bill for storage turns into a $5.2k bill per month.

merb · on April 21, 2016

actually i would use the managed vm's or 'App Engine Flexible Environment' but actually it's not possible inside the EU. I tested that and I get an error on deployment.

However after creating a new project everything in gcloud started to work. Does the App Engine free tier apply to 'App Engine Flexible Environment' ? That would be great since that would make the site really really cheap.

kiwidrew · on April 21, 2016

Why would you bother going to all this trouble when you could just use Amazon's official Spot Fleet service/API?

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-flee...

dmourati · on April 21, 2016

He started before the feature was released and acknowledges they came up with an in-product solution:

"and it seems they now have a full fledged solution for the problem, based on pretty much a reimplementation of AutoScaling, using machine learning and with a beautiful UI and they are really successful with it. Funnily enough, they even contacted me to sell that solution to my company and we are seriously evaluating it"

chanakya · on April 21, 2016

He's talking about Spotinst.com, not Amazon's fleet solution.

kiwidrew · on April 21, 2016

Ah, okay, I missed that part!

alien_ · on April 21, 2016

Yes, but I actually did start a few weeks before the spot fleet was launched.

The problem with the spot fleet 1) it's kind of awkward to use 2) it has statically defined capacity so you can't scale it 3) it has a static bid price, so if at some point your are outbid on all the group's bids, you end up with no capacity 4) among other things it lacks integration with the ELB so you can't really use it for so many use cases.

My solution is simpler, better integrated with the rest of AWS and more resilient and once I iron out the bugs and get it production-ready, it should be a better choice.

falsedan · on April 21, 2016

Spot prices can spike above on-demand prices for some instance types (e.g. g2.2xl). If you wanted to maintain a certain capacity of homogeneous machines, you'd have to notice the failed spot instance request & increase the size of your on-demand ASG to compensate.

asteadman · on April 21, 2016

Can anyone comment on the crazy spot prices for g2.8xl ? they spike really high (they were $26/hr earlier? 10x the on-demand price. I'm guessing someone with enough market share has a job that doesn't want to be interrupted and they bid the spot price up to 10xon-demand, which seems ridiculous since presumably they could arbitrage /w an on-demand, but i guess that once they start the job they don't want to be interrupted? i'm not exactly familiar with the intricacies of the ec2 spot market. ). Also, are they any good for ML? I'm between machines ATM and would like to experiment with some deep CNN for face recognition. At $2.6/hr for on-demand its a bit more than I would like, but if the spot price was less i'd consider it.

dharma1 · on April 21, 2016

I used them for a while for ML, works ok. I also noticed the these spot price spikes with both g2.8xl and g2.2xl. Not really sure why someone (or multiple players, since the prices spike) have set their max bid so high, surely it's worth running an on-demand instance at those prices.

Ended up going between two regions to avoid them but it's some hassle, just got a gtx970 in the end.

IanCal · on April 21, 2016

Can you checkpoint your work? Most libs I've used support that generally. That way you only lose a bit of time if kicked off, and you don't pay for the hour you get kicked off so it's even cheaper.

ecesena · on April 21, 2016

One issue that we had in the past was the following - hope this is resolved now.

Say you have an autoscaling group spanned across zones A and B, and say you have 1 machine in zone A and 1 in machine B.

Now, the price in zone A goes up, and your machine in zone A dies.

The issue (bug?) was that the autoscaling group was trying to re-instantiate a new VM in zone A. Of course, since the price was high, the new VM was basically immediately dying. And so on.

Edit: issue apart, it's a great way that can save money, especially if you have a group of VMs whose computation can be interrupted/restarted relatively cheaply.

alien_ · on April 21, 2016

With my approach the AutoScaling group would always replace failed instances with the on-demand ones identical to those initially defined on the group's launch configuration.

All I do is I later attempt to replace them with whatever I can buy from the spot market.

On the other hand, the spot bidding implemented out of the box in AutoScaling will fail if you are outbid in all Availability Zones at the same time, since it doesn't fall back to on-demand instances. I've seen people often use a second on-demand AutoScaling group that would scale out when you get outbid on the spot one, but then you have a problem defining scaling policies so that they can scale nicely, and/or shifting the capacity between them. Someone had a nice talk at re:invent about how they do all that.

samlin86 · on April 21, 2016

I agree that was an annoying issue. They've recently changed it though, such that it will retarget unfulfilled bids for spot instances in a different zone in order to reach your desired capacity. Win!

See: https://forums.aws.amazon.com/ann.jspa?annID=3647

flaviotruzzi · on April 21, 2016

Looks like https://github.com/chaordic/tiopatinhas

alien_ · on April 21, 2016

I've looked in a bit more detail and it seems to attach the nodes to the ELB used by the AutoScaling group, while I am attaching them to the group itself, which indirectly adds them to the load balancer.

I'm curious how does that tool handle the scaling of the group, since the nodes are actually outside the group and can't contribute to group-wide metrics like average CPU usage, often used for scaling out.

alien_ · on April 21, 2016

I didn't know about that one, indeed, it seems pretty similar.

voltagex_ · on April 21, 2016

I've got https://github.com/voltagex/junkcode/tree/master/Python/spot... and https://github.com/voltagex/junkcode/tree/master/CSharp/spot... but I was never really happy enough with either to take them past my junkcode folder.

If I could make the Python one work a bit better, I'd probably use it with Ansible - http://snappishproductions.com/2014/03/24/Spot-Instances-Wit...

x5n1 · on April 21, 2016

Funny Amazon acquired a startup for millions that did just this sort of thing.

voltagex_ · on April 21, 2016

Who knows, maybe my junk code is gold!

boulos · on April 21, 2016

Why do only update / rebalance every 30 minutes? Is that because of the per-hour billing (so replacing an on-demand one only part way through its life is a mistake). If so, it still seems like you'd want to inspect every 5 minutes or so (keep on-demand until it reaches > 60-frequency).

Disclosure: I work on Compute Engine (and launched our preemptible VMs product).

alien_ · on April 21, 2016

It's because so far I am only considering replacing the nodes slowly, one at a time, mostly in order not to hit any soft limits that may be defined on the account(like total number of instances), but also in order to give the user the chance to stop it in case things go south for whatever reason during the evaluation, since as I warned, this thing is likely full of bugs at this point.