
> The explanation is rather simple - hardware is always "on the premise", yours or Amazon's. Someone needs to swap drives, motherboards, man the networking gear, run cables, etc.

> So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.

But I thought that they were paying Softlayer to do that stuff instead of Amazon. They're not doing it themselves - and yet it's still cheaper!



I would like to know the cost calculation after a year or two. With a handful of servers it's easy to get the false impression that HW failures are rare.


Oh, it wasn't just a handful of servers after we finished the migration (we migrated a bit late, IMO, so we had a lot of traffic even back then). And today, with a much larger infrastructure and hardware clusters specifically tailored to our customers' needs, I'm pretty sure the same infrastructure on EC2 would cost more than 2x.

(Update) Re: failures - with ~50 servers we see a hardware issue (a dead disk in a RAID or an ECC memory failure) about once a month or so. None of those failures has caused a single outage so far (RAID and ECC RAM FTW).
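For context, on Linux the correctable/uncorrectable ECC error counters are exposed through the EDAC subsystem in sysfs, so a periodic check can flag a failing DIMM long before it turns into an outage. A minimal Nagios-style sketch (not the poster's actual check; the sysfs paths are standard EDAC ones, the thresholds are made up for illustration):

    #!/usr/bin/env python3
    # Minimal ECC memory check sketch for Linux hosts with EDAC enabled.
    # Sums correctable (ce_count) and uncorrectable (ue_count) error counters
    # across all memory controllers and maps them to Nagios exit codes.
    import glob
    import sys

    WARN_CE = 1   # any correctable errors -> WARNING (illustrative threshold)
    CRIT_UE = 1   # any uncorrectable error -> CRITICAL

    def read_count(path):
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return 0

    ce = sum(read_count(p) for p in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"))
    ue = sum(read_count(p) for p in glob.glob("/sys/devices/system/edac/mc/mc*/ue_count"))

    if ue >= CRIT_UE:
        print(f"CRITICAL: {ue} uncorrectable ECC error(s), {ce} correctable")
        sys.exit(2)
    if ce >= WARN_CE:
        print(f"WARNING: {ce} correctable ECC error(s)")
        sys.exit(1)
    print("OK: no ECC errors reported")
    sys.exit(0)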


I ran several dozen Dell blade enclosures fully maxed out - well over 300 server blades - and in 3 years I had two disk failures, neither of which was critical. Hardware is pretty reliable these days.


How do you monitor HW and network failures, and how do you notify SoftLayer? Is that 1-2 hour replacement time true for all components of your server fleet?


1-2 hours is their new-server provisioning time. For HW issues we use Nagios (which checks RAID health and ECC memory health regularly), and at the moment we just file a ticket with SL about the issue, showing them the output from our monitoring. They react within an hour, and HW replacement is usually performed within a few hours after that (usually limited by our ability to quickly move our load away from a box to let them work on it).
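The post doesn't name the exact RAID check they run, so here is a minimal sketch of what such a Nagios plugin could look like for Linux software RAID (mdadm), parsing /proc/mdstat; a hardware controller would need the vendor CLI instead:

    #!/usr/bin/env python3
    # Minimal Nagios-style RAID health check sketch for Linux software RAID.
    # Parses /proc/mdstat: an underscore in the member map (e.g. [U_]) means
    # a failed or missing disk; recovery/resync means a rebuild is in progress.
    import re
    import sys

    try:
        with open("/proc/mdstat") as f:
            mdstat = f.read()
    except OSError as e:
        print(f"UNKNOWN: cannot read /proc/mdstat: {e}")
        sys.exit(3)

    # Status lines look like "... blocks [2/2] [UU]".
    member_maps = re.findall(r"\[\d+/\d+\]\s+\[([U_]+)\]", mdstat)
    degraded = [m for m in member_maps if "_" in m]

    if degraded:
        print(f"CRITICAL: {len(degraded)} degraded md array(s)")
        sys.exit(2)
    if "recovery" in mdstat or "resync" in mdstat:
        print("WARNING: array rebuild/resync in progress")
        sys.exit(1)
    print(f"OK: {len(member_maps)} md array(s) healthy")
    sys.exit(0)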


HW failures are rare. At least the hardware failures that matter. Disks in a RAID set dying or redundant power supply failures are not critical, and even those are rarer than you would generally expect. With a bit of standardization it's incredibly cheap to keep a pool of spares handy and RMA the failed components at a leisurely pace.

Plus, you're still engineering your applications to be just as fault-tolerant as if they were running in the cloud, right? The only difference is you are not paying the virtualization overhead tax. A single server dying should leave you in a no less redundant state than a single VM dying. They should also be nearly as easy to redeploy.

This is based on my personal experience in datacenters with 5,000-10,000 installed servers. Anything other than a PSU or HDD failure is exceedingly rare.


We have 100 physical servers and hardware failures really are very rare. Very rare.

In fact, over 4 years we have had only 3 hard drives fail and no other hardware failures.


Do you have any plans to replace your hard drives as they get old? Or do you just wait for them to fail?


Generally I'd say the extra cost savings come from the lack of software needed to support the Amazon-style APIs. They may also have made a multi-year commitment, which would further drive the cost down.



