
> The explanation is rather simple - hardware is always "on the premise", yours or Amazon's. Someone needs to swap drives, motherboards, man the networking gear, run cables, etc.

> So you're paying Amazon to do the same work you would do otherwise - only you're subject to their rules and procedures and Amazon being a profitable business needs to mark their services up.

But I thought that they were paying Softlayer to do that stuff instead of Amazon. They're not doing it themselves - and yet it's still cheaper!



I would like to know the cost calculation after a year or two. With a handful of servers it's easy to get the false impression that HW failures are rare.


Oh, it wasn't just a handful of servers after we finished the migration (we migrated a bit late, IMO, so we had a lot of traffic even back then). And today, with a much larger infrastructure and hardware clusters specifically tailored to our customers' needs, I'm pretty sure the same infrastructure on EC2 would cost more than 2x.

(Update) Re: failures - with ~50 servers we see a hardware issue (a dead disk in a RAID or an ECC memory failure) about once a month or so. None of those failures has caused a single outage so far (RAID and ECC RAM FTW).
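For context, on Linux the correctable/uncorrectable ECC error counters are exposed through the EDAC subsystem in sysfs, so a periodic check can flag a failing DIMM long before it turns into an outage. A minimal Nagios-style sketch (not the poster's actual check; the sysfs paths are standard EDAC ones, the thresholds are made up for illustration):

    #!/usr/bin/env python3
    # Minimal ECC memory check sketch for Linux hosts with EDAC enabled.
    # Sums correctable (ce_count) and uncorrectable (ue_count) error counters
    # across all memory controllers and maps them to Nagios exit codes.
    import glob
    import sys

    WARN_CE = 1   # any correctable errors -> WARNING (illustrative threshold)
    CRIT_UE = 1   # any uncorrectable error -> CRITICAL

    def read_count(path):
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return 0

    ce = sum(read_count(p) for p in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"))
    ue = sum(read_count(p) for p in glob.glob("/sys/devices/system/edac/mc/mc*/ue_count"))

    if ue >= CRIT_UE:
        print(f"CRITICAL: {ue} uncorrectable ECC error(s), {ce} correctable")
        sys.exit(2)
    if ce >= WARN_CE:
        print(f"WARNING: {ce} correctable ECC error(s)")
        sys.exit(1)
    print("OK: no ECC errors reported")
    sys.exit(0)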


I ran several dozen Dell blade enclosures fully maxed out - well over 300 server blades - and in 3 years I had two disk failures, neither of which was critical. Hardware is pretty reliable these days.


How do you monitor HW and network failures, and how do you notify SoftLayer? Is that 1-2 hour replacement time true for all components of your server fleet?


1-2 hours is their new-server provisioning time. For HW issues we use Nagios (which checks RAID health and ECC memory health regularly), and at the moment we just file a ticket with SL about the issue, showing them the output from our monitoring. They react within an hour, and HW replacement is usually performed within a few hours after that (usually limited by our ability to quickly move our load away from a box to let them work on it).
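The post doesn't name the exact RAID check they run, so here is a minimal sketch of what such a Nagios plugin could look like for Linux software RAID (mdadm), parsing /proc/mdstat; a hardware controller would need the vendor CLI instead:

    #!/usr/bin/env python3
    # Minimal Nagios-style RAID health check sketch for Linux software RAID.
    # Parses /proc/mdstat: an underscore in the member map (e.g. [U_]) means
    # a failed or missing disk; recovery/resync means a rebuild is in progress.
    import re
    import sys

    try:
        with open("/proc/mdstat") as f:
            mdstat = f.read()
    except OSError as e:
        print(f"UNKNOWN: cannot read /proc/mdstat: {e}")
        sys.exit(3)

    # Status lines look like "... blocks [2/2] [UU]".
    member_maps = re.findall(r"\[\d+/\d+\]\s+\[([U_]+)\]", mdstat)
    degraded = [m for m in member_maps if "_" in m]

    if degraded:
        print(f"CRITICAL: {len(degraded)} degraded md array(s)")
        sys.exit(2)
    if "recovery" in mdstat or "resync" in mdstat:
        print("WARNING: array rebuild/resync in progress")
        sys.exit(1)
    print(f"OK: {len(member_maps)} md array(s) healthy")
    sys.exit(0)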


HW failures are rare. At least the hardware failures that matter. Disks in a RAID set dying or redundant power supply failures are not critical, and even those are rarer than you would generally expect. With a bit of standardization it's incredibly cheap to keep a pool of spares handy and RMA the failed components at a leisurely pace.

Plus, you're still engineering your applications to be just as fault-tolerant as if they were running in the cloud, right? The only difference is you are not paying the virtualization overhead tax. A single server dying should leave you in a no less redundant state than a single VM dying. They should also be nearly as easy to redeploy.

This is based on my personal experience in datacenters with 5,000-10,000 installed servers. Anything other than a PSU or HDD failure is exceedingly rare.


We have 100 physical servers and hardware failures really are very rare. Very rare.

In fact, over 4 years we have had only 3 hard drives fail and no other hardware failures.


Do you have any plans to replace your hard drives as they get old? Or do you just wait for them to fail?


Generally I'd say the extra cost savings come from the lack of software needed to support the Amazon-style APIs. They may also have made a multi-year commitment, which would further drive the cost down.



