Hacker News

I have managed my own DC and colo'd. Some examples of issues that occurred:

- Colo: unauthorized personnel tripped over a cable, killing the network to a lot of our servers. This was in a huge colo hosting facility that was home to some Fortune 500 companies' servers.

- Own datacenter: both local internet providers had planned maintenance outages on the same day. This isn't "mowing the lawn in front of your DC"; this is the most basic utility after power that needs to be available for your DC to work.

- Colo: hard drive failure after hours. Colo security staff wouldn't let us disassemble hardware that wasn't ours, so we had to wait 6hrs for a technician to show up to pop and swap a drive (this despite a 1hr incident-response guarantee in our contract with the same "highly reliable" colo; we got money back per the contract, but didn't get back the customers we lost, even after apologetic/refunding communications to them).

- Own datacenter, third-party NAS appliance professionally installed by a vendor: persistent performance issues and service dropouts, eventually traced to the "professional installation" itself. The appliance sat at the bottom of a rack with a large, detachable rear panel mostly covering its fan intakes, causing constant thermal throttling.

- Colo (rented hardware, managed OS install): When the business started, our sysadmins/DBAs (me and one other guy) were primarily experienced with RHEL and old school init. The colo only provisioned Debian/systemd servers. We learned, but it slowed us down for a week or so. Sure, it was only an extra few minutes per task, but it added up.
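To illustrate the per-task friction described above, here is a sketch of how common admin tasks differ between the two setups. These are the standard commands for each distro family, not ones taken from the original post:

```shell
# RHEL 6-era, old-school SysV init (service name "httpd" for Apache):
service httpd restart        # restart the service now
chkconfig httpd on           # enable it at boot

# Debian with systemd (same software is packaged as "apache2"):
systemctl restart apache2    # restart the service now
systemctl enable apache2     # enable it at boot
systemctl status apache2     # unit state plus recent journal entries
```

Each individual translation is trivial, but service names, log locations (`/var/log/messages` vs. `journalctl`), and muscle memory all differ, which is where the "extra few minutes per task" adds up.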

- Own datacenter: management needed us to move into a new server room because of an ending lease, with a hard deadline. The air conditioning installation vendor showed up two weeks late; we had no cooling at all when we needed to cut over, causing days of service interruptions and downtime.

I have also managed cloud services. Some examples of issues that occurred:

- Amazon's S3-gate. That sucked. When it happened, we were able to email our customers a copy of Amazon's status page, plus links to reporting on exactly how widespread the issue was. The impact was nearly identical to the colo hard-drive-failure incident above (it was the same service, migrated to AWS/S3), but we didn't lose nearly as many customers.

- The DNS DDoS that affected the East Coast of the US earlier this year. Same kind of service interruption, same communication. We even got replies back from customers saying "I couldn't get to Reddit either; I figured you guys might be in the same boat".

I'm not a cloud evangelist. I think that there are very good reasons to host entirely locally, or colocate/rent, or anything in between. I do, however, think that businesses, especially small ones that are not typical software-centric startups, massively and regularly underestimate both the initial and ongoing costs of running their own infrastructure.

There are significant technical benefits to being cloud-hosted (these benefits also apply, only a little less potently, to fully managed hosting operations a la Rackspace), but people often miss the political and financial benefits. The political benefits are things like "our customers are less pissed because widespread issues with AWS probably affected them in other ways that day as well, so not as many people will knee-jerk blame us". The financial benefits are primarily that there are fewer moments of the "oh shit, setting these servers up ended up costing way more than we thought it would" and "we thought this would take a day to turn up; it actually took two weeks" varieties. Those things still happen with cloud services, but much less often.



To add fuel to your fire, I've personally supported the DC infrastructure at multiple companies, and let me tell you, it's no picnic. The time and money we spent maintaining the systems were outrageous. Getting vendors to come fix their systems, or tracking down licenses for software, took days and days. Even finding replacement hard drives that were compatible with our 3-year-old servers was a giant task that consumed several team members for hours at a time.

These are the things that DC/colo people don't talk about and conveniently forget when it comes to cloud systems. I would gladly give up that "control" for a steadier, more predictable future.


Interestingly, these examples only cover colo and your own data centre. What about dedicated servers? You get a similar SLA as with instances in the cloud, at substantially lower prices.



