"by nbashaw 28 minutes ago" I thought for sure I'd have missed it and this would...

chaz · on Jan 31, 2013

eBay's outage in 1999 was for 22 hours [1]. That painful experience completely changed their entire internal engineering process.

[1] http://www.internetnews.com/ec-news/article.php/137251/Cost+...

_euac · on Jan 31, 2013

12:18 PST, still down. Someone must be getting his ass chewed about this.

garretruh · on Jan 31, 2013

Fired is more like it. Though I doubt Jeff Bezos is the kind of guy to do something like that.

potatolicious · on Jan 31, 2013

I used to work for Amazon. Unless this was willful I doubt anyone is going to get fired for it. After all, you'd be firing the guy who's least likely to make such a mistake ever again.

mattmaroon · on Jan 31, 2013

It's so silly to fire people for mistakes that no organization that exerts any mental capacity toward human resources would do such a thing. Even Apple didn't fire the guy who left a prototype at the bar.

In a factory in 1965, maybe, but no good employer is going to fire someone for making a mistake, no matter how costly.

stcredzero · on Feb 1, 2013

> In a factory in 1965, maybe

Depends on the factory, and I doubt it's much more or less likely in the 21st century than it was in 1965. In all times and places, you have enlightened and unenlightened people. In all times and places you have good and bad leaders.

"In every time, in every place, the deeds of men remain the same."

scaphandre · on Feb 1, 2013

> "In every time, in every place, the deeds of men remain the same."

Citation needed. I'm pretty sure there are fewer war deaths and fewer kittens burned for entertainment per head now than 200 years ago.

stcredzero · on Feb 1, 2013

It's a quote. I leave that as homework.

Yes, there is less bad stuff, but the quote says nothing of frequency.

kami8845 · on Jan 31, 2013

The story of the plane servicer who never mis-serviced a plane again after almost killing the pilot he worked for is cute, but screwing up big-time doesn't turn a (presumably) sloppy engineer into one that never messes up again.

potatolicious · on Jan 31, 2013

I'm of the mind right now that the word "cute" really shouldn't be used in any other context than physical description. It's just condescending and rude.

The problem with your attitude is that it's based upon a premise that is almost never true: that screwups are caused by incompetence, and that they have singular (or overwhelmingly singular) sources.

Neither of these assumptions bears out in reality, and certainly not in our industry.

The vast majority of downtime events trace back to systemic failures, not a freak event, and are more often catalyzed by momentary lapses than long-standing incompetence. Do we penalize the tech who clicked the wrong link on a dashboard, or the guy who wrote the dashboard such that a critical action contains no safeties or confirmations? Or do we penalize the manager for not having any established documentation on protocols surrounding triggering critical actions?

The only reasonable stance here is to collectively take responsibility for the failure. It may feel good to hang someone out to dry, but in all likelihood their failure was only the final link in a long chain of failures that extended well beyond themselves.

You root cause what led to the event (going deeper than "a tech clicked on the wrong thing"), and you fix the root cause, and you move on.
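As a rough illustration of the dashboard point above, here is a minimal sketch of a "safety" on a critical action: the caller has to type the action's name before it runs. Everything here (`run_critical_action`, `ConfirmationRequired`, the `restart-fleet` name) is hypothetical and not taken from any real dashboard or from Amazon's tooling.

```python
class ConfirmationRequired(Exception):
    """Raised when a critical action is triggered without explicit confirmation."""


def run_critical_action(name, action, typed_confirmation=None):
    """Run `action` only if the operator typed the action's name exactly.

    `name` and `typed_confirmation` are hypothetical parameters for this sketch.
    A stray click that supplies no confirmation fails loudly instead of running.
    """
    if typed_confirmation != name:
        raise ConfirmationRequired(
            f"Type {name!r} to confirm; got {typed_confirmation!r}"
        )
    return action()


if __name__ == "__main__":
    try:
        # Simulates the accidental click: no confirmation text supplied.
        run_critical_action("restart-fleet", lambda: "restarted")
    except ConfirmationRequired as exc:
        print("blocked:", exc)

    # Simulates the deliberate path: the operator typed the action name.
    print(run_critical_action("restart-fleet", lambda: "restarted",
                              typed_confirmation="restart-fleet"))
```

A guard like this only addresses the "wrong link" class of mistake; it does nothing for the documentation and protocol gaps mentioned above.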

jzwinck · on Jan 31, 2013

To some extent I agree with you, but it is a slippery slope. If we always deny that one person really is a problem, we may retain a truly bad employee while building excessive safeguards that hinder productivity for others. In my experience this possibility is all too real.

A team of good people should learn from their mistakes and reduce hazards along the way. But bumper bowling is no fun for experienced players. It's a balance, and it does tend to shift as a company grows.

cl8ton · on Jan 31, 2013

12:22 PST still down... Looking for Mushroom cloud to the NW

farnsworth · on Jan 31, 2013

12:26 PST and no problems accessing from Seattle.

bink-lynch · on Jan 31, 2013

12:34 PST and OK from Vegas.

dos1 · on Jan 31, 2013

On the plus side, if they're willing to share, I bet this will be a very interesting postmortem. Presumably Amazon.com is one of the more bulletproof web properties in the world. Whatever could have occurred to take it down for nearly an hour (at this point) can only be interesting!

nirvana · on Jan 31, 2013

I can't compare to other web properties, but when I worked at Amazon, the store going down was a regular event. Something broke almost daily, though it was rare for the whole store to go down. (EG: You might not be able to search, or checkout might be down, etc.)

The store went thru periods of relative stability, and relative lack of stability, and in the periods where it was not doing so well, it (or a major piece of functionality) would go down in some key area at least once a week, sometimes multiple times a week during the holidays.

While it's been several years and I'm sure they've improved reliability, the sheer mass of the store made it very slow to evolve. And as an ex-Amazonian, I sometimes go and check for bugs that were issues back in the day. Several of them have come back over the years, which is not surprising given that the entire group that was working on the parts I was working on disbanded because so many people were driven off by bad management. (A one-two punch in that case: a bad manager backed by another bad manager, neither of whom had any technical knowledge.)

At the time I worked there, large swaths of code in the store had no team who was responsible because the team had been disbanded in one of the regular shuffles of employees. Amazon had a tendency to get a team together to do a feature, launch it, get the PR and the stock bump, then disband the team and put them on other projects. Of course some of these things stuck around if they were successful, but there was a lot of cruft from past efforts like: Local restaurant menus, the movie times system, various "social shopping features" (a perennial favorite to try again and again.) Hell, they used to have catalogs for mail order merchants- scanned paper catalogs!

At the time, they were claiming that "AWS is what we built the amazon store on!" That was totally false: S3 was engineered completely separately from the store, and to its credit, as obidos and gurupa were crap. The only thing the store shared with AWS for at least the first several years was being hosted in some of the same datacenters.

At least at the time I worked there, I'd call it a mess held together by the code equivalents of duct tape and baling wire.

One of the things Amazon excels at is customer service, so when these problems would impact the customer, their bacon was often saved by customer support fixing the problem manually (eg: messed up orders, etc.)

Granted, operating at Amazon's scale is no trivial matter. But Amazon is a retailer and stock marketing company (e.g., one of their primary products is Amazon stock), more than an engineering company.

I'm kinda amazed that people perceive them as a "tech giant" along with Google, Facebook and Amazon. Shows the power of a good (actually, GREAT) side business like AWS. They get the credit for building something good and scalable with AWS, but of course it was a separate team led by a senior executive with enough political clout to shelter that team.

gokulk · on Jan 31, 2013

'I'm kinda amazed that people perceive them as a "tech giant" along with Google, Facebook and Amazon ' err.. we are talking about Amazon here

InclinedPlane · on Jan 31, 2013

Amazon is a weird company, and it has lots of parts. Even at, say, Microsoft there can be a huge amount of variation from division to division and team to team on how things are run, the corporate culture micro-climate, etc. At Amazon this is even more true: each team is substantially on its own, and while there is a certain amount of global overarching corporate culture, every group is different and some groups buck the trend successfully.

res0nat0r · on Jan 31, 2013

What a great Freudian slip.

akiselev · on Feb 1, 2013

They have one of the biggest logistics systems run by a large amount of software in the US, one of the biggest robotics deployments in the warehouse, AND they developed AWS on the IT side. Amazon's software is largely behind the curtains but they are definitely a tech giant.

aphexairlines · on Feb 1, 2013

> as obidos and gurupa were crap.

Except for the part where Gurupa enables scores of developers to build web apps that make hundreds of service calls yet emit results faster than the website we're using right now.

badgar · on Feb 1, 2013

The website we're on is restarted every few days because memory leaks are hard.

aphexairlines · on Feb 1, 2013

It could just be that mzscheme never returns memory to the OS. Perl doesn't.

badgar · on Feb 1, 2013

Not returning memory is different from a memory leak. Not returning memory means the memory footprint equals peak memory footprint. A memory leak is a bug in the program which causes space complexity in memory to grow unbounded. mzscheme certainly doesn't leak memory. HN leaks memory.
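A toy sketch of that distinction, in Python rather than mzscheme, and not a claim about how HN itself is written: the first handler keeps a reference to every request forever (a leak, so live data grows with the number of requests), while the second keeps a bounded window (its footprint peaks and then stays flat, whether or not the runtime hands freed memory back to the OS). The class names and the `window` parameter are hypothetical.

```python
from collections import deque


class LeakyHandler:
    """Retains every request it has ever seen: retained data grows without bound."""

    def __init__(self):
        self.seen = []

    def handle(self, request):
        self.seen.append(request)  # never pruned: this is the leak
        return len(request)

    def retained(self):
        return len(self.seen)


class BoundedHandler:
    """Retains only the most recent `window` requests: footprint is capped."""

    def __init__(self, window=100):
        self.recent = deque(maxlen=window)  # old entries are dropped automatically

    def handle(self, request):
        self.recent.append(request)
        return len(request)

    def retained(self):
        return len(self.recent)


def retained_after(handler, n_requests):
    """Feed n_requests synthetic requests and report how many are still held."""
    for i in range(n_requests):
        handler.handle(f"request-{i}")
    return handler.retained()


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, retained_after(LeakyHandler(), n), retained_after(BoundedHandler(), n))
```

Counting retained objects stands in for measuring memory here; the leaky count tracks the request count, while the bounded count stops at the window size.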