Jul 06, 2012

How did Amazon have a cloud service outage that was caused by generator failure?

I know that "stuff happens", but when I read Amazon's explanation of why it had service failures during the recent power outages, I was left shaking my head. Apparently, when power went out at their data center and they tried to shift to backup power from diesel generators, the generators "failed to provide stable voltage as they were brought into service". This seems to me like the sort of thing that you test until you are darn certain that everything works. Just checking that your generators start up is insufficient; in my opinion, you have to regularly cycle and test them under load to avoid problems like this. Am I being too hard on Amazon here, or did they really drop the ball on this?


I agree. The problem wasn't unfamiliarity with some new or cutting-edge technology or practice; it was a data center management/design problem. A data center should have its backup system regularly load tested, including making certain that multiple generators are able to sync their outputs, which sounds like it may have been the problem. If Amazon wasn't doing this, it was pretty slack on their part.

No, you aren't being too hard on them. Your expectations sound quite reasonable. The entire situation underscores the dangers of being too reliant on the cloud. Users have no way of really knowing what's going on at the other end of their connection.

Perhaps it's time to find another cloud provider? Amazon may have proven itself too unreliable or unprofessional in its practices.