May 30, 2014

What happened at Joyent to take down their datacenter?

On Tuesday, Joyent went down for a while, and I've heard the outage blamed on a mistake by a single admin. Could one admin really take down an entire data center with a single mistake? After Joyent reneged on its promise of free lifetime hosting for the people who gave money to the company's crowdfunding campaign, I don't think there is much goodwill left for them, and this doesn't help matters.

Fat-fingered admin downs entire Joyent data center

"The cause of the outage was that an admin was using a tool to remotely update the software on some new servers in Joyent's data center and, when trying to reboot them, accidentally rebooted all of the servers in the facility.

"The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter," Joyent wrote. "Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."

At this point we imagine the operator emitted a high-pitched "oh dear oh dear oh dear" before watching the inevitable brown-out occur."

The best explanation I've read was from The Register: "Fat-fingered admin downs entire Joyent data center." Apparently, software on some servers was being remotely updated, and when the admin tried to reboot the updated servers, he or she accidentally rebooted ALL of the servers in the datacenter at once. Oopsie.
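The "extra steps/confirmation" Joyent says its tool lacked could be as simple as a guard that refuses a fleet-wide reboot unless the operator explicitly confirms it. Here is a minimal Python sketch of that idea; all names are hypothetical, since Joyent has not published the tool in question:

```python
# Hypothetical sketch of the missing safeguard: a reboot planner that
# refuses to target the entire fleet without explicit confirmation.
def plan_reboot(requested, fleet, confirmed_all=False):
    """Return the list of servers to reboot, or raise if the request
    would hit every server in the fleet without confirmation."""
    if requested == "all":
        targets = list(fleet)
    else:
        targets = [host for host in fleet if host in requested]

    if len(targets) == len(fleet) and not confirmed_all:
        raise ValueError(
            "refusing to reboot all %d servers without explicit "
            "confirmation (pass confirmed_all=True)" % len(fleet)
        )
    return targets


fleet = ["web1", "web2", "db1", "db2"]
print(plan_reboot(["web1", "web2"], fleet))  # only the named servers
```

With a check like this, a mistyped target that expands to the whole availability zone fails loudly instead of issuing reboots, which is presumably the kind of validation Joyent added after the incident.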
