Mar 06, 2012

Why would "leap day" take down Windows Azure?

As many of us know, Azure had a significant service outage on Feb. 29th. Haven't we heard this tune before? I seem to recall a little something about 12/31/99 and Y2K. I would assume Microsoft was using a boolean calendar for its software. That's what I don't get. How would it matter whether it was a leap year or not if they were using a boolean calendar?

It is somewhat baffling that after all the Y2K hysteria, we would have a calendar-related failure 12 years later. Apparently, this time it was caused by SSL certificates that were valid for a year, and "a year" meant 365 days. Unfortunately for some Azure customers, a leap year has 366 days, not 365. Oops. It will be interesting to see if something similar happens four years from now, or if the lesson was learned.
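
The 365-versus-366 arithmetic can be checked with a short Python sketch. This is only an illustration of the general pitfall, not Azure's actual code; the helper name `add_one_year` and the choice to fall back to Feb. 28 are assumptions made for the example.

```python
from datetime import date, timedelta

leap_day = date(2012, 2, 29)

# Treating "a year" as a fixed 365 days lands one day short of a
# full calendar year when the span crosses a leap day.
plus_365 = leap_day + timedelta(days=365)  # 2013-02-28

# 2012 is a leap year, so it has 366 days, not 365.
days_in_2012 = (date(2013, 1, 1) - date(2012, 1, 1)).days  # 366


def add_one_year(d: date) -> date:
    """Hypothetical helper: add one calendar year to a date.

    Simply bumping the year fails for Feb. 29, because the following
    year has no Feb. 29. Falling back to Feb. 28 is one possible
    policy (an assumption here); Mar. 1 would be another.
    """
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # d is Feb. 29 and the target year is not a leap year.
        return d.replace(year=d.year + 1, day=28)


print(plus_365)                # 2013-02-28
print(days_in_2012)            # 366
print(add_one_year(leap_day))  # 2013-02-28
```

Whichever fallback is chosen, the point is that the leap-day case has to be handled deliberately rather than left to a fixed day count or an unchecked year increment.
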
Apparently the leap year date bug prevented the systems from knowing the correct time. See this article for details.

Yes, Microsoft Azure Was Downed By Leap-Year Bug

"Microsoft has confirmed that Wednesday’s Windows Azure outage that left some customers in the dark for more than 12 hours was the result of a software bug triggered by the Feb. 29 leap-year date that prevented systems from calculating the correct time.

In a post, Azure lead engineer Bill Laing said his team was able to put a fix in place that restored service to most customers around 3 a.m. PST on Wednesday, a little more than nine hours after it became aware of the issue. In a follow-up bulletin, he promised to provide a fuller post-mortem on the root cause soon. Point-of-sale terminals in New Zealand supermarkets were also reportedly bitten by leap-year bugs."