We compared two datacenter outages that occurred close to each other in both time and space. The first was the Namejuice/DROA/DROC outage, which, in the absence of any information from the vendor, we speculated upon and analyzed here. The other happened to us about a week ago.
Both occurred during North American peak business hours, yet they had vastly different durations. The customer impact of one was horrific: an entire day of uptime, productivity, and sales lost for one set of customers. The customer impact of the other was minimal and barely noticed.
This chart cross-references 12 points of operational competence that make the difference between a catastrophic loss of uptime and a minor hiccup.
| | Namejuice / DROA | Us |
|---|---|---|
| Date | Sept 10 (Monday) | Oct 9 (Tuesday) |
| DC Location | Markham, ON | Markham, ON |
| Root cause | Power outage | Core router failure |
| Vendor provides status updates | | 3:10pm via Twitter |
| Vendor workaround / remedy | Not under control of vendor | Manually move routing announcements to a backup path. Announced via vendor ASN. Functioned as expected. |
| Duration of outage | | |
| Reason For Outage (RFO) posted | | |
| Base cost of domain registration | | |
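The workaround in the table, manually moving routing announcements to a backup path, is typically done with BGP. As a purely illustrative sketch (this is not either vendor's actual configuration), here is a minimal FRR-style setup using documentation ASNs and prefixes: the same netblock is announced to both a primary and a backup transit provider, with AS-path prepending on the backup session so traffic normally prefers the primary. During an outage the operator shuts down the primary session, and traffic follows the still-valid backup announcement within minutes.

```
! Hypothetical FRR bgpd config; all ASNs and prefixes are
! documentation values (RFC 5398 / RFC 5737), not real ones.
router bgp 64500
 neighbor 192.0.2.1 remote-as 64496         ! primary transit
 neighbor 198.51.100.1 remote-as 64499      ! backup transit
 address-family ipv4 unicast
  network 203.0.113.0/24                    ! our netblock, announced via our own ASN
  neighbor 198.51.100.1 route-map BACKUP-OUT out
 exit-address-family
!
! Prepend our ASN on the backup path so the primary wins under normal conditions
route-map BACKUP-OUT permit 10
 set as-path prepend 64500 64500 64500
!
! During an outage of the primary:
!   router bgp 64500
!    neighbor 192.0.2.1 shutdown
! Traffic then shifts to the backup announcement as routes reconverge.
```

Note that this kind of failover is only possible because the netblock is announced under the vendor's own ASN; a vendor whose IP space is controlled by its datacenter provider has no such lever to pull.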
Outages happen to everybody. What separates a reliable vendor from the pack is how they respond to them.
- Are they set up with backups and alternatives from the outset?
- Do they quickly inform their customers of the ongoing status and issue a Reason For Outage afterwards?
- Most important: Do they learn from the outage and improve for next time?