In the run-up to the much-anticipated Xbox launch, the Microsoft Corporation all but disappeared off the internet as all of their key domains went dark: xbox.com, microsoft.com, outlook.com, etc.
While TechCrunch ran an article positing that it was their Azure Cloud Platform that was at fault, the chatter on both the dns-operations and mail-ops mailing lists pointed instead at a widespread DNS outage, which a few queries seemed to bear out:
markjr@marksandbox:~/$ host xbox.com
;; connection timed out; no servers could be reached
markjr@marksandbox:~/$ host outlook.com
Host outlook.com not found: 3(NXDOMAIN)
markjr@marksandbox:~/$ host microsoft.com
;; connection timed out; no servers could be reached
A few minutes later everything came back, but it must certainly have been (and continue to be) painful for Microsoft.
This is a textbook case for Proactive Nameservers: you keep a bunch of hot-spare nameservers in place, waiting to go, and you monitor the performance of the nameserver delegation. Then when the crap hits the fan, boom (or bing!) – the hot-spare nameservers are swapped into the delegation to hold up half the internet while everybody figures out what the hell went wrong.
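To make the swap decision concrete, here is a minimal sketch in Python. It is not how any particular product does it; the function name choose_delegation, the min_healthy threshold, and the probe callable (which you would back with a real DNS query against each nameserver) are all assumptions made for illustration.

```python
from typing import Callable, List


def choose_delegation(
    active: List[str],
    spares: List[str],
    probe: Callable[[str], bool],
    min_healthy: int = 1,
) -> List[str]:
    """Return the nameserver set that should be in the delegation right now.

    `probe` is a hypothetical health check: it takes a nameserver hostname
    and returns True if that server answered authoritatively for the zone.
    """
    healthy_active = [ns for ns in active if probe(ns)]
    if len(healthy_active) >= min_healthy:
        # The live delegation is still answering; leave it alone.
        return active

    healthy_spares = [ns for ns in spares if probe(ns)]
    # Only swap if at least one spare is actually up; otherwise a swap
    # gains nothing and just adds churn at the registry.
    return healthy_spares if healthy_spares else active
```

The result is only a decision. Actually pushing the new NS set to the registry or registrar is a separate step that depends entirely on what API (if any) they give you, so it is left out here.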
It still surprises me how companies can pour millions or more into disaster recovery, server-level failover, backups, pen-tests, DDoS mitigation, and the like, and still, usually, the nameservers are a blind spot. A really, really fatal one.
What this case also demonstrates is the need for multi-provider or multi-solution redundancy at the nameserver level for infrastructures that require 100% DNS availability. While I’m sure that Microsoft has put a lot of work into their DNS solution, it shows that all nameserver deployments (including managed DNS providers!) are logically a single point of failure unto themselves. Yes, redundancies are built in; yes, anycast is used; yes, load balancers and HA switches; yes, diverse network architectures and even heterogeneous DNS software are employed; and after all that, yes – shit happens.
You need to syndicate your DNS across multiple out-of-band deployments and have a coherent methodology to make sure those deployments are in sync with each other. When one of them is malfunctioning (or non-functioning), you need to be able to switch between deployments by modifying the nameserver delegation on-the-fly.
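One way to check the "in sync" part is to fingerprint each deployment's copy of the zone and compare. The sketch below is an illustration under assumptions, not a description of any specific tool: the names Record, zone_digest, and find_drift are made up here, and how you pull the records out of each provider (AXFR, provider API, etc.) is left to you.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# (owner name, record type, rdata) -- a deliberately simplified record shape.
Record = Tuple[str, str, str]


def zone_digest(records: Iterable[Record]) -> str:
    """Hash a zone's records in a normalized, order-independent way."""
    normalized = sorted(
        (name.lower().rstrip("."), rtype.upper(), rdata.strip())
        for name, rtype, rdata in records
    )
    digest = hashlib.sha256()
    for rec in normalized:
        digest.update("\t".join(rec).encode("utf-8") + b"\n")
    return digest.hexdigest()


def find_drift(deployments: Dict[str, List[Record]], reference: str) -> List[str]:
    """Return the names of deployments whose zone differs from `reference`."""
    want = zone_digest(deployments[reference])
    return sorted(
        name for name, records in deployments.items()
        if zone_digest(records) != want
    )
```

In practice you would probably filter out the SOA and apex NS records before hashing, since those can legitimately differ from one provider to the next. Any deployment that shows up in the drift list is one you would not want to swap into the delegation until it has caught up.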
See our original Proactive Nameservers video below (I really need to redo the audio track). There is also this non-sales-y article over on CircleID.