On Friday a large chunk of the internet went off the air when Cloudflare apparently fat-fingered a routing update and sent all of their global traffic to a single POP, vaporizing it almost instantly.
This affected their DNS service, and of course, as everybody knows, when your DNS is gone, so are you. At least one other commercial DNS provider who uses Cloudflare in front of their own nameservers for DDoS mitigation also went off the air.
We’re familiar with Cloudflare’s DDoS service for DNS providers, because we use it ourselves. Fortunately easyDNS was not impacted by the outage (I didn’t even notice it, tbh), and I only heard about it later in the day when I checked in on social media at some point and saw all this chatter about “half the internet blowing up”.
easyDNS was unaffected because, while we do use Cloudflare to soak up large DDoS attacks against our nameservers, we don’t use them across all of our nameservers. I think somewhere in my book I wrote “DNS providers have a near-pathological aversion to SPOFs” (Single Points of Failure). Maybe only we do.
This is why, whenever one of the largest DNS providers in the world blows themselves up or gets DDoSed off the air, we are quick to point out two things:
- This is inevitable and unavoidable and entirely excusable. Everybody blows up, every DNS provider in existence will experience downtime. No exceptions.
- There is a silver bullet for avoiding your own downtime when your DNS provider blows up: use multiple DNS providers (a point we’ve belaboured many times in the past is that every DNS provider is a logical SPOF unto itself).
At easyDNS we experienced so much pain from this reality that we created a system to automate flipping DNS providers at the first sign of trouble.
We call it Proactive Nameservers, and we’re the only company in the world doing it for some reason. Maybe this is because in order to provide a service like nameserver failover, it means a company has to admit to its customers the reality that their own nameservers may at some point, fail.
There are two approaches to multi-DNS architectures: active/active, where you use multiple DNS providers all the time, and active/passive, which is what Proactive DNS does.
For active/active there are myriad ways to do it. You can use things like our easyRoute53 integration with Amazon Route 53 DNS, so you only need to manage your DNS settings in one place, or just use plain old-fashioned secondary DNS at some out-of-band provider. Tools like OctoDNS can help you automate across multiple providers (on that note, easyDNS support for OctoDNS is either out now or in the process of being committed).
See our High Availability DNS page for more info on integrations and lo-fi methods of doing this.
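To make the SPOF point concrete, here is a minimal Python sketch of how you might audit a nameserver set for anything that every nameserver depends on. The nameserver names and dependency labels (`op-a`, `scrubber-x`) are hypothetical placeholders you would fill in by hand from what you know about your providers; this is not a tool we ship.

```python
def shared_dependencies(ns_deps):
    """Return the dependencies common to every nameserver.

    ns_deps maps a nameserver name to the set of things it relies on
    (operator, DDoS scrubbing vendor, network, ...). Anything in the
    result is a candidate single point of failure.
    """
    dep_sets = [set(deps) for deps in ns_deps.values()]
    if not dep_sets:
        return set()
    return set.intersection(*dep_sets)


# Hypothetical layout: one operator, with a scrubbing vendor in front of
# only half of the nameservers.
single_operator = {
    "dns1": {"op-a", "scrubber-x"},
    "dns2": {"op-a", "scrubber-x"},
    "dns3": {"op-a"},
    "dns4": {"op-a"},
}

# Same layout plus a secondary at an unrelated operator.
two_operators = dict(single_operator, secondary={"op-b"})

print(shared_dependencies(single_operator))
print(shared_dependencies(two_operators))
```

In the first layout the scrubbing vendor is not a shared dependency, but the operator itself still is. Adding a nameserver at a second operator empties the set, which is the whole argument for multiple providers.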
Again, from my book, even a single unicast node staying up when all else is down will get you through a major network event like this unscathed.
But maybe you want to use a preferred DNS provider, such as Cloudflare, who use their DNS responses to optimize your website proxy. That works best most of the time, so then you want to go with an active/passive model: one that steps back when things are going according to plan and then, when these periodic network cataclysms do occur (and they will), steps into the breach and updates your nameservers so that you at least stay up until the crisis is over.
The only requirement to use Proactive Nameservers is that we have to be your registrar, because we need to connect to the registry to update your nameserver delegation. If for some unfathomable reason we aren’t your preferred DNS vendor, you can stick with one who is and just transfer your domain here (we even have a transfer valet to do all the heavy lifting for you if you need it).
Learn more here (including pricing) or check out the original Proactive DNS explainer video….
Gido says
Jesus Christ, this is just beating CF while they’re down, but I still advocate for running your own DNS, and I have a secondary off my network (puck.nether.net/DNS, see an example of my NS system with `dig NS clickable.systems`).
Mark E. Jeftovic says
How is this beating CF while they’re down? This is just reality.
You want beating somebody while they’re down? Talk to the sales teams at some other companies who like to fire up the boiler room when a competitor is getting DDoSed and tell that competitor’s customers that their vendor is going out of business. Everybody in the DNS space knows who I’m talking about.
E Connelly says
Love how (some yrs ago; computers:wave of future so secure, so work freeing; A LOAD OF SCHEIST
Revelation Now says
Actually, the outage was caused by a BGP routing update that was incorrect and caused traffic to fail to route via Cloudflare as intended.
No matter how ‘highly available’ your DNS servers were, it’s most likely nobody would have been able to reach your servers either if they needed to cross the impacted networks. DNS, just like HTTPS or web traffic, is just a simple application that sits on top of the Internet. Please don’t conflate Cloudflare’s inter-network routing problems with simple DNS failures.
Mark E. Jeftovic says
I don’t think you’re paying attention.
1) We use cloudflare (and during the outage 2 of our nameserver anycast strands were affected, dns1 and dns2)
2) we dislike SPOFs, so dns3 and dns4 do not use cloudflare.
3) cloudflare goes down
4) easyDNS does not go down (see #2). At least one other DNS provider using CF for all nameservers does go down. Anybody using CF DNS goes down.
What Proactive Nameservers does is this: if your nameservers go down, it switches your delegation to nameservers that are not down.
If you set up all your nameserver pools on the same provider, and that provider goes down, then yes, you are out of luck. But there is also no point in setting it up that way.
Toni says
What happens if your proactive system has a failure where it mistakes all nameservers to be down?
And what happens if there is an issue with how it updates the nameservers and butchers them doing so?
Also, if you rely on CF’s proxy and network features, it’s a SPoF you’ll just have to live with.
Mark E. Jeftovic says
Sometimes proactive nameservers flips your nameservers on a false positive, or let’s say an overreaction, i.e. there is degraded performance in your primary pool but it’s not really impacting anything. In fact, this happened to us, yesterday, when a node in France that is part of DNS4 anycast degraded for a few minutes.
Nobody noticed, and typically nobody will notice, because the system makes sure that your backup pools are in sync with your primary master. It won’t flip you to a set of nameservers that are out of sync. So if your zone temporarily gets served by your backup pool, which is serving the same zone, it doesn’t make any difference.
We continually strive to eliminate false positives, but we’d rather have a false positive, where the system kicks in when it didn’t strictly need to, than a false negative, where it doesn’t kick in when it should.
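For anyone who wants to picture the "won’t flip you to an out-of-sync pool" guard, here is a rough Python sketch. It assumes sync is judged by comparing SOA serials, which is the usual way to do it but is an assumption here; the function names and serial values are invented for illustration.

```python
def in_sync(master_serial, backup_serials):
    """True only if every backup nameserver reports the master's SOA serial.

    backup_serials maps a backup nameserver name to the serial it is
    currently serving. An empty backup pool is never considered in sync.
    """
    return bool(backup_serials) and all(
        serial == master_serial for serial in backup_serials.values()
    )


def may_flip(primary_down, master_serial, backup_serials):
    """Allow a flip only when the primary is down AND the backup is in sync."""
    return primary_down and in_sync(master_serial, backup_serials)


# Backup pool serving the same zone version: a flip is harmless.
print(may_flip(True, 2020071801, {"bk1": 2020071801, "bk2": 2020071801}))

# One backup node is a version behind: hold off.
print(may_flip(True, 2020071801, {"bk1": 2020071801, "bk2": 2020071700}))
```

With a guard like this, a false positive costs you nothing visible, because the pool you land on is serving the same zone data.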
Mark E. Jeftovic says
The other thing I’ll add to this is you can set up proactive nameservers to either *add* the backup pool to your existing delegation, or *switch* to the backup pool.
We have ours set up to *add*, so even if there’s a false positive, what that means is for a brief period you are running on more nameservers than usual.
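The difference between the two modes is easy to show in Python. This is a sketch of the resulting NS set only, with hypothetical nameserver names under the reserved `.example` TLD; the `mode` strings are labels chosen for this example.

```python
def new_delegation(current, backup, mode):
    """Compute the NS set to publish at the registry after a trigger.

    mode "add":    keep the current nameservers and append the backup pool.
    mode "switch": replace the current nameservers with the backup pool.
    """
    if mode == "switch":
        return list(backup)
    if mode == "add":
        merged = list(current)
        for ns in backup:
            if ns not in merged:  # avoid listing a nameserver twice
                merged.append(ns)
        return merged
    raise ValueError(f"unknown mode: {mode!r}")


current = ["ns1.primary.example", "ns2.primary.example"]
backup = ["ns1.backup.example", "ns2.backup.example"]

print(new_delegation(current, backup, "add"))
print(new_delegation(current, backup, "switch"))
```

In *add* mode a false positive just leaves you with four nameservers instead of two for a while, which is why we run ours that way.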