RCA: Intermittent DNS1 Issues on April 3rd 2024

Background: Until March 31, 2024, easyDNS was contracting with Cloudflare to provide DNS firewall services on our DNS1 and DNS2 anycast constellations. After a good run, lasting over 10 years, we decided to go in a different direction and provided them with notification to that effect back in January.

On April 1st, after removing RPKI authorizations from ARIN and coordinating with Cloudflare in the lead-up to the cutover, we received multiple confirmations from their team that they had indeed stopped BGP announcements for our address space. We monitored things for a couple of days to ensure everything was stable and performing as expected, then proceeded with the final step of turning down the Cloudflare firewall.

Beginning the morning of April 3rd, 2024 (EST), some customers began reporting outages with DNS1.EASYDNS.COM. Their queries all seemed to still be reaching Cloudflare rather than our new DNS1 setup. It turned out this only affected anybody who:

• Was also using Cloudflare for their websites, or
• Was directly or indirectly using Cloudflare’s 1.1.1.1 resolver service

DNS2 was less affected because we used a different method to deploy that constellation. But any queries hitting Cloudflare via DNS1 would stop there with a failure rather than trying another nameserver due to the type of response Cloudflare was returning for queries that hit their network.
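To illustrate why a failure response can be worse than no response at all, here is a minimal sketch of resolver failover. It is a toy model, not a description of how Cloudflare's or any other resolver is implemented: the `Outcome` categories, the `resolve` function and the two probe functions are hypothetical names made up for this example, and the exact response type returned during the incident is deliberately left abstract.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ANSWER = "answer"                  # usable records came back
    TIMEOUT = "timeout"                # no response at all
    TERMINAL_ERROR = "terminal_error"  # a response the resolver treats as final


def resolve(nameservers: list[str],
            probe: Callable[[str], Outcome]) -> tuple[Outcome, list[str]]:
    """Try each nameserver in order; return the final outcome and the servers tried."""
    tried: list[str] = []
    for ns in nameservers:
        tried.append(ns)
        outcome = probe(ns)
        if outcome is Outcome.TIMEOUT:
            continue  # silence: fall through to the next nameserver
        return outcome, tried  # any actual response ends the lookup here
    return Outcome.TIMEOUT, tried


# Hypothetical scenario A: the path to dns1 returns a final-looking failure.
def probe_stale_path(ns: str) -> Outcome:
    return Outcome.TERMINAL_ERROR if ns == "dns1" else Outcome.ANSWER


# Hypothetical scenario B: the path to dns1 simply goes quiet.
def probe_silent_path(ns: str) -> Outcome:
    return Outcome.TIMEOUT if ns == "dns1" else Outcome.ANSWER


print(resolve(["dns1", "dns2"], probe_stale_path))
print(resolve(["dns1", "dns2"], probe_silent_path))
```

In scenario A the lookup ends at dns1 and dns2 is never consulted; in scenario B the resolver moves on and gets an answer from dns2. Real resolvers also randomize server order and differ in which responses they retry, so this only captures the general shape of the problem.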

Repeated attempts to engage with Cloudflare were met with confusion and delays as we tried to escalate the situation. It seems many on the Cloudflare side genuinely thought that everything should be fine and couldn’t figure out how to fix the issue.

As far as we can determine, this is what was happening:

• Cloudflare was internally still reverse proxying our DNS
• Queries coming in for domains delegated to our nameservers using Cloudflare as a resolver would fail in a way that stopped them from cycling to the next NS
• Cloudflare engineers were having difficulty trying to fix the issue, and once they did we still had to wait out the TTLs for the corrections to fully propagate
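The TTL wait in the last point can be sketched with a little arithmetic. This is an illustration only: the function names are hypothetical, and the fix time and TTL used below are made-up values, not figures from the incident.

```python
from datetime import datetime, timedelta


def latest_stale_expiry(fix_time: datetime, ttl_seconds: int) -> datetime:
    """Worst case: a cache stored the bad data just before the fix and keeps it for a full TTL."""
    return fix_time + timedelta(seconds=ttl_seconds)


def seconds_remaining(cached_at: datetime, ttl_seconds: int, now: datetime) -> int:
    """How much longer one cache entry stays valid, never less than zero."""
    expires_at = cached_at + timedelta(seconds=ttl_seconds)
    return max(0, int((expires_at - now).total_seconds()))


# Hypothetical values: fix applied at 15:00 with a one-hour TTL on the cached data.
fix_time = datetime(2024, 4, 3, 15, 0)
print(latest_stale_expiry(fix_time, 3600))
```

The point is that a fix at the source does not reach every client at once: each cache drops the old data on its own schedule, up to one full TTL after the fix.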

Although we took great care with the planning and implementation of the migration, both the outage and its resolution were outside of our control. But ultimately our responsibilities are to our clients, and we failed you in two key respects:

1. We didn’t announce the changeover in advance.

As a DNS provider, we’ve made it our business to be able to maintain and upgrade our nameserver fleet in a seamless manner, unnoticeable to the end-user. For over twenty years, we’ve been re-numbering nameservers, moving BGP announcements around, changing our nameserver architecture, upgrading – it’s never been an issue.

But this was a significant shift, and even though the cutover to the new setup went fine, the turndown on the old vendor went badly – and we never considered that possibility ahead of time.

We should have announced this in advance so those impacted would have known the cause.

2. We were looking for problems in all the wrong places.

As we implemented the migration, our focus was on the new setup: making sure it was ready, tested and working. When planning out an upgrade like this, we always have a backout plan for what we’ll do if things go wrong. But we really didn’t consider what we would do if the shutdown of the old system went ungracefully.

At the end of the day, we had no control over it, but it’s something we should have considered more carefully, if only to have a well-defined escalation path mapped out in advance.

We’re very sorry to those of you who were impacted by this. Rest assured, we have learned from this and will up our game accordingly.

We’d like to acknowledge Cloudflare for their past service, and a special thank you to the engineers and support staff working on this who were doing their level best to sort out a departing client in a confusing scenario. There are some vendors who wouldn’t care, so at least there’s that. We wish them well in their future endeavors.

I personally hope the acknowledgement of our own failures will be a first step in rebuilding the trust we’ve worked so hard to earn from all of our customers over the two decades we’ve served you.

easyDNS Technologies, Inc. Founded In 1998
