This morning around 8:30am EST, monitoring services around the world (DNSStuff, Pingdom, etc.) began reporting DNS timeout errors for easyDNS back to their respective users. Oddly, most of the time a human check (like trying to get to a website in a web browser) would succeed, though not always.
To compound the weirdness, none of our datacenters were reporting or seeing anything like a DoS attack, non-standard traffic or anything of the sort.
To further compound the problem, the symptoms were intermittent and sporadic, surfacing one moment in one area (say, the San Jose nameservers, which at one point we stopped announcing via anycast) and then being OK literally the next moment.
Then we noticed a customer domain giving us the mother of all dynamic DNS updates, hitting us with over 800 host updates in under a minute. When we looked closer, it seemed to be hitting us with hundreds or thousands of dynamic host updates every few minutes.
In a case like this, the system is supposed to throttle those endpoints, giving them a “TOOSOON” error for nearly every connection request, and if it keeps up, we blackhole the IPs sending them.
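To make the intended behaviour concrete, here is a minimal sketch of a per-IP throttle of that general shape. It is an illustration only, not the actual easyDNS code: the class name, the "OK" and "DROP" return values, and all three thresholds are hypothetical, and only the "TOOSOON" response comes from the description above.

```python
from collections import Counter, defaultdict, deque


class DynUpdateThrottle:
    """Hypothetical per-IP throttle for dynamic DNS updates (illustrative only)."""

    def __init__(self, max_updates=10, window=60.0, blackhole_after=50):
        self.max_updates = max_updates          # assumed: accepted updates allowed per window
        self.window = window                    # assumed: window length in seconds
        self.blackhole_after = blackhole_after  # assumed: rejections before an IP is blackholed
        self.accepted = defaultdict(deque)      # ip -> timestamps of accepted updates
        self.rejections = defaultdict(int)      # ip -> number of TOOSOON responses sent
        self.blackholed = set()                 # IPs whose requests are silently dropped

    def check(self, ip, now):
        """Return "OK", "TOOSOON" or "DROP" for an update from `ip` at time `now`."""
        if ip in self.blackholed:
            return "DROP"
        recent = self.accepted[ip]
        # Forget accepted updates that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_updates:
            self.rejections[ip] += 1
            if self.rejections[ip] >= self.blackhole_after:
                self.blackholed.add(ip)
            return "TOOSOON"
        recent.append(now)
        return "OK"


# Simulate one source sending 800 updates in under a minute.
throttle = DynUpdateThrottle()
outcomes = Counter(throttle.check("192.0.2.1", now=i * 0.07) for i in range(800))
```

With these made-up defaults, only the first handful of updates in the burst would be accepted and replicated; the rest are answered with "TOOSOON" until the source is blackholed, after which they are dropped without further work.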
For a reason we will call what it is, our fault, the throttling systems for dynamic DNS were not functioning properly on the new user interface system. As a result, every few minutes all of these updates were injected into the system, and the servers throughout our network would choke as they attempted to replicate the same data hundreds or thousands of times in succession, only to have to do it again a few minutes later.
This obviously will not do. We will be spending our time re-engineering this part of the system and plan to have permanent fixes committed today. We have resumed anycast announcements for the San Jose and Amsterdam nameservers.
We’re extremely sorry to have kicked off your workweek/Monday morning (or Monday afternoon for those users in the U.K.) with an event like this. It is not our proudest moment. We will work diligently to ensure a repeat performance never occurs.
If you have any questions or concerns around this issue, feel free to email (markjr [at] easydns [dot] youknowwhat) or call, 1-888-677-4741 ext 225.
– mark
Brian Cunnie says
Hey, thanks for being upfront about the cause of the outage. It always takes a certain level of courage to truthfully describe the events leading to an outage, and you guys came through. It’s a big reason why I stay with you.
–Brian