On Saturday, July 9, between 5:30pm EDT and 6:30pm EDT (21:30 – 22:30 UTC) we had a spike in users reporting response issues on some domains. Service was degraded for the better part of an hour for some domains or (more likely) in some parts of the world. As much as we hate to use the “regional outage” line, that’s what it’s looking like.
Most of the affected domains were either on the legacy system using the legacy nameserver delegation, or on the new system but had not yet updated their nameserver delegation to add the additional anycast constellation on the new platform.
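If you are unsure which side of that delegation split your domain is on, its NS records tell the story. A minimal sketch of checking this, assuming a purely hypothetical naming convention in which the new constellation’s nameservers contain the string “anycast” (substitute whatever actually distinguishes your nameserver sets):

```shell
#!/bin/sh
# Classify a whitespace-separated NS list as legacy-only vs. including
# the new constellation. The "anycast" substring is a hypothetical
# naming convention used only for this sketch.
classify_delegation() {
  case "$1" in
    *anycast*) echo "new constellation present" ;;
    *)         echo "legacy delegation only" ;;
  esac
}

# Typical usage: feed it the live delegation, e.g.
#   classify_delegation "$(dig +short NS yourdomain.com)"
```

Querying with `dig +short NS` against a parent-zone or public resolver avoids being misled by a stale local cache.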
I say most because we have one report from a user who is fully on the new system and still experienced issues.
We have had no reports from our Enterprise DNS users (if you are one and you had issues, please do let us know).
We initially believed this to be a general network issue (because we saw what looked like a corresponding spike in “* is down” reports on Twitter, etc., for unrelated domains around the same time). But as more data comes in, we are going to suck it up and say it: something weird happened, and we think it happened here.
1) We are still gathering data, and we have identified ways to enhance our internal monitoring so that we can cross-reference what we see with what external monitoring services are seeing.
2) We recently made a change in the way we adjust our BGP announcements for individual members of the anycast constellations to optimize response times. Since an event like this has never happened before, and this change went in recently, we suspect it may be at the core of the problem, and we have rolled it back.
3) We were slow to announce this here, and for that I personally apologize. When the event occurred, several of us in the systems group conferred about the incident on Saturday night. We suspected a wider network issue, but we still decided to roll back the new BGP changes just to be on the safe side. It was only after more reports rolled in from users today that we had to rethink our stance and accept that this was most likely our problem, not a network flap or outage.
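For context on what the change in (2) typically involves: response-time tuning for anycast usually means adjusting per-node export policy, for example AS-path prepending on individual constellation members so distant networks prefer a closer node. A hypothetical BIRD-style sketch (the prefix, ASN, and filter name are placeholders, not our actual configuration):

```
# Hypothetical BIRD-style export filter for one anycast node.
# Prepending our ASN makes this node's announcement less preferred,
# shifting traffic toward other members of the constellation.
filter export_anycast {
  if net = 192.0.2.0/24 then {
    bgp_path.prepend(64496);  # placeholder ASN
    accept;
  }
  reject;
}
```

The risk with this kind of tuning is that a prepend or withdrawal that looks harmless from one vantage point can leave a region with no nearby reachable node, which is consistent with the regional pattern described above.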
We’re sincerely sorry to all who were affected, and for the delay in this posting.