Timeline of events
At about 9:30 P.M. (EST) on Monday, February 26, 2018, our operations team noted high levels of traffic to Hover’s DNS servers.
We identified a source of the traffic and implemented a block on a range of IPs at the load balancer in front of Hover’s DNS servers. We also implemented DDoS mitigation filters using our third-party DDoS mitigation provider. That cleared the alerts and the issue was closed a short time later.
Through the night and into the early morning, we received support inquiries about problems reaching some domains that use Hover’s DNS. From these reports and our own internal testing, we determined that the problem appeared only when the user was using Google Public DNS (8.8.8.8 and 8.8.4.4), and that it affected resolution for only some of those domains, from only some locations around the world.
We continued to work to identify the reason for the continuing problems users were reporting.
We confirmed that the mitigation at the third-party DDoS mitigation provider had been removed, which eliminated it as a potential cause. At that point, based on the evidence we had gathered, we believed that Google Public DNS was potentially “negatively caching” the fact that our DNS servers had been unavailable for a period of time the previous evening, even though our servers were now completely available. We advised customers who contacted support that things should get better over time as the cached data expired.
We continued to investigate and test throughout the afternoon. At about 3:35 P.M. (EST) on Tuesday, we discovered that the block on a range of IPs at the load balancer was still in place. We also discovered that the blocked range in fact belonged to Google and was used for Public DNS queries.
We removed the IP block at the load balancer as soon as we found it, and the symptoms disappeared almost immediately. Our testing confirmed that affected domains began to resolve reliably and correctly again using Google Public DNS.
We’re deeply sorry. The impact of this event on you, and on those trying to access your websites or communicate with you via email, went on far longer than is acceptable to us. We could have and should have done a better job communicating throughout the incident via Hover Status. We’re taking specific steps to ensure that we do better going forward.
- We’ve immediately implemented a policy within our operations team to check ownership of any IP ranges prior to having an IP block implemented at the load balancer.
- We’ve taken several steps to improve communication within the operations team to ensure that any and all actions taken during events like a DDoS are properly documented and communicated across the team.
- Support staff will put a greater focus on communicating effectively and more frequently during incidents.
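As an illustration of the first step above, an ownership check before a block goes in could look something like the sketch below. This is not the tooling our operations team uses; it is a minimal example, assuming a hand-maintained list of ranges that must never be blocked. The names `PROTECTED_RANGES` and `find_conflicts` are hypothetical, and the ranges shown are reserved documentation addresses (RFC 5737) standing in for real ones, which would need to come from the owners' published lists or a WHOIS/RDAP lookup.

```python
import ipaddress

# Hypothetical "never block" list. The CIDRs below are RFC 5737
# documentation ranges used as placeholders, NOT real resolver ranges.
PROTECTED_RANGES = {
    "example-public-resolver": ["192.0.2.0/24", "198.51.100.0/25"],
    "example-monitoring": ["203.0.113.0/28"],
}


def find_conflicts(candidate, protected=PROTECTED_RANGES):
    """Return (owner, cidr) pairs that overlap the range proposed for blocking."""
    # strict=False accepts input such as "192.0.2.130/25" with host bits set.
    block = ipaddress.ip_network(candidate, strict=False)
    conflicts = []
    for owner, ranges in protected.items():
        for cidr in ranges:
            net = ipaddress.ip_network(cidr)
            # Only compare ranges of the same IP version (IPv4 vs IPv6).
            if block.version == net.version and block.overlaps(net):
                conflicts.append((owner, cidr))
    return conflicts


if __name__ == "__main__":
    for proposed in ["192.0.2.128/25", "10.0.0.0/8"]:
        hits = find_conflicts(proposed)
        if hits:
            print(f"DO NOT BLOCK {proposed}: overlaps {hits}")
        else:
            print(f"{proposed}: no protected range affected")
```

A check like this only catches ranges someone has already listed, so it would complement a manual ownership lookup rather than replace it.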