Microsoft has revealed that Thursday’s worldwide outage was caused by a code defect that allowed the Azure DNS service to become overwhelmed and stop responding to DNS queries.
At approximately 5:21 PM EST on Thursday, Microsoft experienced a worldwide outage that prevented users from accessing or signing into numerous services, including Xbox Live, Microsoft Office, SharePoint Online, Microsoft Intune, Dynamics 365, Microsoft Teams, Skype, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, OneNote, Microsoft Managed Desktop, and Microsoft Streams.
The outage was so widespread within Microsoft’s infrastructure that even the Azure status page, which is used to provide outage information, was inaccessible.
Microsoft eventually resolved the outage at approximately 6:30 PM EST, with some services taking a bit longer to function properly again.
At the time, Microsoft stated that the outage was caused by a DNS issue but did not provide further information.
Azure DNS service became overloaded
Last night, Microsoft published a root cause analysis (RCA) for this week’s outage and explained that it was caused by their Azure DNS service becoming overloaded.
Microsoft’s Azure DNS is a global network of redundant name servers that provides high availability and fast DNS services.
According to Microsoft, the Azure DNS service began receiving an “anomalous surge” of DNS queries from all over the world that were targeting certain domains hosted on Azure. While Microsoft does not explain what this anomalous surge was, it may have been a DDoS attack targeting those domains.
Microsoft states that their DNS service could normally handle a large number of requests through DNS caches and traffic shaping. However, a code defect prevented their DNS Edge caches from working correctly.
“Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches.”
“As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service,” Microsoft explained in the RCA for this week’s outage.
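The retry effect Microsoft describes can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the RCA: the function name, the constant failure rate, and the sample numbers are all hypothetical, chosen only to show how retries of failed queries multiply the load on servers that are already struggling.

```python
def offered_load(base_qps: float, failure_rate: float, max_retries: int) -> float:
    """Estimate total queries per second reaching the DNS servers when
    every failed query is retried, up to max_retries times.

    Assumption (not from the RCA): each attempt fails independently
    with the same probability, failure_rate.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    total = 0.0
    attempt_share = 1.0  # fraction of original queries making this attempt
    for _ in range(max_retries + 1):
        total += base_qps * attempt_share
        attempt_share *= failure_rate  # only failed attempts are retried
    return total


# Hypothetical numbers: 1,000 queries/s, half of all attempts failing,
# and clients that retry twice before giving up.
print(offered_load(1000, 0.5, 2))  # 1750.0
```

In this simplified model, the more queries fail, the closer the load gets to the base rate multiplied by the number of attempts, which is consistent with Microsoft's description of retries adding workload to an already overloaded service.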
As almost all Microsoft domains are resolved through Azure DNS, it was not possible to resolve hostnames on these domains and access the associated services when the DNS service became overloaded.
For example, the xboxlive.com domain uses the following Azure DNS name servers to resolve hostnames on this domain.
NS1-205.AZURE-DNS.COM
NS2-205.AZURE-DNS.NET
NS3-205.AZURE-DNS.ORG
NS4-205.AZURE-DNS.INFO
Since xboxlive.com is hosted on Azure DNS, and that service became unavailable, users were no longer able to log in to Xbox Live.
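As a rough illustration of this dependency, the sketch below checks whether a list of name servers falls entirely under the four Azure DNS suffixes seen in the xboxlive.com records above. The function name and the suffix list are assumptions made for this example, and the code only inspects names it is given. It performs no DNS lookups.

```python
# Suffixes taken from the xboxlive.com name servers listed in the article.
AZURE_DNS_SUFFIXES = (
    ".azure-dns.com",
    ".azure-dns.net",
    ".azure-dns.org",
    ".azure-dns.info",
)


def uses_azure_dns(name_servers):
    """Return True if every listed name server sits under an Azure DNS suffix."""
    # Normalize case and drop any trailing root dot before comparing.
    normalized = [ns.strip().rstrip(".").lower() for ns in name_servers]
    return bool(normalized) and all(
        ns.endswith(AZURE_DNS_SUFFIXES) for ns in normalized
    )


xboxlive_ns = [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
]
print(uses_azure_dns(xboxlive_ns))  # True
```

A domain whose name servers all sit with a single provider, as in this case, has no alternative resolution path when that provider's DNS service becomes unavailable.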
To prevent this type of outage in the future, Microsoft states that they are repairing the code defect in Azure DNS so that the DNS cache can adequately handle large volumes of requests. They also plan on improving the monitoring and mitigation of anomalous traffic.
BleepingComputer has contacted Microsoft to learn more about this anomalous surge but has not heard back at this time.