Microsoft says insufficient staff numbers contributed to Azure outage in Sydney 

Microsoft has published a preliminary report into an incident on 30 August, finding that insufficient datacenter staffing levels contributed to an outage that took Azure, Microsoft 365 and Power Platform services in Sydney offline for up to 46 hours.

The event was triggered by a utility power sag after an electrical storm in the Australia East region, which tripped a subset of the cooling units offline in one datacenter.

Temperatures in the datacenter rose while staff worked to restore cooling. Microsoft says it powered down two data halls in an attempt to avoid damage to hardware, resulting in a loss of service for that zone.

The cooling capacity for the two affected data halls consisted of seven chillers, with five in operation and two on standby. All five operating chillers faulted and could not be restarted manually because the chilled water loop temperature had exceeded the threshold. Only one of the standby chillers could be started, and Microsoft was forced to reduce thermal loads by shutting down servers.

Microsoft finds that several factors contributed to delays in bringing storage infrastructure back to full functionality, including (but not limited to):

  • Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner. Microsoft says it has temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place. 
  • The Emergency Operational Procedure (EOP) for restarting chillers is slow to execute for an event with such a significant blast radius. Microsoft is exploring ways to improve existing automation to be more resilient to various voltage sag event types.
  • Moving forward, Microsoft says that it is evaluating ways to ensure that the load profiles of the various chiller subsets can be prioritized so that chiller restarts will be performed for the highest load profiles first.
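
The last point, restarting chillers for the highest load profiles first, amounts to a simple ordering rule. The sketch below illustrates the idea only; it is not Microsoft's tooling, and the `ChillerSubset` type, the subset names and the load figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChillerSubset:
    name: str       # hypothetical label for a group of chillers
    load_kw: float  # assumed thermal load served by that group, in kW


def restart_order(subsets):
    """Return the subsets ordered so the highest thermal load comes first."""
    return sorted(subsets, key=lambda s: s.load_kw, reverse=True)


# Made-up figures, for illustration only.
subsets = [
    ChillerSubset("subset-a", 400.0),
    ChillerSubset("subset-b", 900.0),
    ChillerSubset("subset-c", 650.0),
]

for subset in restart_order(subsets):
    print(subset.name, subset.load_kw)
```

With a small night shift, an ordering like this would direct the first manual restarts at the chillers serving the most heat, which is the stated aim of the change Microsoft is evaluating.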

The full assessment of the outage is due within 14 days of the incident.