Root Cause Analysis
An infrastructure fault was experienced by our cloud service provider (CSP) resulted in a number of our servers being taken offline.
A hardware issue experienced by our CSP resulted in a number of our servers being taken offline ungracefully and automatic recovery mechanisms failed to restore them. This resulted in the servers constantly trying to start.
Summary of resolution
Quarantine the failed servers to restore service. Manual intervention was required to redeploy the servers to new hardware.
This was a very rare occurrence. New procedures are now in place to detect this kind of fault so we will not be impacted in this manner again.