AWS Blackout & Mitigating the Risks
On February 28th, 2017, AWS experienced a major outage that took down a large number of websites and services, including Trello, IFTTT, Amazon Instant Video and even Amazon's own voice service, Alexa.
According to AWS's post-incident statement, the issue was caused by a member of staff who was debugging a problem with billing and took more servers offline than intended.
“Removing a significant portion of the capacity caused each of these systems to require a full restart,” the post read. “While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
AWS says it is making changes to prevent this error from happening again, as covered in its statement.
How can you mitigate the risks of an outage beyond your control?
Moving more services to public clouds such as Azure and AWS raises an obvious question: how do you keep your important services running when a public data centre suffers an outage?
The affected AWS services were hosted in the US-EAST-1 region. To mitigate the risk and prevent an outage like this from affecting your business, there are two possible options.
Azure Site Recovery
Azure Site Recovery allows you to replicate your servers, whether they are hosted in AWS, Azure or on-premises, to a completely separate data centre, and even to a different public cloud or your local environment if you wish. Employing this technology when configuring your environment can make it far more resilient and provide enhanced disaster recovery for your infrastructure.
Build Highly Available Applications
The second option is to build high availability into your applications and infrastructure, making sure you are resilient to outages by spanning your environment across multiple data centres and even multiple public clouds. Using technologies such as Microsoft SQL Server's built-in Always On and IIS load balancing, you can achieve highly available environments that mitigate the risks of an outage. Services such as Azure SQL and Azure Web App Service have high availability built in, replicating your data automatically to multiple Azure data centres.
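To make the idea of spanning multiple data centres more concrete, here is a minimal sketch of client-side failover: try each deployment in order of preference and use the first one that passes a health check. The endpoint URLs, the `first_healthy` helper and the `simulated_probe` function are all hypothetical names invented for illustration. In a real environment the probe would be an actual HTTP health check, and failover would more likely be handled by a load balancer or DNS-based traffic manager than by application code.

```python
from typing import Callable, Iterable, Optional


def first_healthy(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A failing probe (timeout, connection refused) counts as unhealthy.
            continue
    return None


# Hypothetical endpoints in two separate regions, in order of preference.
ENDPOINTS = [
    "https://app-us-east.example.com",
    "https://app-eu-west.example.com",
]


def simulated_probe(endpoint: str) -> bool:
    # Stand-in for a real health check; pretends the US East site is down.
    if "us-east" in endpoint:
        raise ConnectionError("region unavailable")
    return True


if __name__ == "__main__":
    # With US East simulated as down, the EU West endpoint is selected.
    print(first_healthy(ENDPOINTS, simulated_probe))
```

The point of the sketch is that an outage in one region only becomes an outage for your users if there is nowhere else to send them. The second deployment has to exist and be kept in sync before the failure happens.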