Amazon Outage: Recovery Process Itself an Unexpected Load
Amazon AWS in Northern Virginia continues to struggle to bring the crisis to a complete conclusion. From the AWS Service Health Dashboard:
Apr 23, 1:55 AM PDT We are continuing to work on unblocking the bottleneck that is limiting the speed with which we can re-establish connections between volumes and their instances. We will continue to keep everyone updated as we have additional information.
Earlier: 9:11 PM PDT
We wanted to give a more detailed update on the state of our recovery. At this point, we have recovered a large number of the stuck volumes and are in the process of recovering the remainder. We have added significant storage capacity to the cluster, and storage capacity is no longer a bottleneck to recovery. Some portion of these volumes have lost the connection to their instance, and are waiting to be connected before normal operations can resume. In order to re-establish this connection, we need to allow the instances in the affected Availability Zone to access the EC2 control plane service. There are a large number of control plane requests being generated by the system as we re-introduce instances and volumes. The load on our control plane is higher than we anticipated. We are re-introducing these instances slowly in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions. We are currently investigating several avenues to unblock this bottleneck and significantly increase the rate at which we can restore control plane access to volumes and instances– and move toward a full recovery.
The team has been completely focused on restoring access to all customers, and as such has not yet been able to focus on performing a complete post mortem. Once our customers have been taken care of and are fully back up and running, we will post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesnt happen again. Once we have additional information on the progress that is being made, we will post additional updates.