Recovery Strategy Improvements

Jun 8, 2009 Unfuddle

As we intimated in our previous post, the outage we experienced offered us an opportunity to evaluate our disaster recovery plan under pressure. While I am glad to say that no data was lost in the hardware failure, our team was convinced that there was definite room for improvement. After bringing the affected customers back online, we immediately began work on evaluating alternate systems and processes that would have shortened this week's downtime dramatically.

We are now taking snapshots of all customer data at 5-minute intervals. This provides us with two distinct advantages:

As many of you may know, Amazon EBS volumes are already redundant, so a hardware failure on an EBS volume usually means a drastic reduction in speed rather than a complete failure. This was the case on Wednesday. When performance is reduced, we can take down the affected server and take a final snapshot that captures any disk activity since the last 5-minute snapshot. This should go fairly quickly, even on a volume experiencing problems.
In the case of a catastrophic failure of an EBS volume, we can very quickly restore customer data from the last snapshot, losing at most about 5 minutes of data.
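
To make the catastrophic-failure case concrete, here is a minimal sketch of how the restore point and the resulting data-loss window could be worked out. It is an illustration only, not our actual tooling: the snapshot records are plain dictionaries, and the field names `taken_at` and `completed` are made up for this example.

```python
from datetime import datetime, timedelta


def latest_snapshot(snapshots, failure_time):
    """Return the newest completed snapshot taken at or before failure_time.

    Each snapshot is a dict with hypothetical keys 'taken_at' (datetime)
    and 'completed' (bool). Returns None if nothing usable exists.
    """
    candidates = [
        s for s in snapshots
        if s["completed"] and s["taken_at"] <= failure_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["taken_at"])


def data_loss_window(snapshots, failure_time):
    """Return how much recent activity would be lost by restoring."""
    snapshot = latest_snapshot(snapshots, failure_time)
    if snapshot is None:
        return None
    return failure_time - snapshot["taken_at"]


# Snapshots every 5 minutes starting at 12:00; the volume fails at 12:13.
start = datetime(2009, 6, 3, 12, 0)
snapshots = [
    {"taken_at": start + timedelta(minutes=5 * i), "completed": True}
    for i in range(3)
]
failure = start + timedelta(minutes=13)
loss = data_loss_window(snapshots, failure)
```

With snapshots at 12:00, 12:05 and 12:10, a failure at 12:13 restores from the 12:10 snapshot, so the window never exceeds the snapshot interval as long as each snapshot completes.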

If we had been taking 5-minute snapshots before Wednesday, the downtime would have been reduced to approximately 30 minutes, the time it takes one of us to manually snapshot the affected volume, create a new volume from that snapshot and reattach it.
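
The three manual steps described above (snapshot the affected volume, create a new volume from that snapshot, reattach it) could be scripted roughly as follows. This is a sketch under assumptions rather than the procedure we ran: it assumes `ec2` is a boto3-style EC2 client (a library newer than this post), and the function name and its parameters are hypothetical.

```python
def recover_volume(ec2, volume_id, instance_id, device, availability_zone):
    """Snapshot a degraded volume, build a replacement, and swap it in.

    Assumes `ec2` behaves like a boto3 EC2 client and that the affected
    server has already been taken down, so the snapshot is consistent.
    Returns the ID of the replacement volume.
    """
    # 1. Take a final snapshot of the affected volume and wait for it.
    snapshot_id = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="final snapshot before replacement",
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 2. Create a new volume from that snapshot in the instance's zone.
    new_volume_id = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])

    # 3. Detach the old volume, then attach the new one at the same device.
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(
        VolumeId=new_volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_volume_id])
    return new_volume_id
```

Passing the client in as an argument keeps the sketch free of credentials and region settings, and makes it easy to rehearse the sequence against a stand-in client before it is needed in a real outage.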

I want to thank all of you for your support and suggestions since this outage. Know that we are committed to the integrity and availability of your data and we will continue to evolve our systems and processes to make Unfuddle even more solid.