Outage Notification

May 28, 2009 Unfuddle

Earlier today, May 27, at approximately 10:15 EST, one of the Unfuddle servers experienced a hardware failure with its attached storage (an Amazon EBS volume).

Immediately upon failure, we contacted the Amazon support team and began the process of diagnosing the problem. At approximately 20:00 EST, the hardware failure was remedied, the volume was restored and all Unfuddle accounts on that server were available as normal.

Why did we take so long to respond? Unfuddle keeps hourly snapshots of all customer data, so it would have been possible from the very moment of the outage to revert to a saved snapshot. However, doing so would have caused everyone on the server to lose approximately one hour of activity on their account – a situation we clearly wanted to avoid. As we worked with Amazon throughout the day, it looked probable that the data on the volume would be recoverable, avoiding any data loss. Unfortunately, only in the early evening did Amazon actually guarantee that the volume was intact and had been recovered successfully.

As many of you know, we have been with Amazon EC2 since the beginning of this year and this is the first significant outage we have experienced since then. Our current data partitioning and snapshotting scheme has been excellent at mitigating risk for our customers. Even today, only about 7% of all Unfuddle accounts were affected. However, we do not consider this outage to be acceptable, and in hindsight we probably should not have waited for the volume to be rebuilt, but rather restored directly from the last viable snapshot.

This morning’s events have given us some very practical ideas as to how we can even further improve upon our snapshotting strategy so that this kind of hardware failure is even less likely to affect our customers in the future. We are already working on implementing these changes.

We apologize for the disruption that this outage has caused you and your teams. As a software development team ourselves, we truly understand the kind of problems that this has caused.

May 28, 2009 Tim

Thank you, UnFuddle staff, for the quick and frank disclosure of this incident and your response. What doesn't kill us makes us rethink our disaster recovery plan! Still love UnFuddle and am happy to see it hardened by experience. My data is still there and I am that much more confident that it will continue to be so.
May 28, 2009 Chris

Thank you for that summary. I definitely appreciate knowing what's going on, and I wince in sympathy at what a hectic mess this must have made of your day. Happy to have my stuff available again with no losses, and glad you folks have my back.
May 28, 2009 Guy

From my experience with EC2 and with other server systems, the best way to mitigate any data loss is to keep a running (streamed backup) if possible.

If using Postgres it would be based on WAL shipping to S3 (or ANOTHER EBS volume). If using MySQL - binlog shipping.
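
For the Postgres case, here is a minimal sketch of what the shipping step could look like. PostgreSQL runs `archive_command` once for each completed WAL segment, substituting `%p` with the segment's path and `%f` with its file name. The function name `archive_wal_segment` and the archive directory are assumptions made up for illustration, and this version copies to a second volume; shipping to S3 would replace the copy step with an upload.

```python
import os
import shutil


def archive_wal_segment(segment_path, segment_name, archive_dir):
    """Copy one completed WAL segment into archive_dir.

    Returns 0 on success and 1 on failure. PostgreSQL treats a non-zero
    exit status from archive_command as "keep the segment and retry".
    """
    destination = os.path.join(archive_dir, segment_name)
    if os.path.exists(destination):
        # Never overwrite a segment that is already archived.
        return 1
    temporary = destination + ".tmp"
    try:
        shutil.copyfile(segment_path, temporary)
        # Rename only after the copy completes, so a crash mid-copy
        # never leaves a truncated file under the final name.
        os.rename(temporary, destination)
    except OSError:
        return 1
    return 0


# Hypothetical wiring (paths are assumptions, not a real setup):
#   postgresql.conf:  archive_command = 'python /path/to/ship_wal.py %p %f'
#   ship_wal.py would call:
#     sys.exit(archive_wal_segment(sys.argv[1], sys.argv[2], "/mnt/wal_archive"))
```

Refusing to overwrite an existing segment and reporting failure through the exit status are the two behaviours the archive step really needs; everything else here is a placeholder.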

If you would like more info and some ideas on how to do it, let me know.
May 28, 2009 Joshua W. Frappier

Thanks for your understanding everyone. It certainly has been an interesting day!

@Guy. Thanks for the advice. After working more with the Amazon team today, we are now homing in on some very specific solutions that will bring us much closer to real-time backup of all customer data.
May 28, 2009 James

Thanks for letting us know the details; that's the important thing. We've all had something fail at some point, so we can relate.

I'm with Guy on the streaming backup. I presume you're running off some form of SQL server, so it shouldn't be hard to have a slave set up somewhere replicating the main database (ideally completely separate from Amazon).

I believe Subversion has replication support built in as well. Not sure about Git, but it would be trivial to recover from anyway: the end user just has to do another push.

That way, in the event of a hardware failure, you have a backup that is probably only a few seconds old (and at most a few minutes).

(Of course you could still accidentally replicate an incorrect DELETE statement, but if you keep the hourly backups as well it's not the end of the world)

It would cost you a little more to run, but I think it's worth it. I'd honestly be more than willing to pay a little extra on my account to help offset this cost; you're pretty cheap as it is, and I'm more concerned about reliability than keeping my margins down.
May 28, 2009 David Laings

I really appreciate your full disclosure - we all know failures happen; what's important is how you deal with them.

I'd be particularly interested in some more details of your snapshotting strategies - I'd like to pick up some tips for the web-based applications I maintain for my clients.

Regards

David
May 28, 2009 Richard Vanbergen

I wish all companies would be this frank about what happened.

Currently I'm on a free account but I'm seriously considering upgrading in the coming months. Although the outage didn't affect me (I haven't had any activity in about 12 hours) I feel more secure knowing that you as a company know the importance of transparency, honesty and learning from mistakes.

Well done! :)
May 28, 2009 Damen

Thanks for the frank and honest update. I have been impressed by the rapid feedback and support. You have handled this well.
May 28, 2009 Chris

So the snapshots are taken every quarter-after? Interesting choice, but only 15 minutes of changes would have been lost if it were done on the dot :).
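
A quick sketch of that arithmetic, assuming snapshots run once per hour at a fixed minute offset. The post does not state the real schedule, so the offsets below are guesses for illustration; the failure time is the approximate one given in the post.

```python
from datetime import datetime, timedelta


def last_snapshot_before(failure, minute_offset):
    """Return the most recent hourly snapshot time at or before `failure`.

    Snapshots are assumed to run once per hour, `minute_offset` minutes
    past the hour.
    """
    candidate = failure.replace(minute=minute_offset, second=0, microsecond=0)
    if candidate > failure:
        # This hour's snapshot had not run yet; fall back to the previous one.
        candidate -= timedelta(hours=1)
    return candidate


def minutes_lost(failure, minute_offset):
    """Minutes of activity lost by reverting to the last snapshot."""
    gap = failure - last_snapshot_before(failure, minute_offset)
    return gap.total_seconds() / 60


failure = datetime(2009, 5, 27, 10, 15)  # approximate failure time from the post
print(minutes_lost(failure, 0))   # snapshots on the hour: 15.0
print(minutes_lost(failure, 16))  # snapshot due one minute after the failure: 59.0
```

With an hourly schedule the loss is anywhere from zero to just under 60 minutes depending on where the failure lands, so no offset is better on average; a shorter interval is what shrinks the window.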