Shoot the Automated Failure in the Head

This past week GitHub experienced their most significant service disruption of the year, and much of it came at the hands of an automated failover system they had designed to try to avoid disruptions. There are a number of different factors that made the situation as bad as it was, but the basic summary of what led to the problem looks like this:

  1. On Monday, they attempted a schema migration which led to a load spike.
  2. The high load triggered an automated failover to one of their MySQL slaves.
  3. Once failed over to, the new master also experienced high load, and so the automated failover system attempted to revert back.
  4. At this point, the ops team put the automated failover system into “maintenance mode” to prevent further failovers.

There’s actually more that went wrong for them after this point, and I encourage you to read the full post on the GitHub blog, but I want to focus on the initial problems for a moment.

Our database team at OmniTI is often asked what type of process we recommend for dealing with failover situations, and we stand by our assessment that, for most people, manual failover is typically the best option. It’s not that the idea of automated failover isn’t appealing, but the decisions involved can be very complex, and that complexity is hard to get right in a scripted fashion. In their post, the GitHub team mentions that had a person been involved in the decision, neither of the failovers would have been performed.

To be clear, manual failover should not mean a bunch of manual steps; I think many people get confused by this idea. When you do need to fail over, you need it to happen as quickly, and as correctly, as possible. When we say “manual” failover, we mean that the decision to fail over should be manual, but the process itself should be as scripted and automated as possible.
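To make that concrete, here’s a minimal sketch of what “the decision is manual, the process is scripted” can look like. The host names, promotion commands, and the --yes-i-am-sure flag are all hypothetical placeholders for whatever your environment actually needs; the point is simply that a human makes the call, and a single invocation then runs the same vetted steps every time:

```python
#!/usr/bin/env python3
"""Sketch of a failover script where only the *decision* is manual.

Everything here is a placeholder: the host names, the promotion commands,
and the --yes-i-am-sure flag stand in for whatever your environment needs.
The point is that a human invokes it once, and the rest runs unattended.
"""
import argparse
import subprocess
import sys


def run(cmd):
    """Run one scripted step, stopping loudly if it fails."""
    print("==> " + " ".join(cmd))
    subprocess.run(cmd, check=True)


def promote(new_master):
    """Promote a replica; the exact commands depend on your replication setup."""
    run(["ssh", new_master, "mysql", "-e", "STOP SLAVE; RESET SLAVE ALL;"])
    run(["ssh", new_master, "sudo", "move-writer-vip", "--to-me"])  # hypothetical helper
    run(["ssh", new_master, "sudo", "systemctl", "start", "app-writes"])


def main():
    parser = argparse.ArgumentParser(description="Promote a replica to master.")
    parser.add_argument("new_master", help="replica to promote, e.g. db2.example.com")
    parser.add_argument("--yes-i-am-sure", action="store_true",
                        help="required: keeps the failover decision with a human")
    args = parser.parse_args()

    if not args.yes_i_am_sure:
        sys.exit("Refusing to fail over without an explicit human decision "
                 "(re-run with --yes-i-am-sure).")
    promote(args.new_master)


if __name__ == "__main__":
    main()
```

Invoked as something like `./failover.py db2.example.com --yes-i-am-sure`, it does nothing to decide when to fail over; it just makes the failover you’ve already decided on fast and repeatable.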

Another key factor in setting up scripted failover systems, and one that we see forgotten time and time again, is ye old STONITH concept. While it’s not 100% clear from the description in the GitHub post, it seems that not only did their system allow automated failover, it was also allowed to do automated fail-back. Just as the decision to fail over needs to be manual, I always like to have at least one manual step after failover that is needed to declare the system “back to normal”. This is extremely useful because it acts as a clear sign to your ops team that everyone agrees things really are back to normal. Until that happens, your scripted failover solution should refuse to run; why allow failover back to a machine that you’ve not agreed is ready to go back into service?
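For the fail-back side, here’s a minimal sketch of such a gate, assuming a simple marker file that an operator has to create by hand; the path and the script around it are illustrative rather than any particular tool’s interface:

```python
#!/usr/bin/env python3
"""Sketch of a manual 'back to normal' gate in front of any fail-back.

The marker path is illustrative, not a real tool's interface: an operator
creates it by hand only after everyone agrees the old master is healthy,
and until then the scripted fail-back simply refuses to run.
"""
import os
import sys

# Hypothetical marker an operator creates manually (e.g. with `touch` over ssh)
# once the old master has been inspected and signed off on.
ALL_CLEAR_MARKER = "/etc/failover/old-master-verified"


def assert_manual_all_clear():
    if not os.path.exists(ALL_CLEAR_MARKER):
        sys.exit("Fail-back blocked: nobody has signed off that the old master "
                 "is back to normal (create %s to proceed)." % ALL_CLEAR_MARKER)


if __name__ == "__main__":
    assert_manual_all_clear()
    print("Manual all-clear found; scripted fail-back steps would run here.")
```

The mechanism matters less than the ritual: removing that one bit of friction is exactly what turns a single failover into the kind of flapping GitHub described.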

I know none of this sounds particularly sexy, but it’s battle tested and it works. And if you really don’t think you can wait for a human to intervene, build your systems for fault tolerance, not failover; just be warned that it is more expensive, complicated, and time-consuming to implement (and the current open source options leave a lot to be desired).

Wondering about ways to help ensure availability in your environment? I’ll be speaking at Velocity Europe the first week of October, talking about “Managing Databases in a DevOps Environment”; if you’re going to be attending, I’d love to swap war stories. And yes, that’s the week after Surge, which is war story nirvana; if you haven’t gotten tickets for one of these events, there’s still time; I hope to see you there.