GitHub promises better reporting, tries chaos engineering in wake of 24-hour outage

GitHub will junk its traffic light status system and invest in “chaos engineering tooling” to ensure there is no repeat of the series of unfortunate events that turned a 43-second connectivity loss into a 24-hour outage last week.

The code repo giant, which has just been taken over by Microsoft, yesterday published an extremely detailed explanation of the outage, walking through almost minute by minute the events that led to both the downtime and a series of over-optimistic statements about when full service would be restored.

Suffice it to say that routine maintenance work on October 21 resulted in a 43-second loss of connectivity between its US East Coast network hub and its primary US East Coast data center.

Writes made during that brief window were not replicated to its West Coast facility, meaning “database clusters in both data centers now contained writes that were not present in the other data center”, and as a result “we were unable to fail the primary back over to the US East Coast data center safely.” Failing forward to the West Coast was complicated by the latency of a cross-country trip.
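
For illustration only: GitHub’s write-up does not spell out the exact replication mechanics, but assuming GTID-based MySQL replication, the check that tells you two primaries have diverged, and that a clean failback is therefore impossible without dropping writes, looks roughly like the sketch below. Hostnames and credentials are placeholders.

```python
# Hypothetical sketch: detect diverged MySQL primaries by comparing GTID sets.
# Hostnames and credentials are placeholders, not GitHub's real topology.
import mysql.connector


def gtid_executed(host: str) -> str:
    """Return the set of transactions (GTIDs) this server has applied."""
    conn = mysql.connector.connect(host=host, user="repl_check", password="secret")
    try:
        cur = conn.cursor()
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        return cur.fetchone()[0]
    finally:
        conn.close()


def is_subset(host: str, small: str, big: str) -> bool:
    """Ask MySQL whether GTID set `small` is wholly contained in `big`."""
    conn = mysql.connector.connect(host=host, user="repl_check", password="secret")
    try:
        cur = conn.cursor()
        cur.execute("SELECT GTID_SUBSET(%s, %s)", (small, big))
        return bool(cur.fetchone()[0])
    finally:
        conn.close()


east = gtid_executed("mysql-primary.us-east.example")
west = gtid_executed("mysql-primary.us-west.example")

if is_subset("mysql-primary.us-west.example", east, west):
    print("East is behind West: East can safely be rebuilt as a replica of West.")
elif is_subset("mysql-primary.us-west.example", west, east):
    print("West is behind East: failing back to East is safe.")
else:
    # Each side holds writes the other lacks -- the situation GitHub describes.
    print("Clusters have diverged: failback needs manual reconciliation.")
```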

By 23:19, it said, “We made an explicit choice to partially degrade site usability by pausing webhook delivery and GitHub Pages builds instead of jeopardizing data we had already received from users. In other words, our strategy was to prioritize data integrity over site usability and time to recovery.”

Unfortunately, the time to recover kept moving further away, with lengthy backup restores only adding to its problems. You can read the entire explanation here.

In the end, it assured customers “no user data was lost; however manual reconciliation for a few seconds of database writes is still in progress.”

The firm has clearly done some soul-searching over the past eight days, and has now announced a series of “next steps”. These include ongoing analysis of the “relatively small” number of MySQL logs that were not replicated, with a view to reconciling them.

It accepted that its estimates of when it would be back online did not take into account all the “variables”, and said it will “strive to provide more accurate information in the future.”

On a technical level, it will “Adjust the configuration of Orchestrator (used to manage the MySQL clusters) to prevent the promotion of database primaries across regional boundaries.”
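
Orchestrator is configured through a JSON file, and its documented settings include guards along the lines of PreventCrossDataCenterMasterFailover (with a region-level equivalent in later releases). GitHub has not published the exact change it made, so the following Python sketch, which simply verifies that such a guard is switched on in a local orchestrator.conf.json, is illustrative rather than a record of what was deployed.

```python
# Illustrative check that an orchestrator config forbids wide-area promotion
# of a database primary. Flag names follow orchestrator's documented settings
# (PreventCrossDataCenterMasterFailover / PreventCrossRegionMasterFailover);
# they are an assumption here, not a quote from GitHub's post-mortem.
import json
import sys

GUARDS = ("PreventCrossDataCenterMasterFailover", "PreventCrossRegionMasterFailover")


def check(path: str = "/etc/orchestrator.conf.json") -> int:
    with open(path) as fh:
        conf = json.load(fh)
    unset = [guard for guard in GUARDS if not conf.get(guard, False)]
    if unset:
        print(f"WARNING: {path} still allows cross-DC/region promotion; unset: {unset}")
        return 1
    print(f"OK: {path} restricts primary promotion to the local data center and region.")
    return 0


if __name__ == "__main__":
    sys.exit(check(*sys.argv[1:]))
```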

It has accelerated a migration to a new status reporting mechanism “that will provide a richer forum for us to talk about active incidents in crisper and clearer language.” Simply being able to set status to green, yellow, or red “doesn’t give you an accurate picture of what is working and what is not”, it said, and in future it will display the different components of the platform so users know the status of each service.

A pre-existing effort “to support serving GitHub traffic from multiple data centers in an active/active/active design”, allowing it to tolerate the full failure of a single data center without user impact, has now been given added urgency.

More broadly, “We have learned that tighter operational controls or improved response times are insufficient safeguards for site reliability within a system of services as complicated as ours.”

Hence, it will begin “validating failure scenarios before they have a chance to affect you” and invest in fault injection and chaos engineering tooling.
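
GitHub has not said what that tooling will look like. The underlying idea of fault injection, though, is straightforward: during a controlled experiment, make a configurable fraction of calls to a dependency fail or slow down, then check that the rest of the system degrades the way you expect. A minimal, purely illustrative Python sketch (the deliver_webhook stub is invented for the example):

```python
# Minimal fault-injection sketch -- not GitHub's tooling. During a chaos
# experiment, some calls to a dependency are delayed or failed so you can
# observe how callers cope.
import functools
import random
import time


def inject_faults(failure_rate=0.1, max_extra_latency_s=2.0, enabled=lambda: True):
    """Decorator that randomly injects latency or errors into a call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if random.random() < failure_rate:
                    raise ConnectionError(f"chaos: injected failure in {func.__name__}")
                time.sleep(random.uniform(0, max_extra_latency_s))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.2, max_extra_latency_s=0.5)
def deliver_webhook(payload: dict) -> str:
    # Stand-in for a real dependency call (network, queue, etc.).
    return f"delivered {payload['event']}"


if __name__ == "__main__":
    delivered = failed = 0
    for i in range(50):
        try:
            deliver_webhook({"event": f"push-{i}"})
            delivered += 1
        except ConnectionError:
            failed += 1  # A resilient caller would retry or queue here.
    print(f"{delivered} delivered, {failed} failed under injected faults")
```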

The outage happened just days before Microsoft completed its takeover of GitHub. Being part of a much, much bigger organisation might also be expected to help bolster its efforts. Then again, Microsoft is not completely immune to outages, with Azure in general and its Azure DevOps service both taking a dive in recent months.