GitHub has cast a sliver of light on the cause of the outages that have plagued the code hosting platform in recent weeks.
CEO Nat Friedman was forced to take to Twitter last week to apologise for the outages, after the Microsoft-owned platform took two substantial lie-downs in a matter of days. However, while he said “we take reliability very seriously”, he gave no reason for the company’s failure to deliver the same.
On Friday, GitHub SVP for engineering Keith Ballinger added his apology to the mix, before going some way towards explaining what the actual problem was.
“These incidents were distinct events, but have a common theme of uncovering new challenges in scaling our database tier,” he said. “Specifically, increased load on our largest database cluster contributed to degradations across multiple services.”
Ballinger promised “a more in-depth and technical report of these events and the work we are doing to improve the scalability and performance of our backend systems.”
By way of reassurance, he added, “We have several data partitioning initiatives already in progress, and we’ll be rolling out some of this work very soon. You can follow our status page for updates about the availability of our systems.”
Yep, that’s it. It’s all the database’s fault. Ballinger doesn’t go into depth about exactly which database is at fault. However, back in 2018, in the wake of a 24-hour outage, a lengthy mea culpa referred to problems with the MySQL clusters underpinning the service.
At the time, GitHub said it would adjust the configuration of Orchestrator, which it used to manage the MySQL clusters, while a pre-existing effort “to support serving GitHub traffic from multiple data centers in an active/active/active design…to tolerate the full failure of a single data center failure without user impact” was given added urgency. It also pledged to use more chaos engineering to envisage likely failure scenarios, and to improve its reporting.
It’s fair to say it’s delivered on at least one of those – it was much easier for users to confirm it was indeed GitHub that was the problem last week… and the week before that.