GitHub has delivered its full post-mortem on the series of outages that totalled more than eight hours last month, detailing exactly why its database infrastructure let it down. Again.
The post, from Keith Ballinger, the Microsoft subsidiary's SVP of engineering, described the February outages as "multiple service interruptions that resulted in degraded service for a total of eight hours and 14 minutes over four distinct events".
The short explanation was that “Unexpected variations in database load, coupled with an unintended configuration issue introduced as a part of ongoing scaling improvements, led to resource contention in our mysql1 database cluster.” While the code repo outfit had been scaling up its data ops, “much of our core dataset” still resides in its original cluster.
The first outage hit on February 19, when “an unexpectedly resource-intensive query began running against our mysql1 database cluster”. While the plan was to run this load against its read replica pool at a much lower frequency, “we inadvertently sent this traffic to the master of the cluster, increasing the pressure on that host beyond surplus capacity.”
This all overloaded ProxySQL, “which is responsible for connection pooling, resulting in an inability to consistently perform queries.”
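The mix-up described here is the classic read/write-splitting pitfall: read traffic that should fan out across a replica pool gets pinned to the master instead. Here is a minimal sketch of the pattern in Python; the class, host names, and `force_primary` flag are all illustrative assumptions, not GitHub's actual code.

```python
import random

class DatabaseRouter:
    """Routes queries to a primary host or a read-replica pool.

    Hypothetical sketch of read/write splitting. Misrouting a heavy
    read query to the primary, as in GitHub's first incident, is the
    failure mode this pattern is meant to avoid.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def route(self, query, force_primary=False):
        # Writes (and anything explicitly pinned) go to the primary.
        if force_primary or not query.lstrip().upper().startswith("SELECT"):
            return self.primary
        # Reads are spread across the replica pool.
        return random.choice(self.replicas)

router = DatabaseRouter(primary="mysql1-primary",
                        replicas=["replica-1", "replica-2"])

# Intended behaviour: the resource-intensive query lands on a replica.
assert router.route("SELECT * FROM abilities") in ["replica-1", "replica-2"]

# The bug class: read traffic inadvertently pinned to the master.
assert router.route("SELECT * FROM abilities", force_primary=True) == "mysql1-primary"
```

The sketch also shows why the blast radius is asymmetric: a replica can shed or share load, while the master has no "surplus capacity" to spare, so the connection pooler in front of it (ProxySQL, in GitHub's case) is the first thing to buckle.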
Two days later, "a planned master database promotion" once again triggered a ProxySQL failure.
The third incident, on February 25, again involved ProxySQL, when "active database connections crossed a critical threshold that changed the behavior of this new infrastructure. Because connections remained above the critical threshold after remediation, the system fell back into a degraded state."
Then, on February 27, GitHub experienced the big one: an outage of four hours and 23 minutes. This was because "Application logic changes to database query patterns rapidly increased load on the master of our mysql1 database cluster. This spike slowed down the cluster enough to affect availability for all dependent services."
Ballinger said GitHub had made changes to allow it to detect and address problems more quickly: "Remediating these issues was straightforward once we tracked down interactions between systems." It was also devoting "more energy" to understanding the performance characteristics of ProxySQL at scale, and the effects it can have on other services, before users are affected.
Ballinger added: "We shipped a sizable chunk of data partitioning efforts we've worked on for the past six months just days after these incidents for one of our more significant MySQL table domains, the 'abilities' table." These changes have reduced load on the mysql1 cluster master by 20 percent, and queries per second by 15 percent.
The firm is also working to reduce reads on master databases by moving them to replicas, and to complete "in-flight functional partitioning of the mysql1 cluster, as well as identifying other domains to partition". It is also refining its dashboards and sharding its largest schema set.
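Functional partitioning of this kind usually boils down to a routing layer that knows which cluster owns which table domain, so that a heavy domain like "abilities" can be peeled off the original cluster. A minimal sketch, with hypothetical cluster names that are not GitHub's:

```python
# Hypothetical mapping from table domain to owning database cluster,
# illustrating functional partitioning: the newly split-out domain
# lives on its own cluster, everything else stays on mysql1.
PARTITION_MAP = {
    "abilities": "mysql-abilities",  # assumed name for the new cluster
}
DEFAULT_CLUSTER = "mysql1"

def cluster_for(table_domain):
    """Return the cluster that owns a given table domain."""
    return PARTITION_MAP.get(table_domain, DEFAULT_CLUSTER)

assert cluster_for("abilities") == "mysql-abilities"  # partitioned out
assert cluster_for("repositories") == "mysql1"        # still on the original cluster
```

The payoff is exactly what the post reports: every query for a partitioned domain is load that the mysql1 master no longer has to absorb.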
If it strikes you as odd that it is not making more play of better reporting, or of introducing chaos engineering, that's because it already pledged to do those things back in 2018, when it suffered a 24-hour outage after a brief loss of connectivity threw its database clusters on the West and East coasts out of sync.
And it's not as if GitHub is alone. Running a cloud platform is… hard. Parent Microsoft saw problems with its Azure platform this week, while at the time of writing, Google was rolling out a fix after seeing problems across its GCP services.