GitHub reveals database infrastructure was the villain behind February spate of outages. Again.

By Team Devclass

March 27, 2020

GitHub has delivered its full post mortem on the series of outages that totalled over eight hours just last month, detailing exactly why its database infrastructure let it down. Again.

The post, from the Microsoft sub’s SVP for engineering, Keith Ballinger, described the February outage as “multiple service interruptions that resulted in degraded service for a total of eight hours and 14 minutes over four distinct events”.

The short explanation was that “Unexpected variations in database load, coupled with an unintended configuration issue introduced as a part of ongoing scaling improvements, led to resource contention in our mysql1 database cluster.” While the code repo outfit had been scaling up its data ops, “much of our core dataset” still resides in its original cluster.

The first outage hit on February 19, when “an unexpectedly resource-intensive query began running against our mysql1 database cluster”. While the plan was to run this load against its read replica pool at a much lower frequency, “we inadvertently sent this traffic to the master of the cluster, increasing the pressure on that host beyond surplus capacity.”

This all overloaded ProxySQL, “which is responsible for connection pooling, resulting in an inability to consistently perform queries.”
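To make the misrouting concrete, here is a minimal sketch of read/write splitting, the general technique of sending read-only queries to replicas and everything else to the master. It is not GitHub's code or ProxySQL's rule engine: the host names, the `choose_host` function and the regex-based classification are all assumptions made up for illustration.

```python
import itertools
import re

# Hypothetical host names, for illustration only.
MASTER = "mysql1-master"
REPLICAS = ["mysql1-replica-a", "mysql1-replica-b"]

# Statements treated as read-only in this sketch.
_READ_ONLY = re.compile(r"^\s*(select|show|explain)\b", re.IGNORECASE)
# Locking reads take row locks, so they must run on the master.
_LOCKING = re.compile(r"\bfor\s+update\b", re.IGNORECASE)

# Simple round-robin over the replica pool.
_replica_cycle = itertools.cycle(REPLICAS)


def is_read_only(sql: str) -> bool:
    """Return True if the statement can safely run on a replica."""
    return bool(_READ_ONLY.match(sql)) and not _LOCKING.search(sql)


def choose_host(sql: str, *, in_transaction: bool = False) -> str:
    """Pick the host a statement should be sent to."""
    # Writes, locking reads and anything inside a transaction go to the master.
    if in_transaction or not is_read_only(sql):
        return MASTER
    # Plain reads are spread across the replica pool.
    return next(_replica_cycle)


if __name__ == "__main__":
    print(choose_host("SELECT id FROM repos WHERE owner = 1"))
    print(choose_host("UPDATE repos SET stars = stars + 1 WHERE id = 1"))
```

In a setup like this, a heavy read that bypasses the routing step, or is misclassified by it, lands on the master alongside all the write traffic, which is the kind of pressure the post describes.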

Two days later, “a planned master database promotion” once again triggered a ProxySQL failure.

The third incident, on February 25, again involved ProxySQL, when “active database connections crossed a critical threshold that changed the behavior of this new infrastructure. Because connections remained above the critical threshold after remediation, the system fell back into a degraded state.”

Then, on February 27, GitHub experienced the big one, with a four-hour, 23-minute outage. This was because “Application logic changes to database query patterns rapidly increased load on the master of our mysql1 database cluster. This spike slowed down the cluster enough to affect availability for all dependent services.”
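For a sense of scale, the two durations GitHub gives can be set against each other. The short calculation below uses only the eight hours and 14 minutes total and the four hours and 23 minutes for February 27; the post as reported here does not break out the other three events individually, so only their combined length can be derived. The `fmt` helper is just for display.

```python
from datetime import timedelta

total = timedelta(hours=8, minutes=14)   # all four events combined
feb_27 = timedelta(hours=4, minutes=23)  # the longest single event

remaining = total - feb_27  # February 19, 21 and 25 combined
share = feb_27 / total      # fraction of total downtime from February 27


def fmt(duration: timedelta) -> str:
    """Format a duration as hours and minutes."""
    minutes = int(duration.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


print(fmt(remaining))   # 3h 51m
print(f"{share:.0%}")   # 53%
```

In other words, the final incident alone accounted for just over half of the month's degraded service.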

Ballinger said GitHub had made changes to allow it to detect and address problems more quickly. “Remediating these issues were straightforward once we tracked down interactions between systems.” It was also devoting “more energy” to understanding the performance characteristics of ProxySQL at scale and the effects it can have on other services, before users are affected.

Ballinger added that “We shipped a sizable chunk of data partitioning efforts we’ve worked on for the past six months just days after these incidents for one of our more significant MySQL table domains, the ‘abilities’ table.” These changes have reduced load on the mysql1 cluster master by 20 percent, and queries per second by 15 percent.

The firm is also working to move reads off its master databases and on to replicas, and is completing “in-flight functional partitioning of the mysql1 cluster, as well as identifying other domains to partition”. It is also refining its dashboards, and sharding its largest schema set.

If it strikes you as odd that it is not making more play of better reporting, or introducing chaos engineering, that’s because it already pledged to do those back in 2018, when it suffered a 24-hour outage after a brief loss of connectivity threw its database clusters on the West and East coasts out of sync.

And it’s not like GitHub is alone. Running a cloud platform is…hard. Parent Microsoft saw problems with its Azure platform this week, while at the time of writing, Google was rolling out a fix after seeing problems across its GCP services.
