GitHub greenlights revamped status page, after red (faced) October

GitHub has hit the green button on a new status page, just over a month after a catastrophic outage that showed its current mechanisms were just not up to par.

In October’s snafu, routine maintenance work resulted in a 43-second loss of connectivity between its US East Coast network hub and its primary US East Coast data center, before a cascade of unintended consequences turned this into a 24-hour outage, exacerbated by over-optimistic status reports.

In response, GitHub accelerated an overhaul of its reporting, promising to junk its traffic light status system and invest in “chaos engineering tooling”. It probably didn’t help that the outage came days after it was officially taken over by Microsoft.

This week it unveiled a new GitHub Status Site, and began deprecating the previous incarnation, meaning customers who have integrated it into their own ops have some work to do.

The new site is said to be more granular than its predecessor, listing the individual platform components and, crucially, their individual statuses. “This makes our messaging during an incident more accurate and reliable,” GitHub staffer Jamie Hannaford said in a blogpost.

In addition, he said, component status updates are decoupled from the lifecycle of an incident, and specific mitigation steps can be shared: “In other words, status updates are snapshots in time of a specific component, and incidents are trackable communications between GitHub and customers.”
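To make that distinction concrete, here is a minimal sketch of how a status page could model component snapshots and incidents as separate things. It is not GitHub's implementation: every class, field, status value and component name below is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    # Hypothetical status levels; the article does not list the real ones.
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    OUTAGE = "outage"


@dataclass(frozen=True)
class ComponentSnapshot:
    """State of one component at one moment; holds no reference to an incident."""
    component: str
    status: Status
    taken_at: datetime


@dataclass
class Incident:
    """A trackable thread of messages, kept separate from any snapshot."""
    title: str
    updates: list = field(default_factory=list)

    def post(self, message: str) -> None:
        self.updates.append(message)


def worst_status(snapshots) -> Status:
    """Roll per-component snapshots up into a single headline status."""
    order = [Status.OPERATIONAL, Status.DEGRADED, Status.OUTAGE]
    return max((s.status for s in snapshots), key=order.index,
               default=Status.OPERATIONAL)


# Illustrative data only; the component names are made up.
now = datetime.now(timezone.utc)
snapshots = [
    ComponentSnapshot("Git Operations", Status.OPERATIONAL, now),
    ComponentSnapshot("Webhooks", Status.DEGRADED, now),
]
incident = Incident("Delayed webhook deliveries")
incident.post("Investigating")
incident.post("Mitigation: deliveries are being retried")
```

Because a snapshot never points at an incident, a component can be marked degraded without an incident being open, and an incident can keep collecting updates after every component has gone back to green.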

If you rely on GitHub, or simply have excessive FOMO, you can now subscribe to its status changes in multiple ways, such as email, SMS, or webhook delivery. “These subscriptions can follow the entire lifecycle of an incident from investigation to remediation,” Hannaford said.
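For teams planning to wire the webhook option into their own ops tooling, the sketch below shows one way a receiver might turn a notification into a one-line message. The payload shape, the field names (`incident`, `name`, `stage`) and the stage labels are all assumptions made for illustration; the article only says subscriptions run "from investigation to remediation", so check the real payload before relying on any of this.

```python
import json

# Hypothetical lifecycle stages, in order; not taken from GitHub's documentation.
STAGES = ("investigating", "mitigating", "resolved")


def summarize(raw_body: str) -> str:
    """Turn one (hypothetical) webhook body into a one-line ops-channel message."""
    event = json.loads(raw_body)
    incident = event.get("incident", {})
    stage = incident.get("stage", "unknown")
    name = incident.get("name", "unnamed incident")
    if stage not in STAGES:
        # Surface unexpected values rather than silently dropping the event.
        return f"[unrecognized stage: {stage}] {name}"
    step = STAGES.index(stage) + 1
    return f"[{step}/{len(STAGES)} {stage}] {name}"


# Example with a made-up payload.
body = json.dumps({"incident": {"name": "Delayed webhook deliveries",
                                "stage": "mitigating"}})
message = summarize(body)
```

Keeping the parsing in one small function means that only `summarize` needs to change once the actual payload format is known.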

The old site – and its API – will be terminated over the next three months, with a redirect kicking in at the end of February, to coincide with the killing of the old API.

In the meantime, the new site will be tested on live systems, including a trial run of the new incident response workflow on December 18.

Hannaford warned, “You will notice during that time that services may appear to be degraded via the status site but that won’t be the case. If you do find any issues or problems with the status site during this time, please reach out to us.”