GitHub outage exposed Rust crates.io flaw

GitHub’s wobbles over the last couple of weeks helped expose a bug in the crates.io webapp, the Rust team revealed last week.

Crates.io is the Rust community’s crate – or package – registry. The Inside Rust blog reported last week that on February 20 the project received a report “from a user of crates.io that their crate was not available on the index even after 10 minutes since the upload.”

The cause turned out to be “a bug in the crates.io webapp exposed by a GitHub outage.” Drilling down to the root cause, the team found, “In some corner cases the code that uploads new commits to the GitHub repository of the index was returning a successful status even though the push itself failed.”

The bug meant that the job scheduler thought the upload was actually successful, “causing the job to be removed from the queue and producing a data loss.”
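The failure mode described above can be illustrated with a minimal Rust sketch. It is not the crates.io code: `IndexJob`, `PushFailed` and `run_next` are hypothetical names invented for illustration, and the push itself is passed in as a closure. The point is only that a job should leave the queue when, and only when, the push reports success.

```rust
use std::collections::VecDeque;

/// Hypothetical job: publish one crate to the index.
#[derive(Debug, PartialEq)]
struct IndexJob {
    crate_name: String,
}

/// Hypothetical error returned when the push to the index repository fails.
#[derive(Debug, PartialEq)]
struct PushFailed(String);

/// Runs the job at the front of the queue using the supplied `push` function.
/// The job is removed only if `push` returns `Ok`; a failed job stays queued
/// so that it can be retried later. Returns `None` if the queue is empty.
fn run_next<F>(queue: &mut VecDeque<IndexJob>, push: F) -> Option<Result<(), PushFailed>>
where
    F: FnOnce(&IndexJob) -> Result<(), PushFailed>,
{
    let outcome = push(queue.front()?);
    if outcome.is_ok() {
        queue.pop_front();
    }
    Some(outcome)
}
```

If `push` wrongly returns `Ok(())` for a push that failed, as the team says happened in some corner cases, the job is dropped and never retried, which matches the data loss the report describes.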

They went on to say, “The outage was caused by that bug, triggered by an unexpected response during the GitHub outage happening at the same time.”

Once the team found the offending code, a fix was written and deployed directly to production. The team added, “At the same time, once we saw the index started to be updated again, we removed the broken entries in the database manually and asked the reporter to upload their crates again.”

That said, the team noted that deploying the change took longer than expected, because changes that had already landed in master were still waiting to be deployed to production, which lengthened the build process.

“In the future we should deploy hotfixes by branching off the current deployed commit, and cherry-picking the fix on top of that. We should also strive to reduce the amount of time PRs sit in master without being live,” they concluded.

The incident also highlighted gaps in Rust’s own monitoring setup. “We have monitoring in place for jobs failing to execute, but in this case the job was mistakenly marked as correct. We should implement periodic checks that ensure the database and the index are correctly synchronized.”
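A periodic synchronization check of the kind the team proposes could look roughly like the Rust sketch below. It assumes, purely for illustration, that both the database and the index can be reduced to sets of (crate name, version) pairs; the `Release` alias and the `release` and `missing_from_index` functions are hypothetical and are not part of crates.io.

```rust
use std::collections::BTreeSet;

/// Hypothetical representation of one published release: (crate name, version).
type Release = (String, String);

/// Convenience constructor for a `Release`.
fn release(name: &str, version: &str) -> Release {
    (name.to_string(), version.to_string())
}

/// Returns every release recorded in the database that is absent from the
/// index, in sorted order. An empty result means the two are in sync as far
/// as this check can tell.
fn missing_from_index(database: &BTreeSet<Release>, index: &BTreeSet<Release>) -> Vec<Release> {
    database.difference(index).cloned().collect()
}
```

Run on a schedule, a non-empty result would flag exactly the situation in this incident: a crate accepted by the database whose index entry never arrived, even though no job was marked as failed.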

The team said its investigation also led it to conclude, “our logging was not good enough to properly diagnose the problem: there is no message logged when a commit is pushed to the index, nor when a background job is executed.”

Meanwhile, the Rust team has shipped v1.41.1 of the programming language. The update fixes a brace of critical regressions, including a soundness hole related to static lifetimes, and a miscompilation which was causing segfaults.