A flaw in the way Git calculates the difference between versions of the same file can bloat repositories to many times their necessary size, causing performance issues and consuming excessive storage.
Microsoft senior engineer Jonathan Creamer has posted about a very large JavaScript Git repository with which his team work – a monorepo (a single repository which stores multiple related projects). In this case there are over 1,000 monthly active users and around 20 million lines of code. Cloning the repository consumed 178GB of disk space, Creamer reports – more space than one would expect.
The team consulted Git contributor Derrick Stolee, formerly at GitHub and now principal software engineer at Microsoft, who discovered that when comparing two files with a frequently used name (in this case CHANGELOG.md), Git was actually comparing files from different packages, and therefore finding a large difference with every commit.
Stolee submitted a pull request for Git that added what he called a “path walk API” which enables the software to group objects by path and “completely avoids any name-hash collisions.” Creamer applied the git repack command to the large repository, using the newly available --path-walk argument, and the size of the repository was reduced to 5GB.
Stolee posted on the Linux Kernel mailing list about this same issue, stating that “the main discovery was that the current name-hash algorithm only considers the last 16 characters in the path name and has some naturally occurring collisions within that scope.”
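To see why a hash that looks only at the end of a path runs into trouble in a monorepo, consider the following minimal sketch. It is not Git's actual name-hash function: toy_name_hash is a hypothetical stand-in that simply applies CRC32 to the last 16 characters, and the package paths are invented for illustration. It contrasts bucketing by that hash with bucketing by full path, which is roughly the idea behind grouping objects by path.

```python
import zlib
from collections import defaultdict

SUFFIX_LEN = 16  # Stolee's description: only the last 16 characters count


def toy_name_hash(path: str) -> int:
    """Hypothetical stand-in for a name hash: ignores all but the path's tail."""
    return zlib.crc32(path[-SUFFIX_LEN:].encode("utf-8"))


def group(paths, key):
    """Bucket paths by the given key function, keeping insertion order."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return list(buckets.values())


# Invented monorepo layout: two packages, each with its own changelog.
paths = [
    "packages/alpha/src/CHANGELOG.md",
    "packages/beta/src/CHANGELOG.md",
    "packages/alpha/src/index.js",
]

by_hash = group(paths, toy_name_hash)   # changelogs share one bucket
by_path = group(paths, lambda p: p)     # every path gets its own bucket

print(len(by_hash), "buckets by toy hash;", len(by_path), "buckets by full path")
```

Both changelog paths end in the same 16 characters, "src/CHANGELOG.md", so the toy hash cannot tell them apart, while keying on the full path keeps each package's file history separate.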
In another post Stolee noted that “The repository I was looking at had a clear pattern in its top 100 file paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md files … what should have been a trivial set of deltas bloated to 20–60MB.”
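The effect Stolee describes, where trivial deltas become large ones, can be approximated with a short experiment. The sketch below is an illustration under stated assumptions, not a reproduction of Git's delta encoding: it uses zlib compression with a preset dictionary as a rough proxy for "storing a file relative to a base", and the changelog contents, package names and helper functions are all made up.

```python
import random
import zlib


def changelog(package: str, entries: int, seed: int) -> bytes:
    """Build a deterministic, made-up changelog for a hypothetical package."""
    rng = random.Random(seed)
    lines = [f"# Change Log - {package}\n"]
    for i in range(entries):
        lines.append(f"## 1.0.{i}\n- fix {rng.getrandbits(128):032x}\n")
    return "".join(lines).encode("utf-8")


def delta_size(base: bytes, target: bytes) -> int:
    """Proxy for delta cost: compress target using base as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(target) + comp.flush())


alpha_v1 = changelog("alpha", 200, seed=1)
alpha_v2 = alpha_v1 + b"## 1.0.200\n- fix one more bug\n"  # next version, one new entry
beta_v1 = changelog("beta", 200, seed=2)                   # different package entirely

same_file = delta_size(alpha_v1, alpha_v2)      # base is the previous version
cross_package = delta_size(beta_v1, alpha_v2)   # base is another package's file

print("against own previous version:", same_file, "bytes")
print("against another package's changelog:", cross_package, "bytes")
```

When the base is the file's own previous version, almost everything can be expressed as a reference back to it. When the base is a same-named file from a different package, most of the content has to be stored again, which is the pattern that adds up across every commit.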
Stolee gave examples of other repositories where the new options substantially reduced the space required, with one reducing from 130,049MB to 4,432MB.
Over-large Git repositories do not just consume excessive disk space. They also slow Git operations, sometimes to the point of complete failure depending on latency and available bandwidth.
Although dramatic space savings are possible, these examples are of large repositories which have many potential file name collisions. Typical Git repositories will not benefit in the same way. Nevertheless, developers will be keen to see these new features in the release version of Git.