Dimensions of scale in Git

The phrase “giant Git repo” means vastly different things depending on who you talk to. Git has many different scale dimensions. It’s useful to identify these dimensions to think more precisely about what makes a “giant” repo.

Some of these dimensions are (at least partially) entangled. As the value of one gets bigger, another also gets bigger. Other dimensions are inversely related — as one aspect gets worse, another gets better. This can make diagnosing and solving problems tricky. Still, it’s good to start with an understanding of the space.

Sizes

Size is an obvious place to start. The more data Git has to operate on, the longer each operation will take, and the more space will be required.

Quantities

The next area that springs to mind is counts of things. Typically the scale factor is similar to sizes: more of these means more work for Git to traverse during its activities.

Activity

Of course, all of the sizes and quantities mentioned above change over time. Those changes can dramatically alter Git performance. In addition to rates of change in scalar values, there are other indicators of activity that drive load on Git or on a Git hosting service.

Race to update

The “race to update main” is a unique challenge that’s highly exacerbated in monorepos. It’s not particularly expensive for Git but it’s hugely expensive in terms of developer productivity.

When you try to push a change to your shared main branch, your commit has to have the current tip as one of its ancestors. But if someone else has pushed their change first, then you have to fetch that change and either rebase or merge your change onto theirs. This takes time, and in the meantime, someone else may again “win the race”.

This is a Git-level bottleneck. It doesn’t matter whether your changes overlap with the other developers’ changes (e.g. if you’re editing the same files or not). Merge conflicts make the “merge or rebase” part take longer, which can result in even more cases of losing the race. In a monorepo, this race can hit developers working on entirely different projects.
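The fast-forward rule behind this race can be sketched with a toy commit graph. This is a minimal illustration, not how a Git server is implemented: the commit names and the `is_ancestor` and `try_push` helpers are hypothetical, and the graph is just a dictionary mapping each commit to its parents.

```python
def is_ancestor(parents, ancestor, commit):
    """Return True if `ancestor` is reachable from `commit` (or equal to it)."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def try_push(parents, server_tip, new_commit):
    """Accept the push only if the current tip is an ancestor of the new commit."""
    return is_ancestor(parents, server_tip, new_commit)


# Alice and Bob both start from the same tip and edit unrelated files.
parents = {"base": [], "alice": ["base"], "bob": ["base"]}

server_tip = "base"
alice_ok = try_push(parents, server_tip, "alice")  # Alice gets there first
server_tip = "alice"

# Bob's commit does not contain Alice's, so he loses the race.
bob_ok = try_push(parents, server_tip, "bob")

# After rebasing onto the new tip, Bob can try again (if nobody else won meanwhile).
parents["bob_rebased"] = ["alice"]
bob_retry_ok = try_push(parents, server_tip, "bob_rebased")

print(alice_ok, bob_ok, bob_retry_ok)
```

Nothing in the check looks at file contents, which is why developers touching entirely different projects can still block each other.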

Shape

The entire “shape” of your repository’s history can have an impact on performance and certain operations. While you can’t always correct old mistakes, it’s worth at least understanding how they can bite you.

Git has a very simple data model at its core. In order to make it fast and scalable, there are a number of optimizations, caches, and “tricks” living above the simple model. Some of those optimizations and caches work best when they can assume a certain “shape” of the underlying data.

Deep vs shallow ancestry

The “deeper” a repo’s history, the longer any operation that needs to walk through that history will take.

“Spiky” (many tips)

When a repo is too “spiky” (that is, it has many unmerged ref tips), some optimizations can’t be used. A great example is reachability bitmaps, which are a shortcut for knowing which objects are reachable from which commits. Because they’re expensive to create, reachability bitmaps don’t cover every possible entry point to the repository. The more ref tips you have, the less likely a given ref can take full advantage of the reachability bitmaps.
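To make the bitmap idea concrete, here is a toy sketch under simplifying assumptions: it tracks only commits (real bitmaps also cover the other object types), it stores each bitmap as a Python integer, and the choice of which commits get a bitmap is made up for the example. The `reachable` helper and the commit names are hypothetical.

```python
def reachable(parents, index, tip, bitmaps):
    """Return (bitset of commits reachable from tip, commits walked one by one)."""
    bits, walked = 0, 0
    stack = [tip]
    while stack:
        commit = stack.pop()
        if (bits >> index[commit]) & 1:
            continue  # already known to be reachable
        if commit in bitmaps:
            bits |= bitmaps[commit]  # shortcut: a stored bitmap covers all ancestors
            continue
        bits |= 1 << index[commit]
        walked += 1
        stack.extend(parents[commit])
    return bits, walked


# Toy history: a linear main branch c0..c99 plus a three-commit branch off c50.
parents = {"c0": []}
for i in range(1, 100):
    parents[f"c{i}"] = [f"c{i - 1}"]
parents.update({"f1": ["c50"], "f2": ["f1"], "f3": ["f2"]})
index = {name: position for position, name in enumerate(parents)}

# Bitmaps are precomputed for only a few commits (hypothetical selection).
bitmaps = {name: reachable(parents, index, name, {})[0] for name in ("c40", "c99")}

main_bits, main_walked = reachable(parents, index, "c99", bitmaps)
branch_bits, branch_walked = reachable(parents, index, "f3", bitmaps)
full_bits, full_walked = reachable(parents, index, "f3", {})
print(main_walked, branch_walked, full_walked)
```

In this toy, the main tip has its own bitmap and needs no manual walking, while the branch tip has to walk commit by commit until it reaches a commit that has one. With many such tips, more requests fall into the second case.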

Merge structure

The way branches are commonly merged can affect operations that need to find, for example, a common ancestor. Perfectly linear histories (always rebase, never merge) are easy to reason about, but deepen the repo. “Fork and join”-shaped histories (feature branches merged into main) are shallower but may be harder to process during fetches and pulls. Criss-cross merges, where branches are merged back and forth, can really exacerbate problems finding common ancestors.
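The criss-cross problem can be seen in a small sketch. Assuming the usual definition of a merge base (a common ancestor that is not itself an ancestor of another common ancestor), a fork-and-join history has one best answer, while a criss-cross history has two. The helpers and commit names below are hypothetical, and the brute-force search is for illustration only.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Fork and join: two branches diverge from a single commit.
fork_join = {"base": [], "a1": ["base"], "b1": ["base"]}

# Criss-cross: each branch merges the other's earlier commit.
criss_cross = {
    "root": [],
    "x1": ["root"],
    "y1": ["root"],
    "x2": ["x1", "y1"],  # branch x merges y1
    "y2": ["y1", "x1"],  # branch y merges x1
}

simple_bases = merge_bases(fork_join, "a1", "b1")
cross_bases = merge_bases(criss_cross, "x2", "y2")
print(simple_bases, cross_bases)
```

With more than one candidate base, a merge has no single obvious starting point, which is the extra work the paragraph above refers to.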

Deltafyable content vs non-deltafyable content

Adding text-based content tends to work with the optimizations in how Git stores data, while non-text-based content tends to work against them. More important than “text” vs “non-text” is whether content is deltafyable. That is, when you make a small change to the logical content, does it yield a small change in the file’s bytes or a large one? Swapping one variable name in source code is highly deltafyable since most of the file is still the same. Two zip files made from nearly-identical contents may differ wildly due to compression, so they’re poorly deltafyable.
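A rough way to see the difference is to compare how much of a file changes at the byte level after a one-word edit, before and after compression. The sketch below uses a crude proxy rather than a real delta algorithm: it counts the bytes between the first and last position where two versions differ. The sample "source file", the `changed_span` helper, and the use of `zlib` as a stand-in for a compressed container are all assumptions made for illustration.

```python
import zlib


def changed_span(old: bytes, new: bytes) -> int:
    """Bytes of `new` between the first and last positions where it differs from `old`."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return len(new) - prefix - suffix


# A made-up source file: one variable on the first line, then 300 generated lines.
v1 = b"count = 0\n" + b"".join(
    b"value_%d = step(value_%d, %d)\n" % (i, i - 1, i * 37 % 101)
    for i in range(1, 301)
)
# Rename the variable on the first line; everything else stays the same.
v2 = v1.replace(b"count", b"total", 1)

raw_span = changed_span(v1, v2)
packed_span = changed_span(zlib.compress(v1), zlib.compress(v2))
print(len(v1), raw_span, len(zlib.compress(v2)), packed_span)
```

In the uncompressed files only the renamed word differs. In the compressed versions the difference starts near the beginning of the stream and runs through to the trailing checksum, so almost nothing lines up byte for byte.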

Client/server interaction

We would be remiss not to think about bandwidth and latency between client and server. This includes both the physical links and any intervening proxies, load balancers, and other equipment which may store (or even alter!) traffic on the wire.

Developer tolerance

Believe it or not, developer tolerance for performance and complexity is a key dimension of Git scale.
