Infinite Ascent.

by CJ Quineson

Thirteen Ways of Looking at Git

with apologies to wallace stevens

I
Among twenty file versions,
The only one that was visible
Was the one that main points to.

II
I was of three minds,
Like a Git repository
In which there are three collaboration styles.

III
The commit whirled in the remote repository.
It was a small part of the distributed system.

IV
A commit and a commit
Are one.
A commit and a commit and a commit
Are one.

V
I do not know which to prefer,
The beauty of what is
Or the beauty of what should be,
The commit merged
Or rebased.

VI
Commits filled the Git log
In a tangled mess.
The branch labels
Crossed it, to and fro.
The mood
Traced in the movement
An indecipherable development.

VII
O thin men of Haddam,
Why do you imagine IPFS?
Do you not see how Git
Has content-addressable storage
In the hashes about you?

VIII
I know noble trees
And lucid, inescapable DAGs;
But I know, too,
That Git is involved
In what I know.

IX
When the objects were compressed,
It marked the edge
Of a repository clone.

X
At the sight of terminal emulators
With letters in a green light,
Even the Git CLI
Would cry out sharply.

XI
He rode over Connecticut
In a glass coach.
Once, a fear pierced him,
In that he mistook
A many-sorted Merkle tree
For a Git repository.

XII
The staging area is moving.
The working directory must be active.

XIII
It was evening in the bazaar.
There were emails
And there were going to be emails.
The Git repository sat
In the open-source community.

My priorities for writing were covering a diverse set of perspectives on Git, followed by matching the form of the original. I had to sacrifice having a consistent interpretation of what the blackbird was, instead changing it to whatever Git-related concept made sense. But I’m quite happy with the result.

I
Among twenty file versions,
The only one that was visible
Was the one that main points to.

Git is version control software, and it was first introduced to me as a way of maintaining different versions of the same file. Source-control management is still the heart of Git; even the website is called git-scm.com. But it’s grown into many other things.

Perhaps it’s here where it makes the most sense to talk about how Git’s versions act as both snapshots and diffs. These are two ways of seeing Git’s commits, both of which are useful in different contexts, but it’s a distinction I don’t explore much here, mostly because I’m more interested in viewing Git as a whole.

II
I was of three minds,
Like a Git repository
In which there are three collaboration styles.

Git doesn’t come with a prescribed way to use branches and remotes, but the blessed Pro Git book comes with a section on branching workflows and distributed workflows that capture some different ways people use Git.

Compare these two repositories I’ve worked on. The openai-node repository has a stable main branch and an unstable next branch, and releases are done by merging the changes from next into main. This contrasts with remeda, which has an unstable main branch, and stable versions are marked with tags, which appear in GitHub as releases.

Neither is better than the other. But confusion arises when people used to one style work in another.

III
The commit whirled in the remote repository.
It was a small part of the distributed system.

The concept of a distributed version control system is native to me, but it only became well-known in the late 90s. I’ve worked on projects that have used RCS and SVN, and neither supports syncing repositories between multiple systems; collaborators instead work on the same remote machine.

Git emerged to fill the need for a free and distributed system for Linux kernel developers. Each developer had a local copy of the repository, and changes between repositories were synced. Git did not originate this idea; rather, it takes inspiration from BitKeeper, which the Linux kernel used for version control until its creator, BitMover Inc., withdrew the free version.

IV
A commit and a commit
Are one.
A commit and a commit and a commit
Are one.

Suppose that Alice and Carol edit a list of groceries. Alice’s edit results in apricot, banana, cherry, while Carol’s edit results in apple, banana, carrot. What is the correct way to merge their edits? Given only this information, it’s impossible to tell.

Then suppose I told you that their original list was apple, banana, cherry. Then we can tell that Alice replaced apple with apricot, and that Carol replaced cherry with carrot, so the correct merge is apricot, banana, carrot. Contrast this with the world where the original list was instead apricot, banana, carrot; then Alice replaced carrot with cherry, Carol replaced apricot with apple, and the correct merge would be apple, banana, cherry. This is the heart of the three-way merge, which forms the basis of Git’s merge strategies.
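The grocery example can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Git's actual algorithm: it assumes all three lists have the same length and compares them item by item, whereas Git's merge strategies work on line-based diffs and handle insertions and deletions. The function name three_way_merge is made up for this sketch.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited lists item by item against their common ancestor.

    Simplifying assumption: all three lists have the same length,
    so no diffing or alignment is needed.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles same-length lists")
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == t:
            merged.append(o)  # both sides agree (or neither changed)
        elif o == b:
            merged.append(t)  # only their side changed this item
        elif t == b:
            merged.append(o)  # only our side changed this item
        else:
            # both sides changed the same item in different ways
            raise ValueError(f"conflict: {b!r} became {o!r} and {t!r}")
    return merged


alice = ["apricot", "banana", "cherry"]
carol = ["apple", "banana", "carrot"]

# Same two edits, different ancestors, different merges.
print(three_way_merge(["apple", "banana", "cherry"], alice, carol))
print(three_way_merge(["apricot", "banana", "carrot"], alice, carol))
```

The two calls differ only in the base list, which is exactly why the merge is impossible without knowing the common ancestor.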

Merge strategies are some of my favorite kinds of algorithms. The properties of three-way merge can be subtle, and to use it well in practice might require remembering conflict resolutions. The main alternative is patch theory, as introduced by Darcs and extended by Pijul. There are also exciting developments in syntax-aware merging, as advanced by Mergiraf and SemanticDiff.

V
I do not know which to prefer,
The beauty of what is
Or the beauty of what should be,
The commit merged
Or rebased.

Git provides many tools to rewrite history, which means that when merging in changes from one branch to another, we can create genuine merge commits, or we can lie. We can carry the grotesque truths, with honest commit timestamps and correct depictions of code-as-written. Or we can lie, and pretend that all software is written in a linear fashion, with each developer holding a lock on the repository and only writing code on top of it.

There are several ways to lie as well. GitHub calls two of them squash and merge and rebase and merge, while GitLab calls them fast-forward with squashing and fast-forward without squashing, while also giving the option of merge commits with semi-linear history.

VI
Commits filled the Git log
In a tangled mess.
The branch labels
Crossed it, to and fro.
The mood
Traced in the movement
An indecipherable development.

I find the name branch misleading. It gives the impression of an immutable line of history, a single sequence of commits, tracing its ancestry to the dawn of the repository. This is not the case. A branch is a mere label for a commit, which happens to be a label that moves automatically. With the correct words, we can move the label wherever we see fit.
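A toy model makes the point concrete. The sketch below is an assumption-laden illustration rather than how Git stores anything: commits are entries in one dictionary, branches are entries in another, and the helper names commit and history are invented here. In real Git, commands such as git reset or git branch -f are among the "correct words" for moving a label.

```python
# Hypothetical toy model: commits are immutable records, branches are labels.
commits = {}   # commit id -> parent id (None for the root commit)
branches = {}  # branch name -> commit id the label currently points to


def commit(branch, commit_id):
    """Record a new commit on top of the branch tip, then move the label."""
    commits[commit_id] = branches.get(branch)
    branches[branch] = commit_id


def history(branch):
    """Follow parent links from wherever the label currently points."""
    out, cur = [], branches[branch]
    while cur is not None:
        out.append(cur)
        cur = commits[cur]
    return out


commit("main", "a")
commit("main", "b")
commit("main", "c")
print(history("main"))  # the label moved automatically with each commit

branches["main"] = "a"  # move the label by hand
print(history("main"))  # the "branch" now looks one commit long
print(sorted(commits))  # but commits b and c were never deleted
```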

VII
O thin men of Haddam,
Why do you imagine IPFS?
Do you not see how Git
Has content-addressable storage
In the hashes about you?

The usual file system stores references to its files via filenames. Names are unverifiable, as anyone can assign whatever name they want to a given file, and their mutability means that someone looking for a renamed file might not find it. IPFS deals with these issues by referencing files via their content, forming a system called content-addressable storage.

Git itself is a content-addressable file system, as explained in one of the Pro Git chapters. The core is a key-value store from hashes to Git objects. The most common Git object is the blob, which corresponds to the contents of a file, stripped of its metadata. Its hash is the SHA-1 (or SHA-256) of the word blob, a space, its length in bytes, a null byte, and its content.
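The blob hash is simple enough to reproduce by hand. Here is a minimal sketch using Python's hashlib; the function name git_blob_hash is made up, and the algo parameter is just a convenience for switching between SHA-1 and SHA-256.

```python
import hashlib


def git_blob_hash(content: bytes, algo: str = "sha1") -> str:
    """Hash file content the way Git hashes a blob object."""
    # Header: the object type, a space, the decimal byte length, a null byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()


# The empty blob has a well-known SHA-1 identifier.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_hash(b"hello\n"))
```

In a SHA-1 repository, the result should match what git hash-object reports for a file with the same bytes. Note that the filename never enters the hash: two files with identical content share one blob.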

The idea of addressing things by content has benefits useful elsewhere; see, for example, the Unison programming language.

VIII
I know noble trees
And lucid, inescapable DAGs;
But I know, too,
That Git is involved
In what I know.

The commit graph is notable enough that xkcd describes Git via its beautiful […] graph theory tree model. Describing the commit graph as a tree is inaccurate, as every merge commit has at least two parents; rather, it’s a directed acyclic graph.

Because the commit graph is a DAG, Git comes with several graph algorithms built in. One of them is git merge-base, a generalized implementation of lowest common ancestor. Plenty of graph traversals happen when checking for reachability, which commands like git gc do. There’s a whole API for printing graphs in the CLI, which has topological sorting built in, and supplemental data structures keep the algorithms fast.
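To see why lowest common ancestor needs generalizing on a DAG, here is a naive sketch. It is not how Git implements git merge-base, which is far more efficient; the graph is a plain dictionary from commit to parents, and the names ancestors and merge_bases are invented for illustration.

```python
from collections import deque


def ancestors(parents, start):
    """Return every commit reachable from start, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that no other common ancestor descends from."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Two branches forking from B: a single merge base, as in a tree.
fork = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
print(merge_bases(fork, "C", "D"))

# A criss-cross merge: M1 and M2 both merge X and Y, so there are two bases.
criss_cross = {"A": [], "X": ["A"], "Y": ["A"], "M1": ["X", "Y"], "M2": ["X", "Y"]}
print(sorted(merge_bases(criss_cross, "M1", "M2")))
```

The criss-cross case has no single lowest common ancestor, which is the situation that cannot arise in a tree and that git merge-base --all is meant to report.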

IX
When the objects were compressed,
It marked the edge
Of a repository clone.

Git comes with a protocol for transferring files, built on two pairs of commands: git send-pack and git receive-pack, plus git fetch-pack and git upload-pack. These commands start a dance of messages called packfile negotiation, where the client and server tell each other which objects they have and which objects they want. The sender computes which data to bundle in the packfiles, which themselves store data in a compressed format.

If you’ve worked with large Git repos, you might have heard of partial clones and shallow clones, where you can get a copy of the repository without any blobs, or without any trees, or only to a certain depth. These depend on newer versions of Git servers that support these capabilities. Thus, the Git protocol must be designed so that clients and servers running different versions of Git can still talk to each other, and now we run into the same kinds of issues we do when we think about API versioning.

X
At the sight of terminal emulators
With letters in a green light,
Even the Git CLI
Would cry out sharply.

Git’s CLI is notorious for being difficult to understand. But some of the criticism floating on the internet is a bit dated. For example, complaints about git checkout doing too many things should be addressed by the introduction of git switch and git restore in 2019.

Still, larger efforts to make the Git CLI more friendly aren’t going to happen on Git itself. Today, developers have a wide range of Git-compatible tools at their disposal; the most popular include Jujutsu, GitButler, GitUp, and Sapling. But will any of them support git column? I think not.

XI
He rode over Connecticut
In a glass coach.
Once, a fear pierced him,
In that he mistook
A many-sorted Merkle tree
For a Git repository.

I first heard of Merkle trees from their use in Bitcoin and Ethereum, and I found the idea quite clever. At its heart, a Merkle tree is any tree with hashes assigned to each node, such that each hash uses the hashes of its dependencies. Git’s object model uses this idea to generate hashes, predating their use in Bitcoin by three-ish years.

We’ve discussed blobs, the most numerous Git objects. They are leaves in the DAG, because their hashes do not depend on any other Git object. The next step up is a tree, which represents a collection of files and directories. A tree has a list of triplets, each with a Unix mode, a filename, and a hash, which points to a blob or another tree. These things are hashed together to get the tree’s hash.

A commit has a tree, zero or more parent commits, and some commit metadata. These are again hashed together to form the commit’s hash. Commit hashes are the most common kind of hash you’ll hear when discussing Git. Because a commit’s hash depends on the hash of everything it contains, it lends verifiability to the integrity of its data, which a Git client can check.
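The chain from blob to tree to commit can be sketched with hashlib. This is a simplified illustration assuming SHA-1: the helper names are invented, the author line is a placeholder, and entries are sorted by plain filename, whereas Git's real ordering treats directory names slightly differently.

```python
import hashlib


def git_object_hash(kind: bytes, body: bytes) -> str:
    """SHA-1 of '<kind> <length>' + null byte + body, as for any Git object."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Build a tree body from (mode, filename, hex hash) triplets."""
    body = b""
    # Simplification: sort by plain filename.
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        # The child's hash is stored as raw bytes, not as hex text.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_hash)
    return body


def commit_body(tree, parents, author, message):
    """Build a commit body; the same placeholder identity is used for both roles."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}", "", message]
    return "\n".join(lines).encode()


def snapshot(readme: bytes) -> tuple:
    """Hash a one-file repository: returns (blob, tree, commit) hashes."""
    blob = git_object_hash(b"blob", readme)
    tree = git_object_hash(b"tree", tree_body([("100644", "README", blob)]))
    who = "Example Author <author@example.com> 0 +0000"  # placeholder identity
    commit = git_object_hash(b"commit", commit_body(tree, [], who, "Initial commit\n"))
    return blob, tree, commit


print(git_object_hash(b"tree", b""))  # the well-known empty tree hash
print(snapshot(b"hello\n"))
print(snapshot(b"hello!\n"))  # one changed byte changes all three hashes
```

Changing a single byte in the file changes the blob hash, which changes the tree hash, which changes the commit hash. That propagation is what lets a commit hash vouch for everything beneath it.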

XII
The staging area is moving.
The working directory must be active.

The presence of the staging area is one of Git’s strongest design choices. A subset of the changes from the working directory must be staged, and only that subset gets committed. I find this useful, but learning how to use it required wrapping my head around the file status lifecycle.

This is a design choice, though, and there are alternatives. For example, Jujutsu removes the staging area, with everything being committed-by-default. Subsetting changes is done after-the-fact via jj split. This strikes me as strange, but I do wonder: in the world where I grew up with Jujutsu instead of Git, would I have found that model more natural?

XIII
It was evening in the bazaar.
There were emails
And there were going to be emails.
The Git repository sat
In the open-source community.

Many free software projects collaborate over email. Git was first created for the Linux kernel, which works over a plethora of mailing lists, and so Git comes with a lot of email tools. While development over email and mailing lists is certainly old-school, it exemplifies the bazaar model that I think of as essential to open-source software.

Work on Git itself also happens via mailing list, an ouroboros pulling itself up by the bootstraps.
