Moving from a traditional source control system like Perforce or Subversion to Git seems like it shouldn’t be that hard. In fact, Git is supposed to be very easy to use, so why is it that when I first started using Git, I kept getting tripped up? I had read Joel Spolsky’s Mercurial tutorial (not exactly like Git, but close enough) and thought I understood it. Joel also made it very clear that there was an unlearning process that had to take place, but I didn’t realize how much I needed to unlearn.
Git tries to use familiar concepts in an effort to make the transition easier, but I think it really does more harm than good. You end up with a situation where familiar terms are used to describe things that are similar only at a very superficial level. It doesn’t take long for the differences to trip you up.
I’m not going to focus on the differences between centralized vs. distributed version control systems (that is covered in other places and is fairly straightforward). I’m going to concentrate on which “old” concepts you need to unlearn, specifically branch, changeset, head, and checkout. Whatever you think these mean is very likely wrong in the world of Git. Read on and be enlightened.
The fundamental unit of data in Git is the snapshot. In most systems, when you commit changes, those changes are packaged into a changeset or something similar. Whenever you commit changes to Git, Git effectively creates a snapshot of your entire workspace in its current state. To save space, these snapshots only contain the delta between a “parent” snapshot and the new snapshot. Each snapshot contains a link to its parent snapshot. In the case of a merge, the snapshot contains a link to each of the two parent snapshots used to create this one.
So, a Git repository is essentially just a collection of snapshots connected by backward links to their parent snapshots. Here’s an example with 12 snapshots and the backward links from each to its parent:
Just to orient yourself, the oldest snapshots are at the bottom and the newest are at the top (some documentation shows this inverted, or horizontally).
Perhaps you’ll be surprised to learn that this diagram does not show any branches. Your first reaction may be “of course it does!” In fact, this figure shows three snapshot history paths, but in Git parlance, these are not called branches.
So, what is a branch? A branch is simply a named pointer to a snapshot. Let’s add a few to our example:
We now have two branches, “main” and “feature”. The hardest part to understand is that these branches are just pointers to a single snapshot. They don’t refer to the entire line of changes. Branches can be created, destroyed, and moved without affecting any of the snapshots. You can also have several branches pointing to the same snapshot. These combine to allow for easy changes to the branch names, the ability to “retroactively” create a new branch, etc.
You can also see in this diagram that there is not a branch corresponding to snapshot 8. Once you merge two code lines, you can safely delete the extra branch without affecting the repository. You can even recreate the branch at the same point in the future if you want. The only thing to be aware of, is that unreferenced snapshots (those without a branch or child snapshot) will eventually be deleted (Git has a built-in garbage collector). So, if we move the “feature” branch to snapshot 11, then snapshot 12 would eventually be garbage collected. Normally Git waits some time before deleting these orphaned snapshots to allow for situations where the user is simply moving branches around and has only temporarily orphaned a snapshot.
Let’s take a look at a very common situation. Let’s say you are developing a new feature, but forgot to create a new branch for your work and accidentally commit your changes to the main branch. After your commit, the repository looks like this:
where your new changes were committed as snapshot 13. This can easily be fixed by creating a new branch at snapshot 13, then moving the “main” branch back to snapshot 10.
The head is a pointer to your “current” branch. So, a branch is a pointer to a shapshot, and the head is a pointer to a branch:
The head is used primarily when you commit changes. When you commit changes, Git follows the head pointer to the branch, then follows that to a snapshot. It then creates a new snapshot as a child of that one and moves the branch to the new snapshot:
Since the head points to the branch, it effectively moves with it. If you want to change the head explicitly, you can do that by doing a checkout.
Performing a checkout on a branch does two things: it moves the head pointer to that branch, and updates your local workspace files to match the corresponding snapshot. If you wanted to switch to the “feature” branch, you can do a checkout of that branch to move the head pointer there and update your workspace files:
The key points are:
- Each commit creates a snapshot that effectively captures the state of your workspace at the time of the commit.
- A branch is simply a pointer to a snapshot.
- The head is a pointer to the current branch.
- Checkout moves the head to a new branch and updates your workspaces to match the corresponding snapshot.
It is also important to be aware that some Git tools introduce some complications into this. For example, the process of moving a branch pointer (using a reset command) may cause a checkout at the same time. There are usually command options to control these interactions. Another concept that you’ll have to deal with (which isn’t covered here) is how uncommitted changes to local files are handled during these operations.
The best part of Git is that you can try out commands on your local repository before pushing your changes to the server. Even if you completely mess up your local repository, you can simply throw it out and get a fresh copy from the server.