Git Data Storage

Later we are going to use some of Git’s lower-level commands to create Git data objects. To prepare ourselves, let’s review three kinds of Git object:

  • Blob
  • Tree
  • Com­mit

What is stored in these data objects? The contents of the repository and the history of its changes.

Blob

We’ll start with a Blob.

A blob is where Git stores the contents of a file that it is tracking as part of the repository.

The blob is referenced using a 40-character SHA1 hash made from the contents of the blob (the contents of the file, plus a short header that Git prepends). You’ve seen these hashes before when referencing commits. Git uses SHA1 hashes to identify all data objects in its repository. Because two different contents are extremely unlikely to produce the same hash, each data object effectively gets a unique id.
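As a rough illustration of how such an id is derived, here is a minimal Python sketch. The function name `git_blob_hash` is made up for this example. It assumes the classic SHA1 object format, in which Git hashes a small header (`blob`, the size in bytes, and a NUL byte) followed by the raw file contents.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA1 id Git would give a blob with these contents."""
    # Git hashes a header ("blob <size>" plus a NUL byte) followed by the raw contents.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Same contents as `echo 'test content' | git hash-object --stdin`
    print(git_blob_hash(b"test content\n"))
    # prints d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Only the contents go into the hash. The file's name and location are not part of the input, so the same bytes always produce the same id.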

One side effect of using only the contents of the file in a blob is that other files with the same contents just reference the same blob, so the contents don’t have to be stored twice.

So a blob has a unique id, created from a SHA1 hash of the contents of the blob, and it represents the contents of an actual file.

[blob 3837d8] — [index.html]

Using an SHA1 hash

Git doesn’t use the SHA1 hash to secure anything. It uses the hash because Git is “content-addressable” storage. This means that an object is stored and retrieved based on its content, not its location.
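To make the idea of content-addressable storage concrete, here is a toy in-memory store in Python. The `ObjectStore` class is hypothetical and much simpler than Git's real object database (no headers, no compression, nothing written to disk), but it shows the key property: the key is computed from the content, so identical content is stored only once.

```python
import hashlib


class ObjectStore:
    """A toy content-addressable store (illustrative only, not Git's real format)."""

    def __init__(self):
        self._objects = {}  # maps hex hash -> content bytes

    def put(self, content: bytes) -> str:
        # The key is derived purely from the content, so identical
        # content always maps to the same key and is kept only once.
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        # Retrieval needs only the hash, not a path or location.
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
first = store.put(b"<h1>Hello</h1>\n")
second = store.put(b"<h1>Hello</h1>\n")  # same content "saved" again
```

After both calls, `first` and `second` are the same key and the store still holds a single object.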

Tree

Just as a blob represents a file, a tree represents a directory in the file system, under a different name. For our purposes, we can think of a tree as a directory. It contains blobs (files) and other trees (subdirectories), and the trees inside the tree can contain both as well.

[TREE] / [blob]
       — [TREE] / [blob]
                \ [blob]    
       \ [blob]

Just like a blob, a tree is identified by a SHA1 hash of its contents. Those contents are the entries of the directory it represents: the names of its files and subdirectories, each with a pointer (a hash) to the corresponding blob or tree.
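The following Python sketch shows roughly how a tree's id could be computed from its entries. The function names are made up for this example, and it assumes the classic SHA1 object format: each entry is a mode, a name, a NUL byte, and the 20 raw bytes of the referenced object's id, with entries sorted by name (directories compared as if their name ended in `/`).

```python
import hashlib


def git_object_hash(kind: bytes, body: bytes) -> str:
    # Every object is hashed as "<kind> <size>", a NUL byte, then the body.
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def git_tree_hash(entries) -> str:
    """entries: list of (mode, name, hex_id) tuples, where mode is
    "100644" for a regular file or "40000" for a subdirectory."""

    def sort_key(entry):
        mode, name, _ = entry
        # Directories are ordered as if their name had a trailing slash.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return git_object_hash(b"tree", body)


if __name__ == "__main__":
    print(git_tree_hash([]))
    # prints 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

Because a tree's hash covers the hashes of everything inside it, changing one file changes the id of its blob, of the tree that contains it, and of every tree above that.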

Commit

A commit is a snapshot of what the tree looked like at a given point in time.

The HEAD in a repository is just a pointer to a commit (usually indirectly, through the current branch), which is the object that stores the state of the repository when that commit object was created.

[COMMIT] —    [TREE] / [blob]
                   — [TREE] / [blob]
                            \ [blob]    
                   \ [blob]

Commits are linked in one direction, with each commit pointing back to its parent commit or commits. Together they form a directed acyclic graph that represents the history of your changes in the repository.
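Here is a small Python sketch of what walking such a graph looks like. The commit ids (`c1` to `c4`) and the `history` dictionary are invented for illustration. The sketch assumes each commit records the ids of its parent commits, which is how the one-way links are formed.

```python
# Hypothetical history: each commit id maps to the ids of its parents.
history = {
    "c1": [],            # root commit, no parents
    "c2": ["c1"],
    "c3": ["c1"],        # a second line of work starting from c1
    "c4": ["c2", "c3"],  # a merge commit with two parents
}


def ancestors(commit_id, parents):
    """Return every commit reachable from commit_id, including itself."""
    seen = []
    stack = [commit_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        stack.extend(parents[current])
    return seen
```

Starting from `c4`, the walk reaches all four commits. Starting from `c2`, it reaches only `c2` and `c1`, because links point back to parents and never forward to children.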

A commit object contains the following:

  • the hash of the tree object that the commit points to (the snapshot of the top-level directory)
  • the hashes of its parent commit or commits, if it has any
  • the name of the author who created the new version
  • the name of the person who created the commit object (usually the same as the person who created the new version)
  • the commit message

All of those together make up the commit object, and the object’s SHA1 hash is based on them. This makes commits unique while also keeping the pieces of the commit separate.
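As a sketch of how those pieces combine into one id, the Python below assembles a commit in Git's plain-text layout (a `tree` line, optional `parent` lines, `author` and `committer` lines, a blank line, then the message) and hashes it. The function name and the identity string are hypothetical, and the classic SHA1 object format is assumed.

```python
import hashlib


def git_commit_hash(tree, parents, author, committer, message):
    """Hash a commit built from its parts. author and committer look like
    "Name <email> <unix-timestamp> <timezone>"."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines.append("author " + author)
    lines.append("committer " + committer)
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # Same scheme as other objects: "<kind> <size>", a NUL byte, then the body.
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "A. Example <a@example.com> 1700000000 +0000"  # hypothetical identity

first = git_commit_hash(EMPTY_TREE, [], who, who, "Initial commit\n")
second = git_commit_hash(EMPTY_TREE, [first], who, who, "Second commit\n")
```

Changing any single piece (the tree, a parent, a name, a timestamp, or the message) produces a different commit id, even though the tree and blobs it refers to remain separate objects.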