A peek under the hood of Git’s binary formats
Git is remarkably efficient at storing and retrieving history, and I wanted to understand it at a level that goes beyond just using it every day. It’s one thing to know how to push a commit or resolve a merge conflict, but what’s actually going on under the hood? To figure that out, I took a deep dive into Git’s binary formats.
At its core, Git is a content-addressable file system: every object is stored under a name derived from a hash of its content. Every file, commit, and directory in Git is stored as an object. These objects are categorized into blobs (for file content), trees (for directories), commits (to tie everything together), and annotated tags. What makes Git powerful is how it stores and compresses these objects efficiently behind the scenes.
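To make "content-addressable" concrete, here is a minimal Go sketch of how a blob's name can be derived from its content. It assumes a SHA-1 repository and the "type, space, size, zero byte" header that shows up in the hex dump later in this post. The function name objectID is made up for this illustration and is not part of any Git API.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// objectID returns the hex SHA-1 that names an object. The hash covers a
// short header ("<type> <size>" plus one zero byte) followed by the content.
// objectID is a hypothetical helper name, not a Git function.
func objectID(objType string, content []byte) string {
	header := fmt.Sprintf("%s %d\x00", objType, len(content))
	h := sha1.New()
	h.Write([]byte(header))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	id := objectID("blob", []byte("Hello, World!\n"))
	fmt.Println(id)
	// Loose objects are stored under the first two hex characters of the ID.
	fmt.Printf(".git/objects/%s/%s\n", id[:2], id[2:])
}
```

For the "Hello, World!" file used later in this post, this prints 8ab686eafeb1f44702738c8b0f24f2567c36da6d, the same ID that git hash-object reports.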
How Git compresses (almost) everything
Most of what Git deals with—blobs, commits, trees—is compressed with zlib. The beauty of this design is that the compression happens behind the scenes. You don’t have to think about it, but you can trust that it’s making Git efficient.
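If you want to see how little ceremony zlib needs, here is a small Go sketch of a compress and decompress round trip using the standard compress/zlib package. The helper names deflate and inflate are chosen for this example only, and the sketch makes no claim about the compression level Git itself uses.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// deflate compresses data into a zlib stream (hypothetical helper name).
func deflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining data and writes the trailing checksum.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// inflate reverses deflate (hypothetical helper name).
func inflate(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// Repetitive input, so the size difference is easy to see.
	original := bytes.Repeat([]byte("Hello, World!\n"), 100)
	packed, err := deflate(original)
	if err != nil {
		panic(err)
	}
	unpacked, err := inflate(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("original: %d bytes, compressed: %d bytes\n", len(original), len(packed))
	fmt.Println("round trip intact:", bytes.Equal(original, unpacked))
}
```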
But let’s not take this efficiency for granted. To really appreciate it, you need to get hands-on. Here’s how you can manually decompress a Git blob and see exactly how Git stores your data.
Decompressing a Git blob: a hands-on exploration
If you’re the kind of person who likes to know how things work—down to the byte—this part is for you. Git uses zlib to compress blobs, which are the raw file data Git stores. Let’s take a closer look at how you can unpack one of these blobs and understand what’s inside.
Create a simple file:
echo "Hello, World!" > foo.txt
Generate a hash for the file: Git runs SHA-1 over the content, prefixed with a short header, and the resulting hash acts as the blob ID.
git hash-object foo.txt
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
Add the file to the index (this also writes the compressed blob into .git/objects):
git add foo.txt
Decompress the object: To inspect the blob, you need to decompress it manually. Use zlib-flate (which ships with the qpdf package) and xxd for this:
zlib-flate -uncompress < .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d | xxd
Examine the output: The raw output reveals the internal structure of the blob, including its type, size, and actual content:
00000000: 626c 6f62 2031 3400 4865 6c6c 6f2c 2057  blob 14.Hello, W
00000010: 6f72 6c64 210a                           orld!.
The first part, blob 14, is Git telling you this is a blob object whose content is 14 bytes long. A single zero byte (the 00 right after 3134 in the dump) ends that header, and everything after it is the file content: “Hello, World!” plus the trailing newline (0a) that echo added, which is why the size is 14 rather than 13. Simple, but now you’ve seen exactly how Git stores this at the byte level.
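The same steps can be sketched in Go: inflate the file, find the zero byte, and split the header into type and size. This is a rough illustration rather than how Git itself is written. The names parseObject and readLooseObject are invented for this example, and it only handles loose objects, not packfiles.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"os"
	"strconv"
)

// parseObject splits an inflated object into its type, declared size, and body.
func parseObject(raw []byte) (string, int, []byte, error) {
	nul := bytes.IndexByte(raw, 0) // the single zero byte that ends the header
	if nul < 0 {
		return "", 0, nil, fmt.Errorf("header is not terminated by a zero byte")
	}
	header := raw[:nul]
	sp := bytes.IndexByte(header, ' ')
	if sp < 0 {
		return "", 0, nil, fmt.Errorf("malformed header %q", header)
	}
	size, err := strconv.Atoi(string(header[sp+1:])) // size is ASCII decimal
	if err != nil {
		return "", 0, nil, fmt.Errorf("bad size in header: %w", err)
	}
	body := raw[nul+1:]
	if len(body) != size {
		return "", 0, nil, fmt.Errorf("header says %d bytes, found %d", size, len(body))
	}
	return string(header[:sp]), size, body, nil
}

// readLooseObject inflates the file at path and parses the result.
func readLooseObject(path string) (string, int, []byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", 0, nil, err
	}
	defer f.Close()
	zr, err := zlib.NewReader(f)
	if err != nil {
		return "", 0, nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", 0, nil, err
	}
	return parseObject(raw)
}

func main() {
	var (
		typ  string
		size int
		body []byte
		err  error
	)
	if len(os.Args) > 1 {
		typ, size, body, err = readLooseObject(os.Args[1])
	} else {
		// No path given: fall back to the bytes from the hex dump above.
		typ, size, body, err = parseObject([]byte("blob 14\x00Hello, World!\n"))
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Printf("type=%s size=%d content=%q\n", typ, size, body)
}
```

Pass it the path to a loose object, for example .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d, to inspect a real blob.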
The uncompressed index: why it matters
The .git/index file is where Git tracks the files you’ve staged for commit. It’s fast, but not because it’s complex; it’s fast because it’s simple. Unlike the objects under .git/objects, it’s not compressed. This makes sense: compression adds latency, and since Git is constantly reading and writing the index, it needs to be as responsive as possible. In a sense, the index is optimized for immediacy, not space efficiency. That’s a trade-off you can afford when disk space is cheap and speed matters more.
Looking at the index: why you’ll see garbage
If you were to hex-dump the .git/index file the same way (skipping the zlib step, since the index isn’t compressed), you’d quickly notice something: a lot of it looks like garbage. That’s because .git/index isn’t meant to be human-readable, and when you interpret raw binary data as ASCII, you get a mess.
Git stores metadata about the files in the index—file paths, modification times, permissions, and more. Much of that data isn’t ASCII, so dumping it into a hex viewer isn’t going to give you a clean output like we saw with the blob.
To make sense of it, you’d either need to write a quick script that parses the binary structure or lean on Git’s plumbing commands, like git ls-files --stage, to get a cleaner view of what’s inside.
git ls-files --stage
100644 f82f58ec77fde25ab0fc9fc0ca71665534c63ec7 0 test.md
100644 8ab686eafeb1f44702738c8b0f24f2567c36da6d 0 yeh.txt
But if you want to get your hands dirty and break it down yourself, there’s a certain satisfaction in extracting the meaning from the binary, even if it’s more tedious. After all, tools abstract the complexity, but sometimes the complexity is where you really learn how something works.
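For a taste of what that looks like, here is a rough Go sketch that pulls the same columns as git ls-files --stage out of the raw file. It makes several assumptions, so treat it as a starting point: index version 2 only, SHA-1 object IDs, and no handling of extensions or the trailing checksum. The field offsets follow my reading of Git's index-format documentation, and the names indexEntry and parseIndex are invented for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"
)

// indexEntry holds the fields that git ls-files --stage prints.
type indexEntry struct {
	Mode  uint32
	ID    [20]byte
	Stage uint16
	Path  string
}

// parseIndex decodes the header and entries of a version 2 index.
// Assumption: SHA-1 IDs; versions 3 and 4 are rejected, extensions are not read.
func parseIndex(data []byte) ([]indexEntry, error) {
	if len(data) < 12 || string(data[:4]) != "DIRC" {
		return nil, fmt.Errorf("not a Git index: bad signature")
	}
	if v := binary.BigEndian.Uint32(data[4:8]); v != 2 {
		return nil, fmt.Errorf("unsupported index version %d", v)
	}
	count := binary.BigEndian.Uint32(data[8:12])
	var entries []indexEntry
	pos := 12
	for i := uint32(0); i < count; i++ {
		// 62 fixed bytes: 40 of stat data, 20 of object ID, 2 of flags.
		if len(data) < pos+62 {
			return nil, fmt.Errorf("entry %d: truncated", i)
		}
		var e indexEntry
		e.Mode = binary.BigEndian.Uint32(data[pos+24 : pos+28])
		copy(e.ID[:], data[pos+40:pos+60])
		flags := binary.BigEndian.Uint16(data[pos+60 : pos+62])
		e.Stage = (flags >> 12) & 0x3 // two bits of merge stage
		nameLen := bytes.IndexByte(data[pos+62:], 0)
		if nameLen < 0 {
			return nil, fmt.Errorf("entry %d: unterminated path", i)
		}
		e.Path = string(data[pos+62 : pos+62+nameLen])
		entries = append(entries, e)
		// Zero bytes pad each entry's length up to a multiple of 8.
		pos += (62 + nameLen + 8) / 8 * 8
	}
	return entries, nil
}

func main() {
	path := ".git/index"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	entries, err := parseIndex(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	for _, e := range entries {
		// Same columns as git ls-files --stage: mode, object ID, stage, path.
		fmt.Printf("%o %x %d\t%s\n", e.Mode, e.ID[:], e.Stage, e.Path)
	}
}
```

Run it from the root of a repository, or pass the path to an index file as the first argument.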
Now that you’ve seen how Git handles the index, you might wonder why it doesn’t store object data more efficiently. Why use ASCII for content size instead of binary?
Sure, binary would save space and shave off a few CPU cycles. You could even skip the null terminator hunt with fixed-width object type encoding. Valid points.
But Git’s making the right trade-off here. This isn’t about squeezing out tiny performance gains. How often are you looking up these objects compared to reading and writing the index? Not often enough to justify the change.
Git’s optimizing for extensibility, not marginal performance gains. It’s classic hacker thinking—don’t optimize prematurely. Keep your options open.
The takeaway
Re-implementing Git in Go is the kind of project that forces you to appreciate how tightly optimized Git is. It’s a masterclass in trade-offs: speed versus space, simplicity versus power. Understanding these binary formats is like opening a black box and realizing there’s nothing magical inside—just smart design.
If you want to really understand something, don’t just use it—break it apart. That’s where the real learning happens.