A peek under the hood of Git’s binary formats

Git is one of the most efficient pieces of software out there. As you may know, I wanted to understand Git at a level that goes beyond just using it every day. It’s one thing to know how to push a commit or resolve a merge conflict, but what’s actually going on under the hood? To figure that out, I took a deep dive into Git’s binary formats.

At its core, Git is a content-addressable file system. Every file, commit, and directory in Git is stored as an object. These objects are categorized into blobs (for file content), trees (for directories), and commits (to tie everything together). What makes Git powerful is how it stores and compresses these objects efficiently behind the scenes.

How Git compresses (almost) everything

Most of what Git deals with—blobs, commits, trees—is compressed with zlib. The beauty of this design is that the compression happens behind the scenes. You don’t have to think about it, but you can trust that it’s making Git efficient.

But let’s not take this efficiency for granted. To really appreciate it, you need to get hands-on. Here’s how you can manually decompress a Git blob and see exactly how Git stores your data.

Decompressing a Git blob: a hands-on exploration

If you’re the kind of person who likes to know how things work—down to the byte—this part is for you. Git uses zlib to compress blobs, which are the raw file data Git stores. Let’s take a closer look at how you can unpack one of these blobs and understand what’s inside.

Create a simple file:

echo "Hello, World!" > foo.txt

Generate a hash for the file: Git runs SHA-1 over a short header (the object type and size) followed by the file content, and the resulting digest acts as the blob ID.

git hash-object foo.txt
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
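To see that there is nothing more to the blob ID than a SHA-1 over a header plus the content, here is a minimal sketch in Python. The function name git_blob_hash is made up for this example, and the sketch assumes a repository that uses the default SHA-1 object format.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the ID Git would give a blob with this content (SHA-1 repos only)."""
    # The header is the object type, a space, the size in ASCII decimal, then one NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# echo appends a newline, so foo.txt holds 14 bytes, not 13.
print(git_blob_hash(b"Hello, World!\n"))
```

Running it should print the same ID that git hash-object reported for foo.txt.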

Add the file to the index:

git add foo.txt

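Decompress the object: Git stores the blob as a loose object under .git/objects, in a directory named after the first two hex characters of the hash, with the remaining 38 characters as the file name. To inspect it, you need to decompress it manually. Use zlib-flate (part of the qpdf package) and xxd for this: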

zlib-flate -uncompress < .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d | xxd

Examine the output: The raw output reveals the internal structure of the blob, including its type, size, and actual content:

00000000: 626c 6f62 2031 3400 4865 6c6c 6f2c 2057  blob 14.Hello, W
00000010: 6f72 6c64 210a                           orld!.

The first part, blob 14, is Git telling you this is a blob object whose content is 14 bytes long: the 13 characters of “Hello, World!” plus the newline that echo appends. After that comes the file content itself. Separating the two is a null terminator, which is a single zero byte (the 00 that follows 3134 in the dump). Simple, but now you’ve seen exactly how Git stores this at the byte level.
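If you don’t have zlib-flate installed, the same unpacking can be sketched in a few lines of Python. The helper names loose_object_path and parse_loose_object are hypothetical, and the sketch assumes a loose (unpacked) object in a SHA-1 repository.

```python
import zlib

def loose_object_path(sha: str, git_dir: str = ".git") -> str:
    """Path of a loose object: first two hex chars are the directory, the rest the file."""
    return f"{git_dir}/objects/{sha[:2]}/{sha[2:]}"

def parse_loose_object(compressed: bytes):
    """Decompress a loose object and split it into (type, size, content)."""
    raw = zlib.decompress(compressed)
    # Everything before the first zero byte is the ASCII header "<type> <size>".
    header, _, body = raw.partition(b"\x00")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), int(size), body

# Compress the same header and content Git stores for foo.txt,
# so the example runs without a repository.
stored = zlib.compress(b"blob 14\x00Hello, World!\n")
print(parse_loose_object(stored))
```

Inside a real repository you could instead read the file at loose_object_path("8ab686eafeb1f44702738c8b0f24f2567c36da6d") in binary mode and pass its bytes to parse_loose_object.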

The uncompressed index: why it matters

The .git/index is where Git tracks the files you’ve staged for commit. It’s fast, and not because it’s complex: it’s fast because it’s simple. Unlike Git objects, it’s not compressed. This makes perfect sense. Compression adds latency, and since Git is constantly reading and writing the index, it needs to be as responsive as possible. In a sense, the index is optimized for immediacy, not storage efficiency. That’s a trade-off you can afford when disk space is cheap and speed matters more.

Looking at the index: why you’ll see garbage

If you were to hex-dump the .git/index file the same way (skipping the decompression step, since there is nothing to decompress), you’d quickly notice something: a lot of it looks like garbage. That’s because the .git/index isn’t meant to be human-readable, and when you try to interpret raw binary data as ASCII, it’s going to spit out a mess.

Git stores metadata about the files in the index—file paths, modification times, permissions, and more. Much of that data isn’t ASCII, so dumping it into a hex viewer isn’t going to give you a clean output like we saw with the blob.

To make sense of it, you’d either need to write a quick script that parses the binary structure or lean on Git’s internal plumbing commands, like git ls-files --stage, to get a cleaner view of what’s inside.

git ls-files --stage
100644 f82f58ec77fde25ab0fc9fc0ca71665534c63ec7 0       test.md
100644 8ab686eafeb1f44702738c8b0f24f2567c36da6d 0       yeh.txt

But if you want to get your hands dirty and break it down yourself, there’s a certain satisfaction in extracting the meaning from the binary, even if it’s more tedious. After all, tools abstract the complexity, but sometimes the complexity is where you really learn how something works.
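Here is a minimal sketch of that do-it-yourself route, in Python for brevity. It assumes index version 2 (versions 3 and 4 add extended flags and path compression that this ignores), paths shorter than 4095 bytes, and it skips extensions and the trailing checksum. The name parse_index is made up for this example.

```python
import struct

def parse_index(data: bytes):
    """Parse a version 2 Git index into (mode, hash, stage, path) tuples."""
    # 12-byte header: signature "DIRC", version, entry count (all big-endian).
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    if version != 2:
        raise ValueError("this sketch only handles index version 2")

    entries = []
    offset = 12
    for _ in range(count):
        # Fixed part of an entry: ten 32-bit fields (ctime, mtime, dev, ino,
        # mode, uid, gid, size), a 20-byte SHA-1, and 16 bits of flags.
        fields = struct.unpack(">10I20sH", data[offset:offset + 62])
        mode, sha, flags = fields[6], fields[10], fields[11]
        name_len = flags & 0x0FFF      # low 12 bits: path length
        stage = (flags >> 12) & 0x3    # next 2 bits: merge stage
        path = data[offset + 62:offset + 62 + name_len].decode("utf-8")
        entries.append((f"{mode:o}", sha.hex(), stage, path))
        # Entries are padded with 1 to 8 NUL bytes to a multiple of 8 bytes.
        offset += (62 + name_len + 8) // 8 * 8
    return entries
```

Reading .git/index in binary mode and passing the bytes to parse_index should give the same four columns as git ls-files --stage, as long as the assumptions above hold for your repository.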

Now that you’ve seen how Git handles the index, you might wonder why it doesn’t store object data more efficiently. Why use ASCII for content size instead of binary?

Sure, binary would save space and shave off a few CPU cycles. You could even skip the null terminator hunt with fixed-width object type encoding. Valid points.

But Git’s making the right trade-off here. This isn’t about squeezing out tiny performance gains. How often are you looking up these objects compared to reading and writing the index? Not often enough to justify the change.

Git’s optimizing for extensibility, not marginal performance gains. It’s classic hacker thinking—don’t optimize prematurely. Keep your options open.

The takeaway

Re-implementing Git in Go is the kind of project that forces you to appreciate how tightly optimized Git is. It’s a masterclass in trade-offs: speed versus space, simplicity versus power. Understanding these binary formats is like opening a black box and realizing there’s nothing magical inside—just smart design.

If you want to really understand something, don’t just use it—break it apart. That’s where the real learning happens.

If you want to be notified about the next article, on rolling your own basic Git in Go, sign up to the newsletter here.

As always, you can reach me on Twitter, LinkedIn, or email.