How git clone Really Works: A Deep Dive into Git’s Object Database

Published: December 11, 2025 at 01:22 AM EST

Source: Dev.to

What git clone Actually Does

Git performs the following steps:

Negotiates with the remote to discover available references (branches, tags).
Downloads the full object graph (all commits, trees, blobs, and annotated tags reachable from those references), packed and delta‑compressed for transfer.
Writes these objects into .git/objects/pack/, creates remote‑tracking refs under refs/remotes/origin/, a local branch, and HEAD, and then checks out a working tree from the root tree of the commit that HEAD resolves to.

In essence:

clone = copy the object graph + set references + checkout the working tree

The Git Object Model: Core Building Blocks

Git is a content‑addressed database, not a traditional filesystem. Every file, directory, commit, and tag exists as an immutable object identified by a cryptographic hash (SHA‑1 or SHA‑256). This makes Git’s data model tamper‑evident, deduplicated, and verifiable.
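To make content addressing concrete, here is a minimal Python sketch of how a blob ID is derived. It assumes a repository using the SHA‑1 object format, and the function name git_blob_id is made up for illustration; it is not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by
    # the raw file bytes, so the ID depends only on the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

For the same bytes, the result should match what git hash-object reports, which is why two files with identical content share one object.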

Type	Purpose	Contains
Blob	File contents	Raw bytes of the file (no name or mode)
Tree	Directory snapshot	Mode, name, and object ID for each child
Commit	Snapshot metadata	Author, committer, message, parent commits, root tree
Tag	Annotated tag	Tagger, message, and pointer to the tagged object

The Object Graph

commit C
│  tree -> T_root
│            ├── mode 100644 "README.md" -> blob B1
│            ├── mode 100755 "build.sh"  -> blob B2
│            └── mode 040000 "src"       -> tree T_src
│                                                ├── "main.go" -> blob B3
│                                                └── "util.go" -> blob B4
│
└── parent -> commit P
               │ tree -> T_prev
               └── parent -> ...

Key ideas

A commit points to a tree, which represents a snapshot of the repository.
Trees point to blobs (files) or other subtrees (directories).
Commits form a Directed Acyclic Graph (DAG) through parent references.
Identical content produces identical hashes, so Git automatically reuses objects.
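The key ideas above can be sketched in a few lines of Python. This is a simplified illustration assuming SHA‑1 object IDs; the helper names object_id and tree_body are hypothetical, and the name sort ignores Git's special ordering rule for directories.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a Git object: a "<kind> <size>" header, NUL, body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()

def tree_body(entries) -> bytes:
    """Serialize (mode, name, raw_oid) entries the way a tree object stores them."""
    out = b""
    # Each entry is "<mode> <name>", a NUL byte, then the raw 20-byte object ID.
    # Plain name sorting is a simplification of Git's real ordering.
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        out += mode + b" " + name + b"\0" + oid
    return out

readme = object_id(b"blob", b"hello\n")
copy_of_readme = object_id(b"blob", b"hello\n")  # same bytes, so same ID
root = object_id(b"tree", tree_body([
    (b"100644", b"README.md", readme),
    (b"100644", b"COPY.md", copy_of_readme),
]))
```

Because both entries carry the same blob ID, a store keyed by ID holds that content once, however many paths point at it.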

How git clone Communicates with the Remote

The clone operation is a structured conversation between your Git client and the remote server.

Advertisement Phase

The remote server advertises:

Its available references (e.g., refs/heads/main, refs/tags/v1.0)
Supported capabilities (e.g., side-band, ofs-delta, multi_ack)
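On the wire, the advertisement and the messages that follow are framed as pkt-lines: four hex digits giving the total length, then the payload. The article does not cover this framing, so treat the Python below as a supplementary sketch; the function names are hypothetical.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    # The length counts the 4 length bytes themselves.
    return f"{len(payload) + 4:04x}".encode("ascii") + payload

FLUSH = b"0000"  # flush-pkt: marks the end of a message section

def read_pkt_lines(stream: bytes):
    """Yield each payload in order, or None for a special packet such as flush."""
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length < 4:
            # Lengths below 4 are control packets (0000 is the flush-pkt).
            yield None
            pos += 4
        else:
            yield stream[pos + 4:pos + length]
            pos += length
```

A ref advertisement is then just a sequence of such lines, each naming an object ID and a ref, ended by a flush packet.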

Negotiation Phase

The client responds with:

Wants: commits it needs
Haves: commits it already has (empty for a fresh clone; used by incremental fetches)

The server analyzes the commit graph to determine exactly which objects the client lacks.
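Conceptually, the server's job is a set difference over the commit graph. The sketch below works on a toy history with made-up commit names and ignores trees and blobs; real servers use smarter traversal, so read it as an illustration of the idea only.

```python
def ancestors(parents, tips):
    """All commits reachable from `tips` by following parent links."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen

def commits_to_send(parents, wants, haves):
    """Commits reachable from the wants but not from the haves."""
    return ancestors(parents, wants) - ancestors(parents, haves)

# Toy history: C0 <- C1 <- C2 <- C3
PARENTS = {"C3": ["C2"], "C2": ["C1"], "C1": ["C0"], "C0": []}
```

With no haves, as in a fresh clone, everything reachable from the wants is sent; once the client reports C1, only C2 and C3 remain.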

Packfile Transfer Phase

The server:

Gathers all reachable objects from the requested commits
Delta‑compresses them for efficient transfer
Streams a single .pack file to the client

The client writes this pack to disk and builds a matching index (.idx) for fast lookups:

.git/objects/pack/pack-XXXX.pack
.git/objects/pack/pack-XXXX.idx
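A .pack file starts with a 12-byte header (the signature PACK, a version, and an object count) and ends with a checksum of everything before it. The Python sketch below builds and checks an empty pack, assuming the SHA‑1 object format with its 20-byte trailer; the function names are hypothetical.

```python
import hashlib
import struct

def build_empty_pack() -> bytes:
    """Build a version-2 pack that contains zero objects."""
    header = b"PACK" + struct.pack(">II", 2, 0)   # signature, version, count
    return header + hashlib.sha1(header).digest()  # trailer: SHA-1 of the rest

def read_pack_summary(data: bytes):
    """Return (version, object_count, checksum_ok) for a pack byte string."""
    if data[:4] != b"PACK":
        raise ValueError("not a packfile")
    version, count = struct.unpack(">II", data[4:12])
    checksum_ok = hashlib.sha1(data[:-20]).digest() == data[-20:]
    return version, count, checksum_ok
```

Running read_pack_summary on the bytes of a real pack-XXXX.pack should report the number of objects it holds and whether the trailer matches.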

Protocol Flow Overview

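Client                          Server
  |          ls-refs              |
  |------------------------------>|
  |       refs + capabilities     |
  |<------------------------------|
  |       want(s) / have(s)       |
  |------------------------------>|
  |        ACK/NAK + pack         |
  |<------------------------------|

Repository Layout After Cloning

.git
├── HEAD                 -> "ref: refs/heads/main"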
├── config               -> [remote "origin"]
├── refs
│   ├── heads/main
│   ├── remotes/origin/main
│   └── tags/
└── objects
    ├── pack/
    │   ├── pack-XYZ.pack
    │   └── pack-XYZ.idx
    └── info/

Key components

.git/objects/pack: packed object store
.git/refs/heads: local branches
.git/refs/remotes/origin: remote‑tracking branches
.git/index: staging cache
.git/HEAD: symbolic reference to the current branch
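To see how these files fit together, the Python sketch below follows HEAD to a commit ID. It is a simplification that reads only loose ref files and ignores packed-refs; resolve_head and demo are hypothetical names, and demo builds a throwaway layout rather than touching a real repository.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit ID, looking only at loose ref files."""
    value = (git_dir / "HEAD").read_text().strip()
    # A symbolic ref looks like "ref: refs/heads/main"; keep following it.
    while value.startswith("ref: "):
        value = (git_dir / value[len("ref: "):]).read_text().strip()
    return value  # a detached HEAD already holds a commit ID

def demo() -> str:
    """Build a minimal .git-like layout and resolve its HEAD."""
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp)
        (git_dir / "refs" / "heads").mkdir(parents=True)
        (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
        (git_dir / "refs" / "heads" / "main").write_text("a" * 40 + "\n")
        return resolve_head(git_dir)
```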

How Git Checkout Creates Files

The checkout process transforms database objects into real files:

Read HEAD → resolve branch → resolve commit
Read the commit’s root tree
Traverse the tree and write each blob to the working directory
Cache path–blob mappings in the index
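The steps above can be sketched in Python against an in-memory object store. The store layout, the IDs (T_root, B1, and so on), and the function names are all made up for illustration; real Git also handles file modes, symlinks, and an on-disk index format.

```python
import tempfile
from pathlib import Path

def checkout(objects, tree_id, dest: Path, prefix=""):
    """Write every blob under `tree_id` into `dest`; return {path: blob_id}."""
    index = {}
    for name, (kind, oid) in objects[tree_id].items():
        path = prefix + name
        if kind == "tree":
            (dest / path).mkdir(parents=True, exist_ok=True)
            index.update(checkout(objects, oid, dest, path + "/"))
        else:
            (dest / path).write_bytes(objects[oid])
            index[path] = oid  # the index caches the path -> blob ID mapping
    return index

# Hypothetical object store: trees map names to (kind, id), blobs are bytes.
OBJECTS = {
    "T_root": {"README.md": ("blob", "B1"), "src": ("tree", "T_src")},
    "T_src": {"main.go": ("blob", "B3")},
    "B1": b"# demo\n",
    "B3": b"package main\n",
}

def demo():
    """Check out T_root into a temp directory; return the index and one file."""
    with tempfile.TemporaryDirectory() as tmp:
        index = checkout(OBJECTS, "T_root", Path(tmp))
        return index, (Path(tmp) / "src" / "main.go").read_bytes()
```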

HEAD -> refs/heads/main -> commit C -> tree T_root
                                   |-> blobs -> files
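
Packfiles and Delta Compression

Inside a packfile, each object is stored either in full or as a delta against a base object:

[... -> base OBJ_A]
[OBJ_C full]
...
[checksum]

Storing only the differences from a base object significantly reduces both disk usage and network transfer size.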
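A delta is essentially a list of "copy these bytes from the base" and "insert these new bytes" instructions. The Python sketch below uses plain tuples for readability; Git's real deltas encode the same two kinds of instruction in a compact binary form, which is not reproduced here, and the names are hypothetical.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a target object from a base plus copy/insert instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":      # ("copy", offset, size): reuse bytes of the base
            _, offset, size = op
            out += base[offset:offset + size]
        else:                    # ("insert", data): literal new bytes
            out += op[1]
    return bytes(out)

BASE = b"line one\nline two\nline three\n"
# Keep the first and last lines of BASE, replace the middle one.
DELTA = [("copy", 0, 9), ("insert", b"line 2\n"), ("copy", 18, 11)]
```

Only the 7 inserted bytes are new data here; the rest of the 27-byte result is described by two small copy instructions.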

Data Integrity and Security

Every object’s hash covers both its header and content—change any byte, and the hash changes.
Commits link via parent hashes, creating a verifiable chain of trust.
Tools such as git fsck and git verify-pack detect corruption.
Signed commits and tags add cryptographic authenticity.

Git’s integrity guarantee comes from hash linkage: altering any object changes its ID and, in turn, the ID of every object that references it.
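The core of such a check is simply rehashing what is stored and comparing it with the ID it is stored under. The Python sketch below mimics zlib-compressed loose objects in a dictionary, assuming SHA‑1 IDs; it is a toy model with hypothetical names, not a description of how git fsck is implemented.

```python
import hashlib
import zlib

def store(objects: dict, kind: bytes, body: bytes) -> str:
    """Store an object zlib-compressed under its SHA-1 ID, like a loose object."""
    raw = kind + b" " + str(len(body)).encode("ascii") + b"\0" + body
    oid = hashlib.sha1(raw).hexdigest()
    objects[oid] = zlib.compress(raw)
    return oid

def corrupt_ids(objects: dict):
    """Return the IDs whose stored bytes no longer hash to their key."""
    return [oid for oid, data in objects.items()
            if hashlib.sha1(zlib.decompress(data)).hexdigest() != oid]
```

Changing even one byte of a stored object makes its ID show up in corrupt_ids, because the recomputed hash no longer matches the key.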

Example: Minimal Repository Flow

Initial commit C0 → tree T0 → blob B1 (README)
Next commit C1 → tree T1 → blob B2 (modified README)
Server packs {C1, C0, T1, T0, B2, B1}
Client writes pack → sets refs → checks out C1 → files appear

Visual summary

refs/heads/main -> C3 -> C2 -> C1 -> C0

Each commit points to its root tree; trees link to blobs; references point to commits—forming a single, content‑addressed DAG.

Key Mental Models

Git is a database, not a filesystem. Every file, directory, and commit is an immutable object in a key–value store.
Cloning = graph download + reference binding. You fetch an object graph, then assign human‑readable names (branches, tags).
The working tree = a view of one tree object. Switching branches simply changes which tree object you’re viewing.
The index = staging area and performance cache. It records what the next commit will contain and speeds up diffing by tracking file stats and blob IDs.
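The stat-tracking idea can be sketched in Python: if a file's cheap stat data is unchanged, there is no need to reread and rehash it. This is a simplified model with hypothetical names; the real index records more fields per entry than the size and modification time used here.

```python
import os
import tempfile

def stat_signature(path: str):
    """Cheap fingerprint of a file: (size, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def needs_rehash(path: str, cached) -> bool:
    """True if the file's stat data differs from the cached entry."""
    return stat_signature(path) != cached

def demo():
    """Return (changed_before_edit, changed_after_edit) for a temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "README.md")
        with open(path, "wb") as f:
            f.write(b"v1\n")
        cached = stat_signature(path)
        before = needs_rehash(path, cached)
        with open(path, "ab") as f:
            f.write(b"more\n")  # the size changes, so the stat data must differ
        return before, needs_rehash(path, cached)
```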

Closing Thoughts

git clone doesn’t just copy files. It reconstructs a graph‑based database of snapshots, hashes, and relationships. Understanding this process gives you a more predictable, transparent view of how Git actually manages your code—and why it’s so efficient at doing so.
