How git clone Really Works: A Deep Dive into Git’s Object Database
Source: Dev.to
What git clone Actually Does
Git performs the following steps:
- Negotiates with the remote to discover available references (branches, tags).
- Downloads the full object graph — all commits, trees, and blobs reachable from those references — efficiently packed and delta‑compressed.
- Writes these objects into .git/objects/pack/, sets up local refs and HEAD, and then checks out a working directory from the root tree of the checked‑out commit.
In essence:
clone = copy the object graph + set references + checkout the working tree
The Git Object Model: Core Building Blocks
Git is a content‑addressed database, not a traditional filesystem. Every file, directory, commit, and tag exists as an immutable object identified by a cryptographic hash (SHA‑1 or SHA‑256). This makes Git’s data model tamper‑evident, deduplicated, and verifiable.
| Type | Purpose | Contains |
|---|---|---|
| Blob | File data | Raw bytes and a header |
| Tree | Directory snapshot | Mode, name, and object IDs for children |
| Commit | Snapshot metadata | Author, message, parent commits, root tree |
| Tag | Annotated reference | Tag message and pointer |
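As a minimal sketch of what "content-addressed" means in practice, the snippet below hashes a blob the way Git does: a short header (`<type> <size>` plus a NUL byte) followed by the raw bytes. The function name `git_object_id` is made up for this illustration, and SHA-1 is assumed as the default.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    # Git hashes "<type> <size>" + NUL + content, not the bare content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()

blob_id = git_object_id("blob", b"hello world\n")
print(blob_id)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The printed ID should match what `git hash-object` reports for a file containing the same bytes. Passing `algo="sha256"` gives the ID a SHA-256 repository would use.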
The Object Graph
commit C
│ tree -> T_root
│ ├── mode 100644 "README.md" -> blob B1
│ ├── mode 100755 "build.sh" -> blob B2
│ └── mode 040000 "src" -> tree T_src
│ ├── "main.go" -> blob B3
│ └── "util.go" -> blob B4
│
└── parent -> commit P
│ tree -> T_prev
└── parent -> ...
Key ideas
- A commit points to a tree, which represents a snapshot of the repository.
- Trees point to blobs (files) or other subtrees (directories).
- Commits form a Directed Acyclic Graph (DAG) through parent references.
- Identical content produces identical hashes, so Git automatically reuses objects.
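The key ideas above can be sketched with a tiny in-memory object store. This is an illustration under assumptions, not Git's implementation: `store`, `put`, and `tree_body` are hypothetical names, and only SHA-1 is handled. It shows a tree referencing its children by ID and two identical files collapsing into one stored blob.

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> bytes:
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).digest()

def tree_body(entries) -> bytes:
    # Each record is "<mode> <name>" + NUL + raw 20-byte child ID.
    # Entries are sorted by name; directories compare as if they ended in "/".
    def sort_key(entry):
        mode, name, _ = entry
        return name.encode() + (b"/" if mode == "40000" else b"")
    return b"".join(
        f"{mode} {name}".encode() + b"\x00" + oid
        for mode, name, oid in sorted(entries, key=sort_key)
    )

store = {}  # object ID -> (type, body); a stand-in for .git/objects

def put(obj_type: str, body: bytes) -> bytes:
    oid = object_id(obj_type, body)
    store[oid] = (obj_type, body)  # same content, same key: stored once
    return oid

readme = put("blob", b"hello world\n")
copy = put("blob", b"hello world\n")  # identical content, identical ID
root = put("tree", tree_body([
    ("100644", "README.md", readme),
    ("100644", "COPY.md", copy),
]))
print(len(store), root.hex())
```

Two files were added, yet the store holds only one blob plus the tree, because both file entries point at the same object ID.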
How git clone Communicates with the Remote
The clone operation is a structured conversation between your Git client and the remote server.
Advertisement Phase
The remote server advertises:
- Its available references (e.g., refs/heads/main, refs/tags/v1.0)
- Supported capabilities (e.g., side-band, ofs-delta, multi_ack)
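To make the advertisement concrete, here is a rough sketch of parsing one. It assumes the older (v0/v1) smart-protocol framing, where each line is a "pkt-line" with a four-hex-digit length prefix and the capability list follows a NUL byte on the first ref line; protocol v2 structures this exchange differently. The object IDs are placeholders and the function names are hypothetical.

```python
def pkt_line(payload: bytes) -> bytes:
    # Length prefix is 4 hex digits and counts itself plus the payload.
    return f"{len(payload) + 4:04x}".encode() + payload

def read_pkt_lines(data: bytes):
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4].decode(), 16)
        if length == 0:  # "0000" is a flush-pkt that ends the list
            return
        yield data[pos + 4:pos + length]
        pos += length

def parse_advertisement(data: bytes):
    refs, capabilities = {}, []
    for line in read_pkt_lines(data):
        line = line.rstrip(b"\n")
        if b"\x00" in line:  # capabilities follow a NUL on the first ref line
            line, caps = line.split(b"\x00", 1)
            capabilities = caps.decode().split()
        oid, name = line.decode().split(" ", 1)
        refs[name] = oid
    return refs, capabilities

ID_MAIN, ID_TAG = "a" * 40, "b" * 40  # placeholder object IDs
sample = (
    pkt_line(f"{ID_MAIN} refs/heads/main".encode()
             + b"\x00side-band ofs-delta multi_ack\n")
    + pkt_line(f"{ID_TAG} refs/tags/v1.0\n".encode())
    + b"0000"
)
refs, caps = parse_advertisement(sample)
print(refs, caps)
```

A real server sends more than this (a service preamble over HTTP, a HEAD entry, many more capabilities), so treat the sample as a shape to recognize rather than a transcript.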
Negotiation Phase
The client responds with:
- Wants: commits it needs
- Haves: commits it already has (none for a fresh clone; used by later incremental fetches)
The server analyzes the commit graph to determine exactly which objects the client lacks.
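Conceptually, the server's job here is a set difference over reachability. The sketch below models that on a toy graph; the dictionary layout and the function names are assumptions for illustration, and real servers use far more efficient traversal and bitmap indexes.

```python
def reachable(graph, starts):
    # Walk every edge (commit -> parents and tree, tree -> children).
    seen, stack = set(), list(starts)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(graph.get(oid, ()))
    return seen

def objects_to_send(graph, wants, haves):
    # Everything reachable from the wants, minus what the client can already reach.
    return reachable(graph, wants) - reachable(graph, haves)

# Toy graph: two commits, each with its own tree and README blob.
GRAPH = {
    "C1": ["C0", "T1"],
    "C0": ["T0"],
    "T1": ["B2"],
    "T0": ["B1"],
}

fresh_clone = objects_to_send(GRAPH, wants={"C1"}, haves=set())
later_fetch = objects_to_send(GRAPH, wants={"C1"}, haves={"C0"})
print(sorted(fresh_clone))  # ['B1', 'B2', 'C0', 'C1', 'T0', 'T1']
print(sorted(later_fetch))  # ['B2', 'C1', 'T1']
```

With no haves, the whole graph is sent, which is exactly the clone case. Once the client reports `C0`, only the new commit, tree, and blob remain.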
Packfile Transfer Phase
The server:
- Gathers all reachable objects from the requested commits
- Delta‑compresses them for efficient transfer
- Streams a single .pack file to the client
The client writes this pack into:
.git/objects/pack/pack-XXXX.pack
.git/objects/pack/pack-XXXX.idx
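For a feel of what lands in the `.pack` file, the sketch below builds and inspects the smallest possible pack: a 12-byte header (the signature `PACK`, a version, an object count) followed by a trailing checksum over everything before it. It assumes a SHA-1 repository (20-byte trailer) and pack version 2; the helper names are invented for this example.

```python
import hashlib
import struct

def build_empty_pack() -> bytes:
    # Header: b"PACK", version 2, zero objects (all big-endian 32-bit).
    body = b"PACK" + struct.pack(">II", 2, 0)
    return body + hashlib.sha1(body).digest()

def read_pack_header(data: bytes):
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    return version, count

def trailer_ok(data: bytes) -> bool:
    # The last 20 bytes are the SHA-1 of all preceding bytes.
    return hashlib.sha1(data[:-20]).digest() == data[-20:]

pack = build_empty_pack()
print(read_pack_header(pack), trailer_ok(pack))  # (2, 0) True
```

The same two functions can be pointed at the bytes of a real `pack-*.pack` file to read its object count and confirm the trailer; decoding the entries in between is a larger job.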
Protocol Flow Overview
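Client                          Server
  |            ls-refs            |
  |------------------------------>|
  |      refs + capabilities      |
  |<------------------------------|
  |            want(s)            |
  |------------------------------>|
  |            have(s)            |
  |------------------------------>|
  |        ACK/NAK + pack         |
  |<------------------------------|
Repository Layout After Clone
.git
├── HEAD -> "ref: refs/heads/main"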
├── config -> [remote "origin"]
├── refs
│ ├── heads/main
│ ├── remotes/origin/main
│ └── tags/
└── objects
├── pack/
│ ├── pack-XYZ.pack
│ └── pack-XYZ.idx
└── info/
Key components
- .git/objects/pack: packed object store
- .git/refs/heads: local branches
- .git/refs/remotes/origin: remote‑tracking branches
- .git/index: staging cache
- .git/HEAD: symbolic reference to the current branch
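To show how these files fit together, here is a small sketch that follows `HEAD` to a commit ID by reading the files directly. It is an approximation under assumptions: `resolve_head` is a hypothetical helper, it looks only at loose ref files and `packed-refs`, and the demo runs against a throwaway directory with a placeholder commit ID rather than a real repository.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir):
    # Returns (ref_name, commit_id); ref_name is None for a detached HEAD.
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD stores a commit ID directly
    ref = head[len("ref: "):]
    loose = git_dir / ref
    if loose.is_file():
        return ref, loose.read_text().strip()
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            if line.startswith(("#", "^")):
                continue  # header line and peeled-tag lines
            oid, _, name = line.partition(" ")
            if name == ref:
                return ref, oid
    return ref, None  # branch named but not yet born

FAKE_ID = "1" * 40  # placeholder commit ID
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    (git_dir / "refs" / "heads" / "main").write_text(FAKE_ID + "\n")
    resolved = resolve_head(git_dir)
print(resolved)
```

In everyday use `git rev-parse HEAD` does this job and handles cases the sketch ignores.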
How Git Checkout Creates Files
The checkout process transforms database objects into real files:
- Read HEAD → resolve branch → resolve commit
- Read the commit’s root tree
- Traverse the tree and write each blob to the working directory
- Cache path–blob mappings in the index
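The steps above can be sketched as a recursive walk over an in-memory object store. Everything here is a simplified model: the `OBJECTS` dictionary, the short IDs like `B1`, and the file contents are invented for illustration, and the returned "index" records only path-to-blob mappings, whereas Git's real index also stores file stat data.

```python
import os
import tempfile
from pathlib import Path

def checkout(objects, tree_id, dest, index=None, prefix=""):
    index = {} if index is None else index
    for name, (mode, oid) in objects[tree_id].items():
        rel = prefix + name
        target = Path(dest) / rel
        if mode == "040000":  # subtree: create the directory and recurse
            target.mkdir(exist_ok=True)
            checkout(objects, oid, dest, index, rel + "/")
        else:  # blob: write the bytes and apply the executable bit if set
            target.write_bytes(objects[oid])
            os.chmod(target, 0o755 if mode == "100755" else 0o644)
            index[rel] = oid
    return index

OBJECTS = {
    "T_root": {
        "README.md": ("100644", "B1"),
        "build.sh": ("100755", "B2"),
        "src": ("040000", "T_src"),
    },
    "T_src": {"main.go": ("100644", "B3"), "util.go": ("100644", "B4")},
    "B1": b"# demo\n",
    "B2": b"#!/bin/sh\necho build\n",
    "B3": b"package main\n",
    "B4": b"package main\n",
}

with tempfile.TemporaryDirectory() as tmp:
    index = checkout(OBJECTS, "T_root", tmp)
    main_go = (Path(tmp) / "src" / "main.go").read_bytes()
print(sorted(index))
```

The walk never consults commit history: one root tree is enough to materialize the whole working directory.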
HEAD -> refs/heads/main -> commit C -> tree T_root
                                         |-> blobs -> files in the working tree
A packfile stores some objects in full and others as deltas against a base object:
[... -> base OBJ_A]
[OBJ_C full]
...
[checksum]
This mechanism significantly reduces both disk usage and network transfer size.
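A deltified object is stored as a small program of "copy this range from the base" and "insert these literal bytes" instructions. The sketch below applies such a delta, following the instruction encoding described in Git's pack format documentation; the function names and the hand-built sample delta are this example's own, and error handling is minimal.

```python
def read_varint(data: bytes, pos: int):
    # Little-endian base-128 integer: 7 bits per byte, high bit means "more".
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def apply_delta(base: bytes, delta: bytes) -> bytes:
    base_size, pos = read_varint(delta, 0)
    result_size, pos = read_varint(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was made for a different base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:  # copy: low 4 bits flag offset bytes, next 3 flag size bytes
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000
            out += base[offset:offset + size]
        elif cmd:  # insert: the next `cmd` bytes are literal data
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("reserved delta instruction")
    if len(out) != result_size:
        raise ValueError("unexpected result size")
    return bytes(out)

base = b"hello world\n"
delta = (
    bytes([12, 18])       # base size, result size
    + b"\x90\x06"         # copy 6 bytes from offset 0 -> "hello "
    + b"\x06brave "       # insert 6 literal bytes
    + b"\x91\x06\x06"     # copy 6 bytes from offset 6 -> "world\n"
)
print(apply_delta(base, delta))  # b'hello brave world\n'
```

The 18-byte result is described by a 14-byte delta here; the saving grows quickly when a large file changes by only a few lines.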
Data Integrity and Security
- Every object’s hash covers both its header and content—change any byte, and the hash changes.
- Commits link via parent hashes, creating a verifiable chain of trust.
- Tools such as git fsck and git verify-pack detect corruption.
- Signed commits and tags add cryptographic authenticity.
Git’s integrity model is mathematical: every link between objects is a hash, so the chain is trustworthy for as long as the hash function resists collisions.
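The chaining effect can be demonstrated with a pair of hand-built commit objects. This is a sketch with made-up identities and a fixed timestamp; `commit_body` is a hypothetical helper that produces the usual text layout (a `tree` line, `parent` lines, author and committer lines, a blank line, then the message). The empty tree's well-known ID is used so no tree needs to be built.

```python
import hashlib

def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree: str, parents, message: str) -> bytes:
    who = "A U Thor <author@example.com> 0 +0000"  # made-up identity and time
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

c0 = object_id("commit", commit_body(EMPTY_TREE, [], "initial"))
c1 = object_id("commit", commit_body(EMPTY_TREE, [c0], "second"))

# Rewrite history: change one character of the first commit's message.
c0_forged = object_id("commit", commit_body(EMPTY_TREE, [], "Initial"))
c1_rebuilt = object_id("commit", commit_body(EMPTY_TREE, [c0_forged], "second"))

print(c0 != c0_forged, c1 != c1_rebuilt)  # True True
```

Because `c1` embeds its parent's ID, altering the parent forces a new ID for every descendant, which is why a branch tip's hash vouches for the whole history behind it.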
Example: Minimal Repository Flow
- Initial commit C0 → tree T0 → blob B1 (README)
- Next commit C1 → modifies README → tree T1 → blob B2
- Server packs {C1, C0, T1, T0, B2, B1}
- Client writes pack → sets refs → checks out C1 → files appear
Visual summary
refs/heads/main -> C3 -> C2 -> C1 -> C0
Each commit points to its root tree; trees link to blobs; references point to commits—forming a single, content‑addressed DAG.
Key Mental Models
- Git is a database, not a filesystem. Every file, directory, and commit is an immutable object in a key–value store.
- Cloning = graph download + reference binding. You fetch an object graph, then assign human‑readable names (branches, tags).
- The working tree = a view of one tree object. Switching branches simply changes which tree object you’re viewing.
- The index = a performance cache. It speeds up diffing and staging by tracking file stats and blob IDs.
Closing Thoughts
git clone doesn’t just copy files. It reconstructs a graph‑based database of snapshots, hashes, and relationships. Understanding this process gives you a more predictable, transparent view of how Git actually manages your code—and why it’s so efficient at doing so.