How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks
Source: Dev.to
TL;DR
Our Jenkins started failing on 1/26, but the root cause began on 1/13. We discovered three compounding issues:
- Shared Library cache was disabled (existing issue)
- Switched to disposable agents (1/13 change)
- Increased build frequency (New Year effect)
Result: ≈ 50× increase in metadata IOPS → EFS burst credits drained over two weeks.
Why You Should Care
If you’re running Jenkins on EFS, this can happen to you. The symptoms appear suddenly, but the root cause often starts weeks earlier. Time‑series analysis of metrics is crucial.
The Mystery: Symptoms vs. Root Cause
Previously, I wrote about how Jenkins became slow and Git clones started failing. We found ~15 GB of Git temporary files (tmp_pack_*) accumulated on EFS, causing metadata IOPS exhaustion.
We fixed it with Elastic Throughput and cleanup jobs. Case closed, right?
Not quite.
When I checked the EFS Burst Credit Balance graph, I noticed something important:
The credit started declining around 1/13, but symptoms appeared on 1/26.
Timeline
| Date | Event |
|---|---|
| 1/13 | Credit decline starts |
| 1/19 | Rapid decline |
| 1/26 | Credit bottoms out |
| 1/26‑27 | Symptoms appear |
The tmp_pack_* accumulation was a symptom, not the root cause. Something changed on 1/13.
What Changed on 1/13?
Honestly, this stumped me. I had a few ideas, but nothing definitive:
1. Agent Architecture Change
Around 1/13 we changed our Jenkins agent strategy:
| Before (Shared Agents) | After (Disposable Agents) |
|---|---|
| EC2 type: c5.large, etc. | EC2 type: t3.small, etc. |
| Multiple jobs share agents | One agent per job, destroyed after use |
| Workspace reuse | Full git clone every time |
| `git pull` for incremental updates | `git clone` (full clone) every time |
The goal was cost reduction. We didn’t consider the metadata IOPS impact.
2. Post‑New‑Year Development Rush
Teams ramped up development after the New Year holiday, increasing overall Jenkins load.
The Math: 50× Metadata IOPS Increase
- Builds per day: 50 (estimated)
- Files created per clone: 5,000

Shared-agent approach: clone once = 5,000 metadata operations.

Disposable-agent approach: 50 builds × 5,000 files = 250,000 metadata operations/day.
≈ 50× increase in metadata IOPS.
Add the New‑Year rush, and the numbers get even worse.
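The arithmetic above can be sanity-checked in a few lines of shell (the build count and files-per-clone figure are the article's estimates):

```shell
#!/bin/sh
# Estimates from the incident analysis above.
BUILDS_PER_DAY=50        # estimated builds/day
FILES_PER_CLONE=5000     # files a full clone writes (metadata ops)

# Shared agents: one clone, then workspace reuse.
SHARED_OPS=$FILES_PER_CLONE
# Disposable agents: a full clone on every build.
DISPOSABLE_OPS=$((BUILDS_PER_DAY * FILES_PER_CLONE))

echo "shared agents:     ${SHARED_OPS} metadata ops/day"
echo "disposable agents: ${DISPOSABLE_OPS} metadata ops/day"
echo "increase:          $((DISPOSABLE_OPS / SHARED_OPS))x"
```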
Understanding Git Cache in Jenkins
During investigation I noticed the directory:
`/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e`
This is the Jenkins Git plugin’s bare‑repository cache.
How Git Caching Works
The plugin optimises clones by:
- Caching remote repos in `/mnt/efs/jenkins/caches/git-{hash}/` as bare repositories.
- Cloning to job workspaces with `git clone --reference` from this cache.
- Generating the hash from the repo URL + branch.
Problem: Disposable agents may not benefit from this cache because they are new for every build.
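The reference-clone mechanism is easy to see in a tiny standalone demo (the repo names and paths here are illustrative, not the plugin's actual hash scheme):

```shell
#!/bin/sh
set -e
WORK=$(mktemp -d); cd "$WORK"

# Stand-in for the remote repository.
git init -q remote
git -C remote -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "init"

# 1. Bare cache, analogous to /mnt/efs/jenkins/caches/git-{hash}/.
git clone -q --bare remote cache.git

# 2. Workspace clone that borrows objects from the cache instead of
#    re-fetching them, so far fewer object files are written.
git clone -q --reference "$WORK/cache.git" remote workspace

# The workspace records the cache as an alternate object store:
cat workspace/.git/objects/info/alternates
```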
The Smoking Gun: tmp_pack_* Location
I revisited where the tmp_pack_* files lived:
jobs/sample-job/jobs/sample-pipeline/builds/104/libs/
└── 335abf.../root/.git/objects/pack/
└── tmp_pack_WqmOyE ← 100‑300 MB
These are in per‑build directories:
jobs/sample-job/jobs/sample-pipeline/
└── builds/
├── 104/
│ └── libs/.../tmp_pack_WqmOyE
├── 105/
│ └── libs/.../tmp_pack_XYZ123
└── 106/
└── libs/.../tmp_pack_ABC456
Every build was re‑checking out the Pipeline Shared Library, generating tmp_pack_* each time.
Question: Why is the Shared Library being fetched on every build?
Root Cause: Cache Setting Was OFF
In Jenkins configuration I found the smoking gun:
The Shared Library setting “Cache fetched versions on controller for quick retrieval” was unchecked.
Consequences:
- Shared Library cache completely disabled.
- Full fetch from remote repository on every build.
- Temporary files generated in `.git/objects/pack/`.
- Massive metadata IOPS consumption.
The Fix: Enable Caching
- Enable “Cache fetched versions on controller for quick retrieval”.
- Set “Refresh time in minutes” to 180.
Choosing the Refresh Time
| Refresh interval | Effect |
|---|---|
| 60‑120 min | Fast updates, moderate IOPS reduction |
| 180 min (3 h) | Balanced – ~8 updates/day |
| 360 min (6 h) | Stable – ~4 updates/day |
| 1440 min (24 h) | Maximum IOPS reduction |
Why 180 min?
- The library is re-fetched at most ~8 times/day (9 am, 12 pm, 3 pm, 6 pm…).
- Having Shared Library changes reflected within half a day is acceptable.
- Significant IOPS reduction (once per 3 h instead of every build).
- Urgent changes can be forced via the “force refresh” feature.
I documented this in our runbook so we don’t forget.
Measuring the Impact
| Period | Expected Observation |
|---|---|
| Short‑term (24‑48 h) | No new tmp_pack_* files; metadata IOPS drop |
| Mid‑term (1 week) | Burst Credit Balance recovery trend; stable build performance |
| Long‑term (1 month) | Credits remain stable; no recurrence |
Lessons Learned
1. Symptoms ≠ Root Cause Timeline
- Symptom appearance: 1/26‑1/27
- Root cause: Around 1/13
- Credit depletion: Gradual over two weeks
Time‑series analysis is crucial. Fixing only visible symptoms leads to superficial solutions.
2. Architecture Changes Have Hidden Costs
The disposable‑agent change reduced EC2 costs but created problems elsewhere.
When changing architecture:
- Evaluate performance impact beforehand.
- Set up appropriate monitoring.
3. EFS Metadata IOPS Characteristics
- Mass creation/deletion of small files is deadly.
- File count matters more than storage size.
- Burst mode requires credit management.
- Credit depletion happens gradually.
Especially with `.git/objects/` containing thousands of small files, behavior differs drastically from normal file I/O.
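A quick local experiment shows why `.git/objects/` dominates file counts: even a three-commit toy repo fans out into nine loose object files (a blob, a tree, and a commit per commit), so real repositories easily reach thousands.

```shell
#!/bin/sh
set -e
R=$(mktemp -d); cd "$R"
git init -q .
for i in 1 2 3; do
  echo "$i" > "file$i"
  git add "file$i"
  git -c user.email=ci@example.com -c user.name=ci commit -q -m "commit $i"
done
# One blob, one tree, and one commit object written per commit here:
find .git/objects -type f | wc -l
```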
4. Compound Root Causes
This issue wasn’t a single cause but three factors combining:
- Shared Library cache disabled (pre‑existing)
- Disposable agent switch (1/13)
- Increased builds (New Year)
Each alone might not have caused major issues, but together they exceeded the critical threshold.
Open Questions
While we enabled Shared Library caching, we’re still using disposable agents.
Can agent‑side Git cache be utilized effectively with disposable agents?
Possible approaches:
- Share EFS Git cache across all agents
- Extend agent lifecycle slightly for reuse across jobs
- Cache in S3 and sync on startup
Finding the right balance between cost and performance remains a challenge.
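The third idea could be sketched as an agent start-up hook along these lines. The bucket name and cache-path variable are hypothetical, and the sync is skipped when the AWS CLI or credentials are unavailable:

```shell
#!/bin/sh
# Hypothetical warm-up: restore the bare-repo Git cache from S3 before
# the agent's first clone, so even a fresh agent gets reference clones.
CACHE_BUCKET="s3://example-jenkins-git-cache"        # hypothetical bucket
LOCAL_CACHE="${AGENT_CACHE_DIR:-$(mktemp -d)/caches}" # hypothetical path

mkdir -p "$LOCAL_CACHE"
if command -v aws >/dev/null 2>&1; then
  # `aws s3 sync` is incremental, so repeated start-ups stay cheap.
  aws s3 sync "$CACHE_BUCKET" "$LOCAL_CACHE" --quiet \
    || echo "cache warm-up skipped (no credentials?)"
else
  echo "aws CLI not found; starting with a cold cache"
fi
```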
I write more about technical decision‑making and engineering practices on my blog. Check it out.