How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks

Published: January 30, 2026 at 10:58 PM EST
4 min read
Source: Dev.to

TL;DR

Our Jenkins started failing on 1/26, but the root cause began on 1/13. We discovered three compounding issues:

  • Shared Library cache was disabled (existing issue)
  • Switched to disposable agents (1/13 change)
  • Increased build frequency (New Year effect)

Result: ≈ 50× increase in metadata IOPS → EFS burst credits drained over two weeks.

Why You Should Care

If you’re running Jenkins on EFS, this can happen to you. The symptoms appear suddenly, but the root cause often starts weeks earlier. Time‑series analysis of metrics is crucial.

The Mystery: Symptoms vs. Root Cause

Previously, I wrote about how Jenkins became slow and Git clones started failing. We found ~15 GB of Git temporary files (tmp_pack_*) accumulated on EFS, causing metadata IOPS exhaustion.

We fixed it with Elastic Throughput and cleanup jobs. Case closed, right?

Not quite.

When I checked the EFS Burst Credit Balance graph, I noticed something important:

The credit started declining around 1/13, but symptoms appeared on 1/26.

Timeline

| Date | Event |
| --- | --- |
| 1/13 | Credit decline starts |
| 1/19 | Rapid decline |
| 1/26 | Credit bottoms out |
| 1/26‑27 | Symptoms appear |

The tmp_pack_* accumulation was a symptom, not the root cause. Something changed on 1/13.

What Changed on 1/13?

Honestly, this stumped me. I had a few ideas, but nothing definitive:

1. Agent Architecture Change

Around 1/13 we changed our Jenkins agent strategy:

| Before (Shared Agents) | After (Disposable Agents) |
| --- | --- |
| EC2 type: c5.large, etc. | EC2 type: t3.small, etc. |
| Multiple jobs share agents | One agent per job, destroyed after use |
| Workspace reuse | Full git clone every time |
| git pull for incremental updates | git clone for full clones every time |

The goal was cost reduction. We didn’t consider the metadata IOPS impact.

2. Post‑New‑Year Development Rush

Teams ramped up development after the New Year holiday, increasing overall Jenkins load.

The Math: 50× Metadata IOPS Increase

Builds per day: 50 (estimated)
Files created per clone: 5,000

Shared‑agent approach:
  Clone once ≈ 5,000 metadata operations, then lightweight incremental pulls

Disposable‑agent approach:
  50 builds × 5,000 files = 250,000 metadata operations/day

≈ 50× increase in metadata IOPS.
Add the New‑Year rush, and the numbers get even worse.
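The back-of-the-envelope numbers above can be checked in a few lines (the build count and files-per-clone figures are the article's estimates, not measurements):

```python
# Estimates from the article: ~50 builds/day, ~5,000 files per full clone.
BUILDS_PER_DAY = 50
FILES_PER_CLONE = 5_000

# Shared agents: one initial clone, then lightweight incremental pulls.
shared_ops = FILES_PER_CLONE  # ~5,000 one-time metadata operations

# Disposable agents: every build starts from an empty workspace.
disposable_ops_per_day = BUILDS_PER_DAY * FILES_PER_CLONE

ratio = disposable_ops_per_day / shared_ops
print(f"{disposable_ops_per_day:,} metadata ops/day, ≈{ratio:.0f}× increase")
```

Even halving either estimate still leaves a 25× jump, so the conclusion is robust to the exact numbers.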

Understanding Git Cache in Jenkins

During investigation I noticed the directory:

/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e

This is the Jenkins Git plugin’s bare‑repository cache.

How Git Caching Works

The plugin optimises clones by:

  1. Caching remote repos in /mnt/efs/jenkins/caches/git-{hash}/ as bare repositories.
  2. Cloning to job workspaces using git clone --reference from this cache.
  3. Generating the hash from repo URL + branch.

Problem: Disposable agents may not benefit from this cache because they are new for every build.
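As a sketch of the mechanism (the cache path and repo URL are illustrative, and this is a simplified stand-in for what the Git plugin does internally): a reference clone records the cache in `.git/objects/info/alternates`, so existing objects are borrowed rather than re-downloaded.

```python
import subprocess
from pathlib import Path

def reference_clone(remote_url: str, cache_dir: str, workspace: str) -> None:
    """Clone `remote_url` into `workspace`, borrowing objects from a local
    bare-repo cache -- roughly the pattern behind the plugin's
    /caches/git-{hash} directories. Paths here are illustrative."""
    cache = Path(cache_dir)
    if not cache.exists():
        # Populate the cache once as a bare mirror of the remote.
        subprocess.run(["git", "clone", "--mirror", remote_url, str(cache)],
                       check=True)
    else:
        # Refresh the existing cache instead of re-cloning.
        subprocess.run(["git", "--git-dir", str(cache), "fetch", "--prune"],
                       check=True)
    # --reference writes .git/objects/info/alternates in the workspace clone,
    # so objects already in the cache are neither copied nor re-downloaded.
    subprocess.run(
        ["git", "clone", "--reference", str(cache), remote_url, workspace],
        check=True,
    )
```

The savings only materialise if the cache directory outlives the agent — which is exactly what a fresh disposable agent with a local cache does not get.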

The Smoking Gun: tmp_pack_* Location

I revisited where the tmp_pack_* files lived:

jobs/sample-job/jobs/sample-pipeline/builds/104/libs/
  └── 335abf.../root/.git/objects/pack/
      └── tmp_pack_WqmOyE  ← 100‑300 MB

These are in per‑build directories:

jobs/sample-job/jobs/sample-pipeline/
└── builds/
    ├── 104/
    │   └── libs/.../tmp_pack_WqmOyE
    ├── 105/
    │   └── libs/.../tmp_pack_XYZ123
    └── 106/
        └── libs/.../tmp_pack_ABC456

Every build was re‑checking out the Pipeline Shared Library, generating tmp_pack_* each time.
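A quick way to confirm the pattern is to scan the jobs tree for leftover pack files (the mount path in the usage comment is an example):

```python
from pathlib import Path

def find_tmp_packs(jobs_root: str):
    """Return (path, size_bytes) for every leftover Git tmp_pack_* file
    under the Jenkins jobs tree. Leftovers under builds/N/libs/ indicate
    the Shared Library is being re-fetched on every build."""
    hits = []
    for p in Path(jobs_root).rglob("tmp_pack_*"):
        if p.is_file():
            hits.append((p, p.stat().st_size))
    return hits

# Example usage (path is illustrative):
# for path, size in find_tmp_packs("/mnt/efs/jenkins/jobs"):
#     print(f"{size / 1e6:8.1f} MB  {path}")
```

If one such file shows up per recent build number, every build is doing a full Shared Library fetch.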

Question: Why is the Shared Library being fetched on every build?

Root Cause: Cache Setting Was OFF

In Jenkins configuration I found the smoking gun:

The Shared Library setting “Cache fetched versions on controller for quick retrieval” was unchecked.

Consequences:

  • Shared Library cache completely disabled.
  • Full fetch from remote repository on every build.
  • Temporary files generated in .git/objects/pack/.
  • Massive metadata IOPS consumption.

The Fix: Enable Caching

  1. Enable “Cache fetched versions on controller for quick retrieval”.
  2. Set “Refresh time in minutes” to 180.

Choosing the Refresh Time

| Refresh interval | Effect |
| --- | --- |
| 60‑120 min | Fast updates, moderate IOPS reduction |
| 180 min (3 h) | Balanced – ~8 updates/day |
| 360 min (6 h) | Stable – ~4 updates/day |
| 1440 min (24 h) | Maximum IOPS reduction |

Why 180 min?

  • The cache refreshes at most ~8 times/day (9 am, 12 pm, 3 pm, 6 pm…).
  • Having Shared Library changes reflected within half a day is acceptable.
  • Significant IOPS reduction (once per 3 h instead of every build).
  • Urgent changes can be forced via the “force refresh” feature.
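The trade-off in the table can be quantified: with caching on, remote fetches scale with the refresh interval instead of the build count (the 50-builds/day figure reuses the article's earlier estimate):

```python
MINUTES_PER_DAY = 1440
BUILDS_PER_DAY = 50  # the article's estimate for fetches with caching off

def fetches_per_day(refresh_minutes: int) -> float:
    """Upper bound on Shared Library remote fetches per day once caching
    is enabled: at most one fetch per refresh window."""
    return MINUTES_PER_DAY / refresh_minutes

for interval in (60, 180, 360, 1440):
    print(f"{interval:>5} min -> {fetches_per_day(interval):4.0f} fetches/day "
          f"(vs {BUILDS_PER_DAY}/day with caching off)")
```

At 180 minutes that is 8 fetches/day against ~50 per-build fetches, so the cache alone removes most of the Shared Library's metadata traffic.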

I documented this in our runbook so we don’t forget.

Measuring the Impact

| Period | Expected Observation |
| --- | --- |
| Short‑term (24‑48 h) | No new tmp_pack_* files; metadata IOPS drop |
| Mid‑term (1 week) | Burst Credit Balance recovery trend; stable build performance |
| Long‑term (1 month) | Credits remain stable; no recurrence |

Lessons Learned

1. Symptoms ≠ Root Cause Timeline

  • Symptom appearance: 1/26‑1/27
  • Root cause: Around 1/13
  • Credit depletion: Gradual over two weeks

Time‑series analysis is crucial. Fixing only visible symptoms leads to superficial solutions.

2. Architecture Changes Have Hidden Costs

The disposable‑agent change reduced EC2 costs but created problems elsewhere.

When changing architecture:

  • Evaluate performance impact beforehand.
  • Set up appropriate monitoring.

3. EFS Metadata IOPS Characteristics

  • Mass creation/deletion of small files is deadly.
  • File count matters more than storage size.
  • Burst mode requires credit management.
  • Credit depletion happens gradually.

Especially with .git/objects/ containing thousands of small files, behavior differs drastically from normal file I/O.

4. Compound Root Causes

This issue wasn’t a single cause but three factors combining:

  1. Shared Library cache disabled (pre‑existing)
  2. Disposable agent switch (1/13)
  3. Increased builds (New Year)

Each alone might not have caused major issues, but together they exceeded the critical threshold.

Open Questions

While we enabled Shared Library caching, we’re still using disposable agents.

Can agent‑side Git cache be utilized effectively with disposable agents?

Possible approaches:

  • Share EFS Git cache across all agents
  • Extend agent lifecycle slightly for reuse across jobs
  • Cache in S3 and sync on startup

Finding the right balance between cost and performance remains a challenge.


I write more about technical decision‑making and engineering practices on my blog. Check it out.
