Backpressure, Buffers, and Logging Sidecars

Published: February 15, 2026 at 02:32 AM EST
5 min read
Source: Dev.to

9 PM – The Crash That Started It All

I was casually scrolling through YouTube when a Slack notification interrupted the quiet: our logging sidecar had crashed – exit code 137.

At first glance it looked like a simple back‑pressure issue. “Increase a buffer, maybe tweak a limit, redeploy.”

It wasn’t.

That crash sent me down a rabbit hole into Fluent Bit’s chunk lifecycle, filesystem buffering, and the realization that even something as “simple” as a logging sidecar behaves like a miniature distributed system – with all the trade‑offs that implies.

In this post I’d like to share what I learned from that journey.

Our Setup

  • Fluent Bit runs as a sidecar alongside each application container.
  • Its responsibility is straightforward: collect logs from the application, process them, and forward them to an external logging platform.
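For reference, a minimal version of that topology looks roughly like this (a sketch, not our production config; the listener port and the output endpoint are placeholders):

```
[SERVICE]
    Flush     1
    Log_Level info

[INPUT]
    # Receive logs from the app container over the forward protocol
    Name   forward
    Listen 0.0.0.0
    Port   24224

[OUTPUT]
    # Ship to the external logging platform (host is a placeholder)
    Name  http
    Match *
    Host  logging.example.com
    Port  443
    tls   On
```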

Why Fluent Bit Can Go OOM

Most OOM cases boil down to an imbalance within the pipeline:

| Cause | Effect |
| --- | --- |
| Slow outputs – downstream systems lag while inputs keep ingesting logs | Back‑pressure builds → memory usage rises |
| Heavy filtering or processing – filters temporarily increase the in‑memory footprint | Same as above |
| Unbounded ingestion vs. buffer limits – log volume exceeds configured memory limits | Sidecar becomes the bottleneck |

The Visibility Problem

One of the tricky things about abrupt container crashes is the lack of visibility. When the sidecar was killed we lost the very signals we needed to debug it. Our container‑level memory graphs looked stable, which made the crash even more confusing.

Later we realized we weren’t observing memory in real time.

How we finally saw the spikes

  1. Enabled the Memory Input Plugin.
  2. Inspected internal metrics through Fluent Bit logs.

That’s when we finally saw it – sudden memory spikes that weren’t visible in our external monitoring.
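Concretely, that came down to two config changes (a sketch; the port and interval are illustrative):

```
[SERVICE]
    # Expose internal pipeline metrics (e.g., /api/v1/metrics, /api/v1/storage)
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port   2020

[INPUT]
    # Sample memory usage every second and emit it as records
    Name         mem
    Tag          internal.mem
    Interval_Sec 1
```

One caveat: the `mem` input reads `/proc/meminfo`, so inside a container it may report host-level rather than cgroup-level memory.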

Experimentation Timeline

Iteration 1 – “Will Limiting Memory Fix It?”

```
# Example: cap memory per input
Mem_Buf_Limit 5M
```
  • What we expected: limit memory → prevent OOM.
  • What actually happened:
    • When the limit is reached the input gets paused.
    • The application doesn’t pause; it keeps emitting logs.
    • Because we were using the forward input plugin (no upstream persistence), paused input meant log loss.

Result: memory pressure was controlled, but at the cost of reliability.
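In context, the limit sits on the input section (a sketch; the forward listener values are illustrative):

```
[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24224
    # Pause this input once ~5 MB of in-memory chunks accumulate;
    # with no upstream persistence, anything emitted while paused is lost
    Mem_Buf_Limit 5M
```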

Iteration 2 – “Can Disk Buffering Save Us?”

```
# Enable filesystem buffering
storage.type filesystem
```
  • Goal: shift pressure from RAM to disk, keep durability.
  • Outcome: the sidecar still crashed with an OOM error.

Disk buffering alone wasn’t enough.
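For completeness, this is roughly what the filesystem-buffering setup looked like (a sketch; paths and limits are illustrative):

```
[SERVICE]
    # Where chunks spill when the memory buffer fills
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    # Cap memory used to load backlog chunks back from disk
    storage.backlog.mem_limit 5M

[INPUT]
    Name          forward
    Listen        0.0.0.0
    Port          24224
    Mem_Buf_Limit 5M
    # Overflow to disk instead of pausing the input outright
    storage.type  filesystem
```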

Iteration 3 – “Where Else Is Memory Going?”

We realized that inputs aren’t the only RAM consumers – filters and outputs also eat memory.

| Component | Memory Impact |
| --- | --- |
| Filters / Parsers (e.g., JSON, Multiline) | Hold records temporarily; the JSON parser was a major contributor in our case. |
| Outputs | May compress, retry, or reload chunks from disk. Even with filesystem buffering, chunks must be brought back into memory before flushing. |

What we tried

  • Increased container CPU & memory – obvious but defeats the “lightweight sidecar” goal.
  • Moved part of the parsing downstream – reduced parser memory pressure.
  • Tuned output backlog limits – limited how many chunks are loaded into RAM at once.
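The backlog tuning amounted to a couple of service-level knobs (a sketch; the numbers are illustrative, not our production values):

```
[SERVICE]
    # Limit how many chunks can be held "up" in memory at once
    storage.max_chunks_up     64
    # Cap memory spent re-loading backlog chunks from disk
    storage.backlog.mem_limit 5M
```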

Things remained stable… until the next crash.

A New Symptom – Exit Code 139 (Segmentation Fault)

```
[2019/01/09 17:06:01] [error] [plugins/in_forward/forward_fs.c:218 errno=28] No space left on device
[2019/01/09 17:06:01] [error] [in_forward] could not register file into fs_events
```

The container exited with 139 – a segfault.
The logs revealed errno=28 → “No space left on device.”

How a logging sidecar exhausted disk

  • ECS Fargate gives each task 20 GB of ephemeral storage – seemingly plenty for buffering.
  • Comparing input ingestion metrics with output flush metrics showed a clear pattern: ingestion rate ≫ flush rate.

What actually happened

  1. Memory buffer fills up.
  2. Chunks spill over to disk.
  3. Outputs load chunks from disk and attempt to flush.
  4. A sudden ingestion spike occurs while the flush rate stays steady (or capped).
  5. Disk usage grows faster than it can be drained → disk fills → ENOSPC → segfault.

Given enough sustained imbalance, filling the disk is not just possible – it’s inevitable. If, say, ingestion holds at 10 MB/s while flushes drain only 4 MB/s, the backlog grows by 6 MB every second, and 20 GB of ephemeral storage is gone in under an hour.

Rate‑Limiting & Mitigation Strategies

| Layer | Action |
| --- | --- |
| Application | Sampling; deduplication; enforce strict production log levels |
| Fluent Bit | Cap disk usage per output; drop oldest chunks when limits are hit; apply a throttle filter for burst control |
| Operations | Alert when ingestion consistently exceeds processing capacity; increase ephemeral storage (short‑term buffer, not a permanent fix) |
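On the Fluent Bit layer, the disk cap and burst control translate roughly into this (a sketch; rates, sizes, and the output host are illustrative):

```
[FILTER]
    # Throttle bursts: average ~1000 records per 1s interval over a 5-slot window
    Name     throttle
    Match    *
    Rate     1000
    Window   5
    Interval 1s

[OUTPUT]
    Name  http
    Match *
    Host  logging.example.com
    # Cap this output's on-disk backlog; oldest chunks are dropped past the limit
    storage.total_limit_size 2G
```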

Takeaways

  • Memory pressure isn’t the only problem – disk exhaustion can surface under the same imbalance.
  • A logging sidecar is effectively a mini distributed system; you must treat its pipeline (input → filter → output) as a whole.
  • Real‑time observability (memory, disk, ingestion/flush rates) is essential to catch imbalance before it kills the container.
  • Tuning one knob (e.g., Mem_Buf_Limit) without considering downstream effects can lead to data loss or new failures.

By iterating through the three experiments, exposing hidden metrics, and finally adding proper rate‑limiting and disk‑usage guards, we turned a flaky, crash‑prone sidecar into a reliable component of our logging pipeline.

Observations

  • Backpressure is physics: when you generate data faster than you can move it, pressure builds somewhere.
  • In our case, it escaped through disk exhaustion.
  • Logging is infrastructure. It deserves guardrails.
  • If you don’t design for bursts, bursts will design your outage.
  • Curious how you’d approach it differently.