The Seven Deadly Sins of MCP: Operational Sins

Published: 1 month ago (March 30, 2026 at 07:13 PM EDT)

8 min read

Source: Dev.to

Operational Sins: Sloth and Wrath

These sins belong in this category because they determine how a live MCP system behaves under stress: whether it fails truthfully, whether it recovers sanely, and whether operators can trust what they see during an outage.

Both appear when systems are stressed.

Sloth hides problems behind vague errors, weak validation, or sloppy transport handling.
Wrath takes a survivable problem and amplifies it through blind retries, reconnect storms, and forceful reactions to uncertainty.

Once access boundaries are tight, the next question is how the system behaves when things go wrong. That is where these two sins take over.

The operational layer is especially visible in MCP because transport and protocol behavior are part of the product. Stdio hygiene, reconnect behavior, resumability, notifications, and session handling are not side details once a model‑facing interface is live.

Sloth

Sloth is avoiding precise validation, specific errors, and basic operational hygiene. This sin rarely looks dramatic in a code review; it usually looks harmless:

A catch block hides detail.
A validation rule gets deferred.
A debug print goes to the wrong stream.

Nobody thinks they are making a dangerous choice—they think they are saving time. But MCP is unforgiving about sloppy boundaries.

How to spot it

Logs are vague, repetitive, or useless when something fails.
Operators keep seeing generic errors like “MCP error” instead of the real failure.
Stdio integrations break mysteriously because non‑protocol output leaked into stdout.
The team spends more time reproducing failures than fixing them.

Example

This regularly shows up in support and operations queues. A rep asks the assistant to fetch customer cus_1234 while a customer is waiting in chat, or an engineer asks for the latest incident by ID during triage. In that moment, bad input, not found, and dependency outage are three distinct situations with distinct next steps. If the tool collapses them into one vague failure, the user loses the context they need to respond correctly.

Before

server.tool("get_customer", async ({ id }) => {
  try {
    return await db.customers.findById(id);
  } catch {
    throw new Error("MCP error");
  }
});

After

class ToolError extends Error {
  constructor(
    public code: "invalid_input" | "not_found" | "dependency_unavailable",
    message: string,
    public retryable: boolean
  ) {
    super(message);
  }
}

server.tool("get_customer", async ({ id }) => {
  if (typeof id !== "string" || id.trim() === "") {
    throw new ToolError(
      "invalid_input",
      "id must be a non-empty string",
      false
    );
  }

  try {
    const customer = await db.customers.findById(id);
    if (!customer) {
      throw new ToolError("not_found", `customer ${id} not found`, false);
    }
    return customer;
  } catch (error) {
    console.error("get_customer failed", { id, error });
    if (error instanceof ToolError) {
      throw error;
    }

    throw new ToolError(
      "dependency_unavailable",
      "customer lookup is temporarily unavailable",
      true
    );
  }
});

Fix your transport hygiene

// Wrong for stdio servers
console.log("server started");

// Correct for stdio servers
console.error("server started");

Surface the failure honestly at the protocol edge

// `code` and `retryable` are part of this server's error contract,
// not fields that MCP invents automatically for you.
function toMcpErrorResult(error: ToolError) {
  return {
    isError: true,
    code: error.code,
    retryable: error.retryable,
    content: [{ type: "text", text: error.message }],
  };
}

How to fix it

The fix starts at the boundary. Validate inputs where the tool begins, not deeper in the call stack after the request has already become harder to reason about. When something does fail, preserve the real failure mode whenever you can. Operators need useful errors, not vague theatrical ones, and callers need to know the difference between a not‑found result and a broken dependency.

In practice, this usually means:

Stable error codes.
Clear human‑readable messages.
A separate place for internal diagnostic detail.
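One way to keep those three concerns apart is sketched below. The names `ToolFailure`, `toPublicError`, and `toLogRecord` are hypothetical and chosen for illustration; none of them come from MCP or an SDK. The idea is that the stable code and message travel to the caller, while the diagnostic detail only ever reaches logs.

```typescript
// Hypothetical error shape for illustration; not defined by MCP or any SDK.
type FailureCode = "invalid_input" | "not_found" | "dependency_unavailable";

class ToolFailure extends Error {
  readonly code: FailureCode; // stable, safe for callers to branch on
  readonly diagnostic: unknown; // internal detail, meant for logs only

  constructor(code: FailureCode, message: string, diagnostic?: unknown) {
    super(message); // human-readable, safe to show to the caller
    this.name = "ToolFailure";
    this.code = code;
    this.diagnostic = diagnostic;
  }
}

// What the caller (and the model) is allowed to see.
function toPublicError(failure: ToolFailure): { code: FailureCode; message: string } {
  return { code: failure.code, message: failure.message };
}

// What operators see. Intended for stderr or a log pipeline,
// never for the protocol stream.
function toLogRecord(failure: ToolFailure): Record<string, unknown> {
  const detail = failure.diagnostic;
  return {
    code: failure.code,
    message: failure.message,
    diagnostic:
      detail instanceof Error
        ? { name: detail.name, message: detail.message }
        : detail,
  };
}
```

Splitting the two projections makes it harder to leak a stack trace or connection string to a caller by accident, because the public shape has no field that could carry one.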

Treat operational hygiene as part of the contract. Keep protocol traffic separate from diagnostics, especially on stdio where stdout is data and stderr is logs. On remote HTTP transports, the equivalent discipline is session lifecycle, reconnect behavior, and resumability: if those are inconsistent, the system becomes hard to reason about even when the handlers themselves are correct.
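For stdio servers running on Node.js, one defensive option is to redirect the stdout-bound console methods to stderr at startup, so a forgotten debug print cannot interleave with protocol messages. This is a minimal sketch: `routeConsoleToStderr` and `ConsoleLike` are hypothetical names, and the guard does not catch code that writes to `process.stdout` directly.

```typescript
// Minimal console surface this sketch needs; the real `console` satisfies it.
interface ConsoleLike {
  log: (...args: unknown[]) => void;
  info: (...args: unknown[]) => void;
  debug: (...args: unknown[]) => void;
  error: (...args: unknown[]) => void;
}

// In Node.js, console.log/info/debug write to stdout and console.error
// writes to stderr. Rebinding the first three keeps stray diagnostics
// out of the stream that carries protocol data.
function routeConsoleToStderr(target: ConsoleLike): void {
  const toStderr = (...args: unknown[]): void => target.error(...args);
  target.log = toStderr;
  target.info = toStderr;
  target.debug = toStderr;
}

// At startup of a stdio server, before any other module logs:
// routeConsoleToStderr(console);
```

A guard like this is a safety net, not a substitute for logging to stderr deliberately; it mainly protects against third-party code and leftover debug statements.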

Add negative tests for malformed inputs, missing fields, downstream timeouts, and not‑found cases, then standardize the error shape across the server so that no tool invents its own private version of confusion. The important part is that typed failures survive translation through the MCP boundary instead of being collapsed into one generic error on the way out. MCP gives you the transport and result channel; the stable fields that make failures actionable still need to be part of your own server contract.
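Negative tests for the input boundary can be table-driven. The sketch below assumes a hypothetical `parseGetCustomerArgs` helper for the `get_customer` tool and covers only malformed inputs and missing fields; timeouts and not-found cases would need a fake dependency and are left out.

```typescript
// Hypothetical argument parser for the get_customer tool.
type ParseResult =
  | { ok: true; id: string }
  | { ok: false; code: "invalid_input"; message: string };

function parseGetCustomerArgs(raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, code: "invalid_input", message: "arguments must be an object" };
  }
  const id = (raw as Record<string, unknown>).id;
  if (typeof id !== "string" || id.trim() === "") {
    return { ok: false, code: "invalid_input", message: "id must be a non-empty string" };
  }
  return { ok: true, id: id.trim() };
}

// Table of malformed inputs. Each one must be rejected with the same
// stable code, whatever shape the bad input takes.
const malformedInputs: unknown[] = [
  undefined,
  null,
  "cus_1234", // a bare string instead of an arguments object
  {}, // missing field
  { id: 42 }, // wrong type
  { id: "   " }, // whitespace only
];

for (const input of malformedInputs) {
  const result = parseGetCustomerArgs(input);
  if (result.ok || result.code !== "invalid_input") {
    throw new Error(`expected invalid_input for ${JSON.stringify(input)}`);
  }
}
```

Returning a result object rather than throwing keeps the parser easy to test in isolation; the handler can still convert a failed parse into whatever typed error the server uses.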

Lessons from the Trenches

This pattern shows up in modelcontextprotocol/typescript-sdk #699, where a real tool exception was replaced by a misleading -32602 structured‑content error. Once a system starts lying about why it failed, every downstream debugging step becomes more expensive. A vague error is no longer a helpful signal—it becomes a liability.

Wrath

Wrath is a reaction to uncertainty or failure, expressed with force rather than control. You can usually hear wrath in design conversations before you see it in code:

“If it fails, retry.”
“If it is slow, poll faster.”
“If the stream disconnects, reconnect immediately.”

This is the operational version of losing your temper.

How to spot it

Retry loops or reconnect storms show up in logs during an outage.
A single failing dependency can suddenly cause duplicate requests, duplicate jobs, or repeated server starts.
Clients keep hammering an endpoint that is already degraded.
Timeout graphs and request‑volume graphs rise together.

Example

During a real outage, an internal app or assistant loses its MCP connection while someone is already in the middle of an incident response. The user experiences a single dropped connection or a spinner that hangs too long. Under the hood, an impatient client can turn that single interruption into repeated process starts, duplicate requests, and more load on a system that is already failing.

Before

async function ensureConnection(client: McpClient, serverCommand: string) {
  while (true) {
    try {
      await client.connect(new StdioTransport(serverCommand));
      return;
    } catch {
      await sleep(100);
    }
  }
}

After

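// Typed as Promise<void> so that calling resolve() with no argument type-checks.
function sleepWithAbort(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {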
    const timeout = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);

    const onAbort = () => {
      clearTimeout(timeout);
      signal.removeEventListener("abort", onAbort);
      reject(new Error("connection cancelled"));
    };

    if (signal.aborted) {
      onAbort();
      return;
    }

    signal.addEventListener("abort", onAbort, { once: true });
  });
}

async function ensureConnection(
  client: McpClient,
  serverCommand: string,
  abortSignal: AbortSignal
) {
  for (let attempt = 1; attempt <= 5; attempt += 1) {
    try {
      if (abortSignal.aborted) throw new Error("connection cancelled");
      await client.connect(new StdioTransport(serverCommand));
      return;
    } catch (error) {
      if (abortSignal.aborted) throw error;
      if (attempt === 5) throw error;

      const backoffMs = attempt * 1000 + Math.floor(Math.random() * 250);
      await sleepWithAbort(backoffMs, abortSignal);
    }
  }
}

How to fix it

Learn to stop. Put a hard upper bound on retries and add progressive back‑off with jitter so clients don’t all reconnect in lockstep.
Thread cancellation. Propagate an abort signal through every outbound request and long‑running operation so the system can stand down instead of escalating mindlessly.
Decide what is safe to retry.
- Idempotent reads and reconnect attempts are generally safe.
- Operations with side effects should avoid automatic retries unless you have an explicit idempotency key or another deduplication guard.
Make retries visible. Instrument and monitor retry counts, back‑off delays, and reconnect storms. If you’re not measuring them, you won’t spot a problem until production tells you the hard way.
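The first item can be made concrete as a pure delay function. The sketch below uses capped exponential backoff with full jitter, which is a different schedule from the linear one in the After code; `nextDelayMs` and `BackoffOptions` are hypothetical names, and the random source is injectable so the schedule can be tested deterministically.

```typescript
interface BackoffOptions {
  baseMs: number; // delay ceiling for the first retry
  maxMs: number; // hard cap so delays stop growing
  maxAttempts: number; // hard upper bound on total attempts
}

// Returns the delay before the next attempt, or null once the attempt
// budget is spent. `attempt` is the 1-based number of the attempt that
// just failed.
function nextDelayMs(
  attempt: number,
  options: BackoffOptions,
  random: () => number = Math.random
): number | null {
  if (attempt >= options.maxAttempts) return null; // learn to stop

  // The ceiling doubles with each failed attempt, up to the cap.
  const ceiling = Math.min(
    options.maxMs,
    options.baseMs * Math.pow(2, attempt - 1)
  );

  // Full jitter: pick uniformly in [0, ceiling) so that clients which
  // failed at the same moment do not all come back at the same moment.
  return Math.floor(random() * ceiling);
}
```

With `{ baseMs: 1000, maxMs: 8000, maxAttempts: 5 }`, the delay ceilings after attempts 1 through 4 are 1000, 2000, 4000, and 8000 ms, and the function returns `null` after the fifth failure so the caller gives up instead of looping.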

The same warning applies to managed edges and gateways. Throttling, proxy retries, and policy enforcement can mitigate damage, but they do not fix a backend operation that is non‑idempotent, vague about failure, or unsafe to repeat.
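Making a side-effecting operation safe to repeat usually means deduplicating on a caller-supplied key inside the backend itself. The sketch below is an in-memory illustration only: `IdempotencyGuard` and the refund scenario are hypothetical, and a real deployment would need shared storage with expiry.

```typescript
// In-memory deduplication keyed by a caller-supplied idempotency key.
class IdempotencyGuard<T> {
  private readonly results = new Map<string, T>();

  // Runs `operation` at most once per key and replays the stored result
  // for repeats. If T is a Promise, concurrent duplicates share one call.
  // Note that a rejected Promise would be replayed too; whether failures
  // should be cached is a policy decision this sketch does not make.
  runOnce(key: string, operation: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T;
    }
    const result = operation();
    this.results.set(key, result);
    return result;
  }
}

// Usage: a retried request with the same key does not repeat the side effect.
let refundsIssued = 0;
const issueRefund = (): string => {
  refundsIssued += 1;
  return `refund-${refundsIssued}`;
};

const guard = new IdempotencyGuard<string>();
const first = guard.runOnce("refund:order_42", issueRefund);
const retried = guard.runOnce("refund:order_42", issueRefund);
// `first` and `retried` are the same value, and refundsIssued is still 1.
```

The key has to come from the caller and stay the same across retries of one logical request; a key generated per attempt would defeat the guard.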

Lessons from the Trenches

modelcontextprotocol/inspector #293 – connecting caused repeated server starts.
modelcontextprotocol/inspector #723 – reconnect logic did not preserve enough state to resume safely.

The lesson is simple: retries are part of system design, not a band‑aid you slap on at the edge. If your retry policy is not explicit, you don’t really have one.
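An explicit policy can be as small as one decision function that every client path consults. The sketch below is illustrative: `RetryPolicy`, `FailedCall`, and `decideRetry` are hypothetical names, and the error codes mirror the ones used in the Sloth example.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  retryableCodes: readonly string[];
}

interface FailedCall {
  code: string; // e.g. "dependency_unavailable"
  attempt: number; // 1-based number of the attempt that just failed
  hasSideEffects: boolean; // true for writes and other non-idempotent work
  idempotencyKey?: string;
  cancelled: boolean; // true if the caller's abort signal has fired
}

type RetryDecision = { retry: true } | { retry: false; reason: string };

// Every refusal carries a reason, so the decision can be logged and counted.
function decideRetry(policy: RetryPolicy, call: FailedCall): RetryDecision {
  if (call.cancelled) {
    return { retry: false, reason: "cancelled" };
  }
  if (call.attempt >= policy.maxAttempts) {
    return { retry: false, reason: "attempt budget exhausted" };
  }
  if (policy.retryableCodes.indexOf(call.code) === -1) {
    return { retry: false, reason: `code ${call.code} is not retryable` };
  }
  if (call.hasSideEffects && !call.idempotencyKey) {
    return { retry: false, reason: "side effects without an idempotency key" };
  }
  return { retry: true };
}
```

Keeping the decision separate from the delay calculation and from the transport code means the policy can be reviewed, tested, and changed in one place.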

Why Operational Sins Are Hard to Fix

Operational sins usually demand shared infrastructure rather than isolated patches.

Sloth fixes often mean building a validation layer, an error policy, structured logging, and a test harness for unhappy paths.
Wrath fixes tend to reach across transport clients, job runners, background workers, and UI status handling. You may need:
- a retry helper,
- a back‑off policy,
- a cancellation model,
- idempotency protection, and
- dashboards that show retries and reconnects.

That work is easy to postpone because it doesn’t demo well. But once it exists, every future tool becomes cheaper to run, debug, and trust.
