The 4-test protocol that isolated a 9 ms Stripe SDK crash on Next 16

Published: June 14, 2026 at 05:56 AM EDT
7 min read
Source: Dev.to

The number that lied

Friday May 15, 4:13 PM. The Sentry alert pings on my phone. The first Phase 1 re-enrolling student is waiting in front of the payment screen, her name at the top of my tab. I put down the can and reopen the screen. On the next desk, the mug with Françoise’s face on it catches a yellow reflection that I notice without looking at it.

The stack trace fills the screen: nine fields out of ten at null, and a number I didn’t see coming.

type       = "StripeConnectionError"
message    = "An error occurred with our connection to Stripe."
code       = null
statusCode = null
requestId  = null
duration   = 9 ms

Nine milliseconds. On a Vercel route in Paris region, DNS resolves in forty ms, a TLS handshake costs one to two hundred. Nine milliseconds isn’t a network call that failed. It’s a network call that never happened. The SDK didn’t reach the wire.
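The reasoning above can be written down as a small check. This is a minimal sketch, not part of the original incident tooling: the FailureReport shape and the COLD_CALL_FLOOR_MS threshold are assumptions, the latter built from the DNS and TLS figures quoted in the text.

```typescript
// Hypothetical shape for the fields read off the error report above.
interface FailureReport {
  type: string;
  statusCode: number | null;
  requestId: string | null;
  durationMs: number;
}

// Assumed floor for a cold HTTPS call, using the figures from the text:
// about 40 ms of DNS plus at least 100 ms of TLS handshake.
const COLD_CALL_FLOOR_MS = 40 + 100;

// True when the failure carries no trace of a server response and came back
// faster than a cold network round trip could.
function neverReachedWire(report: FailureReport): boolean {
  const noServerTrace = report.statusCode === null && report.requestId === null;
  return noServerTrace && report.durationMs < COLD_CALL_FLOOR_MS;
}

const incident: FailureReport = {
  type: "StripeConnectionError",
  statusCode: null,
  requestId: null,
  durationMs: 9,
};

console.log(neverReachedWire(incident)); // true
```

The threshold only makes sense for a cold call. A reused keep-alive connection can answer faster than the floor, so the check is a hint about where to look, not proof.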

Instinct immediately offers three patches. Vercel serverless timeout — I add maxDuration, redeploy. Revoked key — I’ll rotate it. Stripe account restricted after the live switch — I open a support ticket. These three hypotheses are plausible. None of the three is falsifiable from the symptom alone, and that’s precisely what makes them dangerous: each opens a fifteen-to-thirty-minute cycle with rollback at the end if it’s wrong. Multiplied by three, half a day lost with the customer still clicking.

I don’t have time. A student is waiting.

Four tests, in order

I know the incident class — “preview works, prod breaks”, or its mirror. The rule for this class is that you fix nothing until you’ve discriminated the layers. Four tests, executed in order. Each eliminates a family of hypotheses, not an isolated hypothesis. And each is designed to refute what it interrogates — because a test that seeks to confirm always finds, by selection, what it’s looking for.

Test 1: reproduce in the witness environment. I rerun the same funnel in preview, with the sk_test_ key. Checkout opens in three hundred fourteen milliseconds, clean. Immediate consequence: it’s not the application code. The code is strictly identical between preview and prod; only the environment variables, the Vercel plan on that region, and the Stripe key vary. Three variables only, and suspicion is already narrowing to what differs in prod.

Test 2: minimal endpoint. I deploy a Vercel route with one useful line, the nodejs runtime explicitly forced, which calls stripe.balance.retrieve(): the most stripped-down SDK call possible, with no line_items, no metadata, no idempotencyKey, none of the Checkout’s business complexity. In preview: two hundred milliseconds, success. In prod: nine milliseconds, the same StripeConnectionError. Consequence: the problem isn’t in the Checkout parameters, nor in business logic gone sideways. The SDK itself fails on the simplest possible call.
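A minimal endpoint of this kind mostly needs a timing wrapper that never lets the error escape. The sketch below is an assumption about how such a probe could look, not the route actually deployed: ProbeReport, probe and errorTypeOf are illustrative names, and reading a string type field off the thrown error mirrors the type field shown in the stack trace.

```typescript
// Hypothetical report shape; field names are illustrative.
interface ProbeReport {
  ok: boolean;
  durationMs: number;
  errorType: string | null;
}

// Prefers a string `type` field (as in the report above), then the error name.
function errorTypeOf(err: unknown): string {
  if (typeof err === "object" && err !== null) {
    const type = (err as { type?: unknown }).type;
    if (typeof type === "string") return type;
    if (err instanceof Error) return err.name;
  }
  return typeof err;
}

// Times a single call and reports the outcome instead of throwing,
// so the route always answers with something readable.
async function probe(call: () => Promise<unknown>): Promise<ProbeReport> {
  const start = Date.now();
  try {
    await call();
    return { ok: true, durationMs: Date.now() - start, errorType: null };
  } catch (err) {
    return { ok: false, durationMs: Date.now() - start, errorType: errorTypeOf(err) };
  }
}

// In the route, the call would be the single SDK line from the text:
//   const report = await probe(() => stripe.balance.retrieve());
```

Returning the duration alongside the error type is what makes the preview and prod runs comparable at a glance.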

Test 3: bypass the suspect dependency. Instead of calling the SDK, I fetch https://api.stripe.com/v1/balance directly, with the header Authorization: Bearer sk_live_…. In prod, on the same Vercel route: 200 OK, three hundred fourteen milliseconds, payload confirming livemode: true. Consequence, and it’s the most precious one: the Vercel→Stripe network path works. It’s strictly the SDK that never gets through to the network layer. Neither Vercel, nor Cloudflare upstream, nor Stripe downstream is at fault.
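The bypass can be sketched as a function that takes the fetch implementation as a parameter, which keeps it testable with a stub. FetchLike, BypassResult and bypassBalance are hypothetical names; the URL, the Authorization header and the livemode field are the ones quoted above.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a stub.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ status: number; json(): Promise<unknown> }>;

interface BypassResult {
  status: number;
  livemode: boolean | null;
  durationMs: number;
}

// Calls GET /v1/balance directly, skipping the SDK entirely.
async function bypassBalance(secretKey: string, doFetch: FetchLike): Promise<BypassResult> {
  const start = Date.now();
  const res = await doFetch("https://api.stripe.com/v1/balance", {
    method: "GET",
    headers: { Authorization: `Bearer ${secretKey}` },
  });
  const body = (await res.json()) as { livemode?: unknown };
  return {
    status: res.status,
    livemode: typeof body.livemode === "boolean" ? body.livemode : null,
    durationMs: Date.now() - start,
  };
}

// Inside the route, the runtime's global fetch would be passed in:
//   const result = await bypassBalance(liveKey, fetch);
```

If this succeeds in the environment where the SDK fails, network, key and account are all cleared in one request.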

Niran walks behind my shoulder at that moment, reads the curl output on the terminal. He says three words, “it’s not the network”, and walks back to his desk without elaborating. Economy of gesture.

Test 4: read the source at the exact error point. The stack trace points to node_modules/stripe/esm/RequestSender.js:400:41. I open the file in the deployed Vercel repo. Line four hundred is the .catch(error) on the internal HTTP client’s promise. The SDK was waiting for a response from its own internal client, and that client rejected immediately, before even issuing a request. I go back up to the library’s package.json:

"exports": {
  "worker": {
    "import": "./esm/stripe.esm.worker.js",
    "require": "./cjs/stripe.cjs.worker.js"
  },
  "default": {
    "import": {
      "default": "./esm/stripe.esm.node.js"
    }
  }
}

Here’s what was happening. The package.json of stripe@^22 declares a conditional "worker" export aimed at Cloudflare Workers environments. The Next 16 bundler, despite export const runtime = 'nodejs' being explicitly declared at the top of the route, resolves this "worker" condition when bundling Server Actions in production. The bundle then loads stripe.esm.worker.js, an SDK variant that relies on the Worker runtime’s standard fetch and doesn’t include the native Node HTTP client. Executed on Vercel’s Node runtime, this variant fails silently while initialising its HTTP client, probably because of a Cloudflare-specific feature absent from Vercel’s runtime, and the promise of the very first request rejects within the next millisecond.
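To see why the set of active conditions decides which file gets loaded, here is a simplified model of conditional-exports resolution applied to the excerpt quoted above. It is a sketch of the general mechanism (first matching key in declaration order wins), not the Next bundler’s actual code; resolveExport and ExportTarget are illustrative names.

```typescript
type ExportTarget = string | { [condition: string]: ExportTarget };

// Walks the keys in declaration order and takes the first one that is
// "default" or present in the active condition list.
function resolveExport(target: ExportTarget, conditions: string[]): string | undefined {
  if (typeof target === "string") return target;
  for (const key of Object.keys(target)) {
    if (key !== "default" && conditions.indexOf(key) === -1) continue;
    const value = target[key];
    if (value === undefined) continue;
    const resolved = resolveExport(value, conditions);
    if (resolved !== undefined) return resolved;
  }
  return undefined;
}

// The excerpt quoted in the article.
const stripeExports: ExportTarget = {
  worker: {
    import: "./esm/stripe.esm.worker.js",
    require: "./cjs/stripe.cjs.worker.js",
  },
  default: {
    import: { default: "./esm/stripe.esm.node.js" },
  },
};

console.log(resolveExport(stripeExports, ["node", "import"]));
// ./esm/stripe.esm.node.js
console.log(resolveExport(stripeExports, ["worker", "import"]));
// ./esm/stripe.esm.worker.js
```

Because "worker" is declared before "default", any resolver that has "worker" among its active conditions never reaches the Node entry point, whatever the route declares about its runtime.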

The hypothesis isn’t a hundred percent confirmed. But it’s consistent with the three material facts accumulated: the prod/preview gap that depends on bundle context, the synchronous nine-millisecond failure without network, and the total absence of requestId because no request was ever issued.

The workaround written, then the ROI counted

In twenty minutes, the diagnostic holds. Forty more minutes, and the helper lib/stripe-fetch.ts is in production on six surfaces — Checkout Sessions, retrieve PaymentIntent, retrieve BalanceTransaction, create off_session PaymentIntent, retrieve Checkout Session, and billing Payment Links.

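// lib/stripe-fetch.ts
// getKey, encodeParams and parseStripeResponse are helpers defined elsewhere in this file.
// The type arguments were lost in extraction; the ones below are an assumed reconstruction.
export async function stripePost<T>(
  path: string,
  params: Record<string, unknown>,
  options?: { idempotencyKey?: string },
): Promise<T> {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${getKey()}`,
    'Content-Type': 'application/x-www-form-urlencoded',
  }
  // Forward the idempotency key so a retried POST is not executed twice.
  if (options?.idempotencyKey) headers['Idempotency-Key'] = options.idempotencyKey
  const res = await fetch(`https://api.stripe.com/v1/${path}`, {
    method: 'POST',
    headers,
    body: encodeParams(params),
  })
  return parseStripeResponse(res)
}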

// app/inscription/actions.ts::finaliserReinscription (excerpt)
const stripeRes = await fetch('https://api.stripe.com/v1/checkout/sessions', {
  method: 'POST',
  headers: { Authorization: `Bearer ${stripeKey}`, 'Content-Type': 'application/x-www-form-urlencoded' },
  body: encodeParams(checkoutParams),
})
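The helper leans on encodeParams, which the excerpt does not show. A hypothetical version is sketched below, assuming the bracketed key convention that Stripe’s form-encoded API uses for nested parameters (for example line_items[0][quantity]); the author’s real implementation may differ.

```typescript
// Hypothetical encodeParams: flattens nested objects and arrays into
// bracketed keys, then percent-encodes each key and value.
type ParamValue =
  | string
  | number
  | boolean
  | null
  | undefined
  | ParamValue[]
  | { [key: string]: ParamValue };

function encodeParams(params: { [key: string]: ParamValue }): string {
  const pairs: string[] = [];
  const walk = (prefix: string, value: ParamValue): void => {
    if (value === null || value === undefined) return; // unset values are omitted
    if (Array.isArray(value)) {
      value.forEach((item, i) => walk(`${prefix}[${i}]`, item));
    } else if (typeof value === "object") {
      for (const key of Object.keys(value)) walk(`${prefix}[${key}]`, value[key]);
    } else {
      pairs.push(`${encodeURIComponent(prefix)}=${encodeURIComponent(String(value))}`);
    }
  };
  for (const key of Object.keys(params)) walk(key, params[key]);
  return pairs.join("&");
}

console.log(encodeParams({ mode: "payment", line_items: [{ price: "price_123", quantity: 1 }] }));
// mode=payment&line_items%5B0%5D%5Bprice%5D=price_123&line_items%5B0%5D%5Bquantity%5D=1
```

The brackets come out percent-encoded (%5B and %5D), which is the ordinary form-encoding of the same keys.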

At 5:35 PM, I rerun the funnel in prod with a fake card: the Checkout session opens, livemode confirmed, with card, Link and Google Pay as payment methods. The 4:13 PM customer receives the apology email and the new link within the next minute. Phase 2, on Monday May 18 with sixty-five returning students to chase, is unblocked.

Had I started by patching the timeout, I would have redeployed, waited five minutes, retested, observed the failure, removed the patch, waited five more minutes: a twenty-minute cycle. Add the key rotation — fifteen minutes to generate, propagate, wait for Vercel cache invalidation. And the Stripe support ticket: between two and forty-eight opaque hours, while production bleeds. Compared to these three patches, the protocol holds in under thirty minutes and lands on the true cause — not on a neighbour of the true cause.
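Spelled out as arithmetic, with the durations taken from the paragraph above. The sketch assumes the three patches are tried one after the other and that each one fails; the variable names are illustrative.

```typescript
// Durations in minutes, as quoted in the text.
const timeoutPatch = 20;
const keyRotation = 15;
const supportTicket = { min: 2 * 60, max: 48 * 60 };
const protocolUpperBound = 30; // "under thirty minutes"

// Total cost of trying the three patches in sequence.
const patchesBestCase = timeoutPatch + keyRotation + supportTicket.min;
const patchesWorstCase = timeoutPatch + keyRotation + supportTicket.max;

console.log(patchesBestCase, patchesWorstCase, protocolUpperBound); // 155 2915 30
```

Even in the best case the patch route costs more than five times the protocol, and it ends without a diagnosis.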

Generalisation, soberly

The protocol holds for any incident of the class “same code behaves differently across environments”. Trigger symptoms I now keep top of mind: StripeConnectionError, ECONNREFUSED or ETIMEDOUT at runtime but not at build, Module not found that only appears in prod, or worse, a silent try / catch that returns a misleading fallback and makes you think the main branch succeeded. Four tests, in the same order. Witness, minimal, bypass, source.
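The order matters because each result decides where to look next. One possible way to encode that, as a sketch: the type names and the suspect labels are assumptions, and the fourth test (reading the source) is the action the last branch points to, not an input.

```typescript
type Outcome = "pass" | "fail";

interface ProtocolResults {
  witness: Outcome; // Test 1: same funnel in the environment that works
  minimal: Outcome; // Test 2: simplest call through the dependency, in the failing environment
  bypass: Outcome;  // Test 3: same call without the dependency, in the failing environment
}

type Suspect = "application-code" | "call-parameters" | "infrastructure" | "dependency";

// Each branch eliminates a family of hypotheses before the next test is read.
function nextSuspect(r: ProtocolResults): Suspect {
  if (r.witness === "fail") return "application-code"; // fails even where it should work
  if (r.minimal === "pass") return "call-parameters";  // the dependency works on a trivial call
  if (r.bypass === "fail") return "infrastructure";    // fails even without the dependency
  return "dependency";                                  // Test 4: read its source at the error point
}

// The incident described in this article.
console.log(nextSuspect({ witness: "pass", minimal: "fail", bypass: "pass" })); // dependency
```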

The protocol does not hold for business bugs — a wrong SQL query, a miscalibrated if, an application logic that returns the wrong result. There the cause is in the code you wrote, and a targeted grep finds it, not a layer discrimination.

Coda

You don’t fix a firing defect by looking at the piece. You look at the kiln’s curve, the gas supply, the chimney draught. The application code is the piece: it comes out as you shaped it. The four tests interrogate the kiln. Each one switches off a candidate lamp until only one remains, which is the right one. Thirty minutes instead of half a day, and above all the certainty of having patched where it had to be patched, not in a flattering neighbourhood that lets the real bug sleep until the next incident.

The 4-test protocol is the applied instance of the Counterpart Toolkit’s R4 Falsify before fix, on the incident class “environment bug”. The general rule asks for three probes designed to refute; this class deserves four, in a fixed order. That’s all. But on the day production bleeds, that little is worth the half day it saves you.

Counterpart Toolkit v0.7, R4 Falsify before fix. Canonical reference: github.com/michelfaure/doctrine-counterpart. Scenes recomposed, names calibrated on the recurring cast cards of the series.
