Our Videos Silently Failed for a Week — How a Stale Env Var Cost Us $60 and 12 Unhappy Users

Published: March 9, 2026 at 06:01 AM EDT
7 min read
Source: Dev.to

TL;DR

  • What happened? Fal.ai credits were being spent, but no videos were delivered because the Remotion Lambda function name was stale.
  • Why? A manual step in the deploy script failed to update the REMOTION_LAMBDA_FUNCTION_NAME environment variable after upgrading Remotion.
  • How was it fixed? A targeted retry script re‑rendered the already‑saved assets, and the deploy script was automated to keep the env var in sync.
  • What now? An hourly Inngest cron monitors for projects stuck in rendering and alerts us before billing spikes again.

1. The symptom

“You know that sinking feeling when you check your billing dashboard and something doesn’t add up?”

RepoClip – my AI video‑generation SaaS – was burning ≈ $20 / day on Fal.ai credits for three consecutive days (Mar 7‑9).
The first guess was “more users = more videos”, but the videos never reached any user.

2. RepoClip video pipeline

GitHub URL → Gemini Analysis → Kling Video Clips (fal.ai) → Remotion Lambda Render → Done
  • Each “Video Short” → 5 AI clips (Kling 3.0 Pro) → stitched with narration (Remotion on AWS Lambda).

The pipeline is orchestrated by Inngest, which memoizes each step so retries don’t redo completed work.
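To make the memoization point concrete, here is a minimal in-memory sketch of the idea behind Inngest's `step.run` (hypothetical and simplified — real Inngest persists step results server-side and replays the function on retry):

```typescript
// Sketch of step memoization: a step that already succeeded returns its
// cached result on retry instead of executing again, so expensive work
// (like fal.ai clip generation) only runs once.
type StepCache = Map<string, unknown>;

async function runStep<T>(
  cache: StepCache,
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  if (cache.has(name)) {
    // Retry path: skip execution, reuse the stored result.
    return cache.get(name) as T;
  }
  const result = await fn();
  cache.set(name, result);
  return result;
}
```

This is why our retries never double-charged us on Fal.ai — and also why they could fail silently at the render step forever.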

Relevant steps (simplified)

| Step | Description |
| --- | --- |
| 1 | Fetch GitHub code |
| 2 | Analyze with Gemini |
| 3 | Generate video clips (fal.ai) ← money spent here |
| 4 | Trigger Remotion Lambda render ← failing |
| 5 | Poll for render completion |
| 6 | Update project status to completed |

3. Billing clue

Fal.ai dashboard showed $20 / day on Mar 7, 8, 9 → roughly 2–4 video generations per day, which seemed plausible for organic traffic.

4. Database query

SELECT
  created_at::date AS date,
  status,
  COUNT(*) AS count
FROM projects
WHERE created_at >= '2026-03-01'
GROUP BY date, status
ORDER BY date;

Result

| Date | Status | Count |
| --- | --- | --- |
| Mar 1 | completed | 3 |
| Mar 1 | failed | 5 |
| Mar 1 | rendering | 1 |
| Mar 2 | rendering | 2 |
| Mar 5 | rendering | 1 |
| Mar 6 | rendering | 2 |
| Mar 7 | rendering | 2 |
| Mar 8 | rendering | 4 |

After March 1, no project ever reached completed. Every video was stuck in rendering – the step right after Fal.ai finished generating clips.
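The pattern is easy enough to scan for programmatically. A hypothetical helper (names are illustrative, not from our codebase) that takes rows shaped like the query result above and flags dates where projects entered rendering but none completed:

```typescript
// Hypothetical check over the status-count rows above: flag any date with
// projects in "rendering" but zero reaching "completed".
interface StatusRow {
  date: string;
  status: "completed" | "failed" | "rendering";
  count: number;
}

function datesWithNoCompletions(rows: StatusRow[]): string[] {
  const byDate = new Map<string, { rendering: number; completed: number }>();
  for (const row of rows) {
    const entry = byDate.get(row.date) ?? { rendering: 0, completed: 0 };
    if (row.status === "rendering") entry.rendering += row.count;
    if (row.status === "completed") entry.completed += row.count;
    byDate.set(row.date, entry);
  }
  return [...byDate.entries()]
    .filter(([, e]) => e.rendering > 0 && e.completed === 0)
    .map(([date]) => date);
}
```

Run against the table above, every date after March 1 gets flagged.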

5. The root cause: stale Lambda function name

Environment variable (what we thought)

REMOTION_LAMBDA_FUNCTION_NAME=remotion-render-4-0-414-mem2048mb-disk2048mb-600sec

Actual Lambda functions in AWS

aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `remotion`)]'

Output:

remotion-render-4-0-429-mem2048mb-disk2048mb-600sec
remotion-render-4-0-429-mem3008mb-disk4096mb-900sec

The function remotion-render-4-0-414… no longer existed.
We had upgraded Remotion from v4.0.414 → v4.0.429, deployed new Lambdas, deleted the old ones, but forgot to update the env var on Vercel.

Consequences

  • renderMediaOnLambda() threw ResourceNotFoundException.
  • Inngest retried silently; no client‑side error, no GA4 “video_generate_complete” event, no Sentry alert.
  • Billing continued because Fal.ai still generated clips.
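Since the function name embeds the Remotion version, a boot-time sanity check could have caught the drift immediately. A sketch (hypothetical helper; `lambdaNameMatchesVersion` is not part of any library):

```typescript
// The Remotion Lambda function name encodes the deployed version, e.g.
// "remotion-render-4-0-429-...". Comparing it against the installed
// remotion package version at startup would have surfaced our stale env var
// as a loud config error instead of a silent retry loop.
function lambdaNameMatchesVersion(
  functionName: string,
  pkgVersion: string
): boolean {
  const match = functionName.match(/^remotion-render-(\d+)-(\d+)-(\d+)-/);
  if (!match) return false;
  const encoded = `${match[1]}.${match[2]}.${match[3]}`;
  return encoded === pkgVersion;
}
```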

6. Recovery – targeted retry script

All 12 stuck projects already had their assets persisted (assets JSONB column).
We avoided re‑generating the expensive Fal.ai clips and simply re‑rendered the saved assets.

// The key insight: assets are already saved, just re‑render
const { renderId, bucketName } = await renderMediaOnLambda({
  region: REGION,
  functionName: FUNCTION_NAME, // now pointing to the correct function
  serveUrl: SERVE_URL,
  composition: "ProductVideo",
  inputProps, // built from saved assets
  codec: "h264",
  // …
});

Result: 12/12 videos recovered (9 on first attempt, 3 after transient network timeouts).
Zero additional Fal.ai charges; users received completion emails.
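The three transient timeouts were handled with a plain retry loop. A generic sketch of the helper (our actual retry script was ad hoc; this is an illustrative version):

```typescript
// Retry a promise-returning function a few times with a fixed delay.
// Transient network timeouts during Lambda renders usually succeed on a
// simple re-attempt.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 2000
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```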

7. Automation – never forget to update the env var again

Old manual step

echo "Set the following environment variables:"
echo "  REMOTION_LAMBDA_FUNCTION_NAME="

New automated step

# Extract function name from deploy output
FUNC_NAME=$(echo "$FUNC_OUTPUT" | grep -oE 'remotion-render-[a-zA-Z0-9-]+' | head -1)

# Verify function exists
aws lambda get-function --function-name "$FUNC_NAME" --region "$REGION"

# Auto‑update Vercel + local env
npx vercel env rm REMOTION_LAMBDA_FUNCTION_NAME production -y
echo -n "$FUNC_NAME" | npx vercel env add REMOTION_LAMBDA_FUNCTION_NAME production
sed -i '' "s|^REMOTION_LAMBDA_FUNCTION_NAME=.*|REMOTION_LAMBDA_FUNCTION_NAME=$FUNC_NAME|" .env.local

Now the deploy script extracts the new Lambda name, verifies it, and updates Vercel and the local .env automatically.
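The extraction step is just a regex over the deploy output; the same pattern the `grep` uses, expressed in TypeScript (illustrative helper, not from the deploy script itself):

```typescript
// Matches the first token that looks like a Remotion render function name,
// e.g. "remotion-render-4-0-429-mem2048mb-disk2048mb-600sec".
function extractRemotionFunctionName(deployOutput: string): string | null {
  const match = deployOutput.match(/remotion-render-[a-zA-Z0-9-]+/);
  return match ? match[0] : null;
}
```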

8. Ongoing monitoring – cron job for stuck renders

export const monitorStuckRendersFunction = inngest.createFunction(
  { id: "monitor-stuck-renders" },
  { cron: "0 * * * *" }, // every hour
  async ({ step }) => {
    const stuckProjects = await step.run("check-stuck-projects", async () => {
      const threshold = new Date(Date.now() - 30 * 60 * 1000).toISOString();
      const { data } = await supabase
        .from("projects")
        .select("*")
        .eq("status", "rendering")
        .lt("updated_at", threshold);
      return data;
    });

    if (stuckProjects?.length) {
      // Notify Slack / email / create issue
      await step.run("alert", async () => {
        // …implementation…
      });
    }
  }
);
  • What it does: Every hour it fetches projects in rendering for > 30 min and alerts the team.
  • Why: Prevents silent billing spikes and gives us a safety net for future regressions.
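The staleness cutoff inside the cron is simple date math; pulled out as a pure helper (illustrative refactor, not how the cron is actually structured):

```typescript
// A project still in "rendering" whose updated_at is older than the
// threshold counts as stuck.
function isStuck(
  updatedAt: string,
  now: Date,
  thresholdMinutes = 30
): boolean {
  const ageMs = now.getTime() - new Date(updatedAt).getTime();
  return ageMs > thresholdMinutes * 60 * 1000;
}
```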

9. Takeaways

| ✅ What worked | ❌ What failed |
| --- | --- |
| Assets persisted before render → cheap recovery | Manual env‑var update step was missed |
| Inngest memoization prevented duplicate Fal.ai charges | Silent retries hid the ResourceNotFoundException |
| Billing anomaly triggered investigation | No GA4 “complete” event → missing visibility |
| Automated deploy script now guarantees env‑var sync | No prior monitoring for stuck renders |

Bottom line: A tiny manual step caused a $60‑plus waste and a poor user experience. By persisting intermediate assets, automating environment updates, and adding proactive monitoring, we turned a costly outage into a learning opportunity. 🚀

Monitoring Stuck Projects

// Example query to find projects stuck in the “rendering” state
const threshold = new Date(Date.now() - 30 * 60 * 1000).toISOString();

const { data: stuckProjects } = await supabase
  .from("projects")
  .select("id, repo_name, content_mode, updated_at")
  .eq("status", "rendering")
  .lt("updated_at", threshold);

// If any projects are stuck, send an alert email with their details
if (stuckProjects && stuckProjects.length > 0) {
  // Send alert email with project details
}

If this had existed a week ago, we’d have known within an hour instead of seven days.

Real‑time Status Events

We added two events that fire when the user’s browser receives a status change via Supabase Realtime:

// ProjectStatusListener.tsx
const channel = supabase
  .channel(`project-${projectId}`)
  .on(
    "postgres_changes",
    { /* … */ },
    (payload) => {
      if (payload.new?.status === "completed") {
        gaEvent("video_generate_complete", { project_id: projectId });
      } else if (payload.new?.status === "failed") {
        gaEvent("video_generate_fail", { project_id: projectId });
      }
    }
  )
  .subscribe();

Funnel Query in BigQuery

-- Start‑to‑complete ratio per day
SELECT
  event_date,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_start'
    THEN user_pseudo_id END) AS start_users,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_complete'
    THEN user_pseudo_id END) AS complete_users
FROM `events_*`
GROUP BY event_date;

A sudden drop in the complete/start ratio now appears as a clear signal.
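The alert itself can be a trivial pass over the query output. A hypothetical check (names are illustrative) that flags days where the complete/start ratio drops below a floor:

```typescript
// Flag any day where the complete/start ratio falls below a floor (e.g. 50%).
// A week of stuck renders shows up here as a string of flagged days.
interface FunnelRow {
  event_date: string;
  start_users: number;
  complete_users: number;
}

function daysBelowRatio(rows: FunnelRow[], floor = 0.5): string[] {
  return rows
    .filter((r) => r.start_users > 0 && r.complete_users / r.start_users < floor)
    .map((r) => r.event_date);
}
```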

Cost Discovery

While investigating, we realized every free‑tier user was receiving the same “Kling 3.0 Pro” clips as paying customers.

  • Cost per video (Pro): ≈ $5.60
  • Conversion rate: ≈ 3 %
  • Result: Unsustainable customer‑acquisition cost

The Fix

| Plan | Clip Type | Clips | Approx. Length | Cost per Video |
| --- | --- | --- | --- | --- |
| Free | Kling 3.0 Standard | 3 | ~15 s | $2.52 |
| Paid | Kling 3.0 Pro | 5 | ~25 s | $5.60 |
  • Turns “Kling 3.0 Pro quality” into a tangible upgrade incentive.
  • Cuts free‑tier video costs by 55 %.
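The arithmetic behind those numbers, using the figures from the tables above (the CAC estimate assumes every free video costs the full Pro price):

```typescript
// Figures from the cost tables above.
const proCostPerVideo = 5.6;       // 5 clips (Kling 3.0 Pro)
const standardCostPerVideo = 2.52; // 3 clips (Kling 3.0 Standard)
const conversionRate = 0.03;       // ≈ 3% free → paid

// Cost to acquire one paying customer when free users get Pro clips:
const cacBefore = proCostPerVideo / conversionRate; // ≈ $186.67 per conversion

// Free-tier cost reduction from moving free users to Standard clips:
const savings = 1 - standardCostPerVideo / proCostPerVideo; // = 0.55 → 55%
```

At roughly $187 of video-generation spend per paid conversion, the old setup could not pay for itself.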

Lessons Learned

  1. Env vars are a silent single point of failure – automate their lifecycle.
  2. Background‑job failures are invisible by default – add explicit monitoring for “things that should have finished but didn’t.”
  3. Track completion, not just initiation – the absence of video_generate_complete data is a critical signal.
  4. Persist intermediate results – allowed recovery without extra Fal.ai charges.
  5. Billing anomalies are monitoring signals – set up alerts on unexpected spend patterns.

Try RepoClip

RepoClip generates AI‑powered promotional videos from GitHub repositories.
Paste any public repo URL and get a video in minutes — free, no credit card required.
