Our Videos Silently Failed for a Week — How a Stale Env Var Cost Us $60 and 12 Unhappy Users
Source: Dev.to
TL;DR
- What happened? Fal.ai credits were being spent, but no videos were delivered because the Remotion Lambda function name was stale.
- Why? A manual step in the deploy script failed to update the `REMOTION_LAMBDA_FUNCTION_NAME` environment variable after upgrading Remotion.
- How was it fixed? A targeted retry script re‑rendered the already‑saved assets, and the deploy script was automated to keep the env var in sync.
- What now? An hourly Inngest cron monitors for projects stuck in `rendering` and alerts us before billing spikes again.
1. The symptom
“You know that sinking feeling when you check your billing dashboard and something doesn’t add up?”
RepoClip – my AI video‑generation SaaS – was burning ≈ $20 / day on Fal.ai credits for three consecutive days (Mar 7‑9).
The first guess was “more users = more videos”, but the videos never reached any user.
2. RepoClip video pipeline
GitHub URL → Gemini Analysis → Kling Video Clips (fal.ai) → Remotion Lambda Render → Done
- Each “Video Short” → 5 AI clips (Kling 3.0 Pro) → stitched with narration (Remotion on AWS Lambda).
The pipeline is orchestrated by Inngest, which memoizes each step so retries don’t redo completed work.
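That memoization is what kept this incident from being even more expensive. The idea can be sketched in a few lines — an in-memory toy for illustration, not Inngest's actual implementation, which persists step state durably outside the process:

```typescript
// Toy step memoization: results are cached by step id, so a retried
// run replays completed steps instead of re-executing (and re-billing) them.
type StepFn<T> = () => Promise<T>;

class MemoizedRun {
  private cache = new Map<string, unknown>();

  async run<T>(id: string, fn: StepFn<T>): Promise<T> {
    if (this.cache.has(id)) {
      return this.cache.get(id) as T; // replay cached result, no side effects
    }
    const result = await fn();
    this.cache.set(id, result); // persist before moving to the next step
    return result;
  }
}
```

This is why a retry that fails at step 4 (the Lambda render) never re-runs step 3 (the Fal.ai generation): the clip results are already cached under their step id.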
Relevant steps (simplified)
| Step | Description |
|---|---|
| 1 | Fetch GitHub code |
| 2 | Analyze with Gemini |
| 3 | Generate video clips (fal.ai) ← money spent here |
| 4 | Trigger Remotion Lambda render ← failing |
| 5 | Poll for render completion |
| 6 | Update project status to completed |
3. Billing clue
Fal.ai dashboard showed $20 / day on Mar 7, 8, 9 → roughly 2–4 video generations per day, which seemed plausible for organic traffic.
4. Database query
```sql
SELECT
  created_at::date AS date,
  status,
  COUNT(*) AS count
FROM projects
WHERE created_at >= '2026-03-01'
GROUP BY date, status
ORDER BY date;
```
Result
| Date | Status | Count |
|---|---|---|
| Mar 1 | completed | 3 |
| Mar 1 | failed | 5 |
| Mar 1 | rendering | 1 |
| Mar 2 | rendering | 2 |
| Mar 5 | rendering | 1 |
| Mar 6 | rendering | 2 |
| Mar 7 | rendering | 2 |
| Mar 8 | rendering | 4 |
After March 1, no project ever reached `completed`. Every video was stuck in `rendering` – the step right after Fal.ai finished generating clips.
5. The root cause: stale Lambda function name
Environment variable (what we thought)
```
REMOTION_LAMBDA_FUNCTION_NAME=remotion-render-4-0-414-mem2048mb-disk2048mb-600sec
```
Actual Lambda functions in AWS
```shell
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `remotion`)]'
```
Output:
```
remotion-render-4-0-429-mem2048mb-disk2048mb-600sec
remotion-render-4-0-429-mem3008mb-disk4096mb-900sec
```
The function remotion-render-4-0-414… no longer existed.
We had upgraded Remotion from v4.0.414 → v4.0.429, deployed new Lambdas, deleted the old ones, but forgot to update the env var on Vercel.
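Remotion encodes its version in the Lambda function name, so a cheap guard would have caught this at boot instead of at render time. A sketch — `extractRemotionVersion` and `assertVersionMatch` are hypothetical helpers, not part of Remotion's API:

```typescript
// Remotion Lambda names embed the version, e.g.
// "remotion-render-4-0-429-mem2048mb-disk2048mb-600sec" → "4.0.429".
function extractRemotionVersion(functionName: string): string | null {
  const match = functionName.match(/^remotion-render-(\d+)-(\d+)-(\d+)-/);
  return match ? `${match[1]}.${match[2]}.${match[3]}` : null;
}

// Fail fast at startup instead of throwing ResourceNotFoundException mid-pipeline.
function assertVersionMatch(functionName: string, installedVersion: string): void {
  const deployed = extractRemotionVersion(functionName);
  if (deployed !== installedVersion) {
    throw new Error(
      `REMOTION_LAMBDA_FUNCTION_NAME points at Remotion ${deployed}, ` +
        `but ${installedVersion} is installed`
    );
  }
}
```

Calling something like `assertVersionMatch(process.env.REMOTION_LAMBDA_FUNCTION_NAME!, remotionPackageVersion)` at app startup turns a week of silent failure into an immediate, loud deploy error.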
Consequences
- `renderMediaOnLambda()` threw `ResourceNotFoundException`.
- Inngest retried silently; no client‑side error, no GA4 `video_generate_complete` event, no Sentry alert.
- Billing continued because Fal.ai still generated clips.
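In hindsight, a missing Lambda is not a transient failure, so retrying it only hides the problem. Inngest lets you mark such errors permanent with `NonRetriableError`; a sketch of the classification logic (the list of error names is our assumption):

```typescript
// Errors that retrying will never fix — surface them instead of looping.
const PERMANENT_ERRORS = new Set([
  "ResourceNotFoundException", // stale Lambda function name
  "AccessDeniedException",     // IAM misconfiguration
]);

function isPermanentError(err: { name: string }): boolean {
  return PERMANENT_ERRORS.has(err.name);
}
```

Inside the render step, rethrowing as `new NonRetriableError(err.message)` when `isPermanentError(err)` is true stops the silent retry loop and makes the failure visible in the Inngest dashboard.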
6. Recovery – targeted retry script
All 12 stuck projects already had their assets persisted (the `assets` JSONB column).
We avoided re‑generating the expensive Fal.ai clips and simply re‑rendered the saved assets.
```typescript
// The key insight: assets are already saved, just re-render
const { renderId, bucketName } = await renderMediaOnLambda({
  region: REGION,
  functionName: FUNCTION_NAME, // now pointing to the correct function
  serveUrl: SERVE_URL,
  composition: "ProductVideo",
  inputProps, // built from saved assets
  codec: "h264",
  // …
});
```
Result: 12/12 videos recovered (9 on first attempt, 3 after transient network timeouts).
Zero additional Fal.ai charges; users received completion emails.
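The retry script's only real work was rebuilding `inputProps` from the persisted `assets` column. A sketch with hypothetical field names (`clips`, `narrationUrl` — the real schema isn't shown here):

```typescript
// Hypothetical shape of the persisted assets JSONB — field names are assumed.
interface SavedAssets {
  clips: { url: string; durationSec: number }[];
  narrationUrl: string;
}

// Rebuild the render input from clips that are already paid for and stored.
function buildInputProps(assets: SavedAssets) {
  return {
    clipUrls: assets.clips.map((c) => c.url),
    totalDurationSec: assets.clips.reduce((sum, c) => sum + c.durationSec, 0),
    narrationUrl: assets.narrationUrl,
  };
}
```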
7. Automation – never forget to update the env var again
Old manual step
```shell
echo "Set the following environment variables:"
echo "  REMOTION_LAMBDA_FUNCTION_NAME="
```
New automated step
```shell
# Extract function name from deploy output
FUNC_NAME=$(echo "$FUNC_OUTPUT" | grep -oE 'remotion-render-[a-zA-Z0-9-]+' | head -1)

# Verify the function actually exists before touching any config
aws lambda get-function --function-name "$FUNC_NAME" --region "$REGION"

# Auto-update Vercel + local env
npx vercel env rm REMOTION_LAMBDA_FUNCTION_NAME production -y
echo -n "$FUNC_NAME" | npx vercel env add REMOTION_LAMBDA_FUNCTION_NAME production
sed -i '' "s|^REMOTION_LAMBDA_FUNCTION_NAME=.*|REMOTION_LAMBDA_FUNCTION_NAME=$FUNC_NAME|" .env.local
```
Now the deploy script extracts the new Lambda name, verifies it, and updates Vercel and the local .env automatically.
8. Ongoing monitoring – cron job for stuck renders
```typescript
export const monitorStuckRendersFunction = inngest.createFunction(
  { id: "monitor-stuck-renders" },
  { cron: "0 * * * *" }, // every hour
  async ({ step }) => {
    const stuckProjects = await step.run("check-stuck-projects", async () => {
      const threshold = new Date(Date.now() - 30 * 60 * 1000).toISOString();
      const { data } = await supabase
        .from("projects")
        .select("*")
        .eq("status", "rendering")
        .lt("updated_at", threshold);
      return data;
    });

    if (stuckProjects?.length) {
      // Notify Slack / email / create issue
      await step.run("alert", async () => {
        // …implementation…
      });
    }
  }
);
```
- What it does: every hour it fetches projects that have been in `rendering` for more than 30 minutes and alerts the team.
- Why: it prevents silent billing spikes and gives us a safety net for future regressions.
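The 30-minute cutoff boils down to a small predicate that is worth unit-testing on its own; a sketch:

```typescript
// A project is "stuck" if it is still rendering past the threshold.
const STUCK_THRESHOLD_MS = 30 * 60 * 1000;

function isStuck(
  status: string,
  updatedAt: string, // ISO timestamp from the DB
  now: Date = new Date()
): boolean {
  return (
    status === "rendering" &&
    now.getTime() - new Date(updatedAt).getTime() > STUCK_THRESHOLD_MS
  );
}
```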
9. Takeaways
| ✅ What worked | ❌ What failed |
|---|---|
| Assets persisted before render → cheap recovery | Manual env‑var update step was missed |
| Inngest memoization prevented duplicate Fal.ai charges | Silent retries hid the ResourceNotFoundException |
| Billing anomaly triggered investigation | No GA4 “complete” event → missing visibility |
| Automated deploy script now guarantees env‑var sync | No prior monitoring for stuck renders |
Bottom line: A tiny manual step caused a $60‑plus waste and a poor user experience. By persisting intermediate assets, automating environment updates, and adding proactive monitoring, we turned a costly outage into a learning opportunity. 🚀
Monitoring Stuck Projects
```typescript
// Example query to find projects stuck in the "rendering" state
const { data: stuckProjects } = await supabase
  .from("projects")
  .select("id, repo_name, content_mode, updated_at")
  .eq("status", "rendering")
  .lt("updated_at", threshold);

// If any projects are stuck, send an alert email with their details
if ((stuckProjects ?? []).length > 0) {
  // Send alert email with project details
}
```
If this had existed a week ago, we’d have known within an hour instead of seven days.
Real‑time Status Events
We added two events that fire when the user’s browser receives a status change via Supabase Realtime:
```typescript
// ProjectStatusListener.tsx
const channel = supabase
  .channel(`project-${projectId}`)
  .on(
    "postgres_changes",
    { /* … */ },
    (payload) => {
      if (payload.new?.status === "completed") {
        gaEvent("video_generate_complete", { project_id: projectId });
      } else if (payload.new?.status === "failed") {
        gaEvent("video_generate_fail", { project_id: projectId });
      }
    }
  )
  .subscribe();
```
Funnel Query in BigQuery
```sql
-- Start-to-complete ratio per day
SELECT
  event_date,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_start'
        THEN user_pseudo_id END) AS start_users,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_complete'
        THEN user_pseudo_id END) AS complete_users
FROM events_*
GROUP BY event_date;
```
A sudden drop in the complete/start ratio now appears as a clear signal.
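To turn that ratio into an automated alert, a small helper can flag days where completion falls well below the historical rate. A sketch — the 50%-of-baseline cutoff is our assumption, tune it to your traffic:

```typescript
// Flag a day whose complete/start ratio drops below half the baseline ratio.
function isRatioAnomaly(
  startUsers: number,
  completeUsers: number,
  baselineRatio: number // e.g. trailing 30-day complete/start average
): boolean {
  if (startUsers === 0) return false; // no traffic, nothing to judge
  return completeUsers / startUsers < baselineRatio * 0.5;
}
```

During this incident the ratio was exactly zero for a week — any reasonable cutoff would have fired on day one.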
Cost Discovery
While investigating, we realized every free‑tier user was receiving the same “Kling 3.0 Pro” clips as paying customers.
- Cost per video (Pro): ≈ $5.60
- Conversion rate: ≈ 3 %
- Result: Unsustainable customer‑acquisition cost
The Fix
| Plan | Clip Type | Clips | Approx. Length | Cost per Video |
|---|---|---|---|---|
| Free | Kling 3.0 Standard | 3 | ~15 s | $2.52 |
| Paid | Kling 3.0 Pro | 5 | ~25 s | $5.60 |
This change had two effects:
- It turned “Kling 3.0 Pro quality” into a tangible upgrade incentive.
- It cut free‑tier costs by 55%.
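The split maps cleanly to a per-plan config; a sketch using the numbers from the table above (the model identifier strings are our assumptions, not Fal.ai's actual model ids):

```typescript
// Per-plan clip settings, matching the pricing table above.
type Plan = "free" | "paid";

interface ClipConfig {
  model: string; // hypothetical model id
  clips: number;
  costPerVideoUsd: number;
}

const CLIP_CONFIG: Record<Plan, ClipConfig> = {
  free: { model: "kling-3.0-standard", clips: 3, costPerVideoUsd: 2.52 },
  paid: { model: "kling-3.0-pro", clips: 5, costPerVideoUsd: 5.6 },
};

function clipConfigFor(plan: Plan): ClipConfig {
  return CLIP_CONFIG[plan];
}
```

Centralizing this in one lookup keeps the pipeline code plan-agnostic: step 3 just asks `clipConfigFor(plan)` instead of branching on plan names.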
Lessons Learned
- Env vars are a silent single point of failure – automate their lifecycle.
- Background‑job failures are invisible by default – add explicit monitoring for “things that should have finished but didn’t.”
- Track completion, not just initiation – the absence of `video_generate_complete` data is a critical signal.
- Persist intermediate results – this allowed recovery without extra Fal.ai charges.
- Billing anomalies are monitoring signals – set up alerts on unexpected spend patterns.
Try RepoClip
RepoClip generates AI‑powered promotional videos from GitHub repositories.
Paste any public repo URL and get a video in minutes — free, no credit card required.