Our Videos Silently Failed for a Week — How a Stale Env Var Cost Us $60 and 12 Unhappy Users
Source: Dev.to
TL;DR
- What happened? Fal.ai credits were being spent, but no videos were delivered because the Remotion Lambda function name was stale.
- Why? A manual step in the deploy script failed to update the `REMOTION_LAMBDA_FUNCTION_NAME` environment variable after upgrading Remotion.
- How was it fixed? A targeted retry script re‑rendered the already‑saved assets, and the deploy script was automated to keep the env var in sync.
- What now? An hourly Inngest cron monitors for projects stuck in `rendering` and alerts us before billing spikes again.
1. The symptom
“You know that sinking feeling when you check your billing dashboard and something doesn’t add up?”
RepoClip – my AI video‑generation SaaS – was burning ≈ $20 / day on Fal.ai credits for three consecutive days (Mar 7‑9).
The first guess was “more users = more videos”, but the videos never reached any user.
2. RepoClip video pipeline
GitHub URL → Gemini Analysis → Kling Video Clips (fal.ai) → Remotion Lambda Render → Done
- Each “Video Short” → 5 AI clips (Kling 3.0 Pro) → stitched with narration (Remotion on AWS Lambda).
The pipeline is orchestrated by Inngest, which memoizes each step so retries don’t redo completed work.
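That memoization is what kept this incident from being even more expensive. The idea can be sketched in a few lines — an in-memory toy for illustration, not Inngest's actual implementation, which persists step state durably outside the process:

```typescript
// Toy step memoization: results are cached by step id, so a retried
// run replays completed steps instead of re-executing (and re-billing) them.
type StepFn<T> = () => Promise<T>;

class MemoizedRun {
  private cache = new Map<string, unknown>();

  async run<T>(id: string, fn: StepFn<T>): Promise<T> {
    if (this.cache.has(id)) {
      return this.cache.get(id) as T; // replay cached result, no side effects
    }
    const result = await fn();
    this.cache.set(id, result); // persist before moving to the next step
    return result;
  }
}
```

This is why a retry that fails at step 4 (the Lambda render) never re-runs step 3 (the Fal.ai generation): the clip results are already cached under their step id.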
Relevant steps (simplified)
| Step | Description |
|---|---|
| 1 | Fetch GitHub code |
| 2 | Analyze with Gemini |
| 3 | Generate video clips (fal.ai) ← money spent here |
| 4 | Trigger Remotion Lambda render ← failing |
| 5 | Poll for render completion |
| 6 | Update project status to completed |
3. Billing clue
Fal.ai dashboard showed $20 / day on Mar 7, 8, 9 → roughly 2–4 video generations per day, which seemed plausible for organic traffic.
4. Database query
```sql
SELECT
  created_at::date AS date,
  status,
  COUNT(*) AS count
FROM projects
WHERE created_at >= '2026-03-01'
GROUP BY date, status
ORDER BY date;
```
Result
| Date | Status | Count |
|---|---|---|
| Mar 1 | completed | 3 |
| Mar 1 | failed | 5 |
| Mar 1 | rendering | 1 |
| Mar 2 | rendering | 2 |
| Mar 5 | rendering | 1 |
| Mar 6 | rendering | 2 |
| Mar 7 | rendering | 2 |
| Mar 8 | rendering | 4 |
After March 1, no project ever reached `completed`. Every video was stuck in `rendering` – the step right after Fal.ai finished generating clips.
5. The root cause: stale Lambda function name
Environment variable (what we thought)
```
REMOTION_LAMBDA_FUNCTION_NAME=remotion-render-4-0-414-mem2048mb-disk2048mb-600sec
```
Actual Lambda functions in AWS
```shell
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `remotion`)]'
```
Output:
```
remotion-render-4-0-429-mem2048mb-disk2048mb-600sec
remotion-render-4-0-429-mem3008mb-disk4096mb-900sec
```
The function remotion-render-4-0-414… no longer existed.
We had upgraded Remotion from v4.0.414 → v4.0.429, deployed new Lambdas, deleted the old ones, but forgot to update the env var on Vercel.
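Remotion encodes its version in the Lambda function name, so a cheap guard would have caught this at boot instead of at render time. A sketch — `extractRemotionVersion` and `assertVersionMatch` are hypothetical helpers, not part of Remotion's API:

```typescript
// Remotion Lambda names embed the version, e.g.
// "remotion-render-4-0-429-mem2048mb-disk2048mb-600sec" → "4.0.429".
function extractRemotionVersion(functionName: string): string | null {
  const match = functionName.match(/^remotion-render-(\d+)-(\d+)-(\d+)-/);
  return match ? `${match[1]}.${match[2]}.${match[3]}` : null;
}

// Fail fast at startup instead of throwing ResourceNotFoundException mid-pipeline.
function assertVersionMatch(functionName: string, installedVersion: string): void {
  const deployed = extractRemotionVersion(functionName);
  if (deployed !== installedVersion) {
    throw new Error(
      `REMOTION_LAMBDA_FUNCTION_NAME points at Remotion ${deployed}, ` +
        `but ${installedVersion} is installed`
    );
  }
}
```

Calling something like `assertVersionMatch(process.env.REMOTION_LAMBDA_FUNCTION_NAME!, remotionPackageVersion)` at app startup turns a week of silent failure into an immediate, loud deploy error.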
Consequences
- `renderMediaOnLambda()` threw `ResourceNotFoundException`.
- Inngest retried silently; no client‑side error, no GA4 `video_generate_complete` event, no Sentry alert.
- Billing continued because Fal.ai still generated clips.
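In hindsight, a missing Lambda is not a transient failure, so retrying it only hides the problem. Inngest lets you mark such errors permanent with `NonRetriableError`; a sketch of the classification logic (the list of error names is our assumption):

```typescript
// Errors that retrying will never fix — surface them instead of looping.
const PERMANENT_ERRORS = new Set([
  "ResourceNotFoundException", // stale Lambda function name
  "AccessDeniedException",     // IAM misconfiguration
]);

function isPermanentError(err: { name: string }): boolean {
  return PERMANENT_ERRORS.has(err.name);
}
```

Inside the render step, rethrowing as `new NonRetriableError(err.message)` when `isPermanentError(err)` is true stops the silent retry loop and makes the failure visible in the Inngest dashboard.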
6. Recovery – targeted retry script
All 12 stuck projects already had their assets persisted (the `assets` JSONB column).
We avoided re‑generating the expensive Fal.ai clips and simply re‑rendered the saved assets.
```typescript
// The key insight: assets are already saved, just re-render
const { renderId, bucketName } = await renderMediaOnLambda({
  region: REGION,
  functionName: FUNCTION_NAME, // now pointing to the correct function
  serveUrl: SERVE_URL,
  composition: "ProductVideo",
  inputProps, // built from saved assets
  codec: "h264",
  // …
});
```
Result: 12/12 videos recovered (9 on first attempt, 3 after transient network timeouts).
Zero additional Fal.ai charges; users received completion emails.
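The retry script's only real work was rebuilding `inputProps` from the persisted `assets` column. A sketch with hypothetical field names (`clips`, `narrationUrl` — the real schema isn't shown here):

```typescript
// Hypothetical shape of the persisted assets JSONB — field names are assumed.
interface SavedAssets {
  clips: { url: string; durationSec: number }[];
  narrationUrl: string;
}

// Rebuild the render input from clips that are already paid for and stored.
function buildInputProps(assets: SavedAssets) {
  return {
    clipUrls: assets.clips.map((c) => c.url),
    totalDurationSec: assets.clips.reduce((sum, c) => sum + c.durationSec, 0),
    narrationUrl: assets.narrationUrl,
  };
}
```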
7. Automation – never forget to update the env var again
Old manual step
```shell
echo "Set the following environment variables:"
echo "  REMOTION_LAMBDA_FUNCTION_NAME="
```
New automated step
```shell
# Extract function name from deploy output
FUNC_NAME=$(echo "$FUNC_OUTPUT" | grep -oE 'remotion-render-[a-zA-Z0-9-]+' | head -1)

# Verify the function actually exists before touching any config
aws lambda get-function --function-name "$FUNC_NAME" --region "$REGION"

# Auto-update Vercel + local env
npx vercel env rm REMOTION_LAMBDA_FUNCTION_NAME production -y
echo -n "$FUNC_NAME" | npx vercel env add REMOTION_LAMBDA_FUNCTION_NAME production
sed -i '' "s|^REMOTION_LAMBDA_FUNCTION_NAME=.*|REMOTION_LAMBDA_FUNCTION_NAME=$FUNC_NAME|" .env.local
```
Now the deploy script extracts the new Lambda name, verifies it, and updates Vercel and the local .env automatically.
8. Ongoing monitoring – cron job for stuck renders
```typescript
export const monitorStuckRendersFunction = inngest.createFunction(
  { id: "monitor-stuck-renders" },
  { cron: "0 * * * *" }, // every hour
  async ({ step }) => {
    const stuckProjects = await step.run("check-stuck-projects", async () => {
      const threshold = new Date(Date.now() - 30 * 60 * 1000).toISOString();
      const { data } = await supabase
        .from("projects")
        .select("*")
        .eq("status", "rendering")
        .lt("updated_at", threshold);
      return data;
    });

    if (stuckProjects?.length) {
      // Notify Slack / email / create issue
      await step.run("alert", async () => {
        // …implementation…
      });
    }
  }
);
```
- What it does: every hour it fetches projects that have been in `rendering` for more than 30 minutes and alerts the team.
- Why: it prevents silent billing spikes and gives us a safety net for future regressions.
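The 30-minute cutoff boils down to a small predicate that is worth unit-testing on its own; a sketch:

```typescript
// A project is "stuck" if it is still rendering past the threshold.
const STUCK_THRESHOLD_MS = 30 * 60 * 1000;

function isStuck(
  status: string,
  updatedAt: string, // ISO timestamp from the DB
  now: Date = new Date()
): boolean {
  return (
    status === "rendering" &&
    now.getTime() - new Date(updatedAt).getTime() > STUCK_THRESHOLD_MS
  );
}
```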
9. Takeaways
| ✅ What worked | ❌ What failed |
|---|---|
| Assets persisted before render → cheap recovery | Manual env‑var update step was missed |
| Inngest memoization prevented duplicate Fal.ai charges | Silent retries hid the ResourceNotFoundException |
| Billing anomaly triggered investigation | No GA4 “complete” event → missing visibility |
| Automated deploy script now guarantees env‑var sync | No prior monitoring for stuck renders |
Bottom line: A tiny manual step caused a $60‑plus waste and a poor user experience. By persisting intermediate assets, automating environment updates, and adding proactive monitoring, we turned a costly outage into a learning opportunity. 🚀
Monitoring Stuck Projects
```typescript
// Example query to find projects stuck in the "rendering" state
const { data: stuckProjects } = await supabase
  .from("projects")
  .select("id, repo_name, content_mode, updated_at")
  .eq("status", "rendering")
  .lt("updated_at", threshold);

// If any projects are stuck, send an alert email with their details
if ((stuckProjects ?? []).length > 0) {
  // Send alert email with project details
}
```
If this had existed a week ago, we’d have known within an hour instead of seven days.
Real‑time Status Events
We added two events that fire when the user’s browser receives a status change via Supabase Realtime:
```typescript
// ProjectStatusListener.tsx
const channel = supabase
  .channel(`project-${projectId}`)
  .on(
    "postgres_changes",
    { /* … */ },
    (payload) => {
      if (payload.new?.status === "completed") {
        gaEvent("video_generate_complete", { project_id: projectId });
      } else if (payload.new?.status === "failed") {
        gaEvent("video_generate_fail", { project_id: projectId });
      }
    }
  )
  .subscribe();
```
Funnel Query in BigQuery
```sql
-- Start-to-complete ratio per day
SELECT
  event_date,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_start'
        THEN user_pseudo_id END) AS start_users,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_complete'
        THEN user_pseudo_id END) AS complete_users
FROM events_*
GROUP BY event_date;
```
A sudden drop in the complete/start ratio now appears as a clear signal.
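To turn that ratio into an automated alert, a small helper can flag days where completion falls well below the historical rate. A sketch — the 50%-of-baseline cutoff is our assumption, tune it to your traffic:

```typescript
// Flag a day whose complete/start ratio drops below half the baseline ratio.
function isRatioAnomaly(
  startUsers: number,
  completeUsers: number,
  baselineRatio: number // e.g. trailing 30-day complete/start average
): boolean {
  if (startUsers === 0) return false; // no traffic, nothing to judge
  return completeUsers / startUsers < baselineRatio * 0.5;
}
```

During this incident the ratio was exactly zero for a week — any reasonable cutoff would have fired on day one.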
Cost Discovery
While investigating, we realized every free‑tier user was receiving the same “Kling 3.0 Pro” clips as paying customers.
- Cost per video (Pro): ≈ $5.60
- Conversion rate: ≈ 3 %
- Result: Unsustainable customer‑acquisition cost
The Fix
| Plan | Clip Type | Clips | Approx. Length | Cost per Video |
|---|---|---|---|---|
| Free | Kling 3.0 Standard | 3 | ~15 s | $2.52 |
| Paid | Kling 3.0 Pro | 5 | ~25 s | $5.60 |
This change had two effects:
- It turned “Kling 3.0 Pro quality” into a tangible upgrade incentive.
- It cut free‑tier costs by 55%.
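The split maps cleanly to a per-plan config; a sketch using the numbers from the table above (the model identifier strings are our assumptions, not Fal.ai's actual model ids):

```typescript
// Per-plan clip settings, matching the pricing table above.
type Plan = "free" | "paid";

interface ClipConfig {
  model: string; // hypothetical model id
  clips: number;
  costPerVideoUsd: number;
}

const CLIP_CONFIG: Record<Plan, ClipConfig> = {
  free: { model: "kling-3.0-standard", clips: 3, costPerVideoUsd: 2.52 },
  paid: { model: "kling-3.0-pro", clips: 5, costPerVideoUsd: 5.6 },
};

function clipConfigFor(plan: Plan): ClipConfig {
  return CLIP_CONFIG[plan];
}
```

Centralizing this in one lookup keeps the pipeline code plan-agnostic: step 3 just asks `clipConfigFor(plan)` instead of branching on plan names.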
Lessons Learned
- Env vars are a silent single point of failure – automate their lifecycle.
- Background‑job failures are invisible by default – add explicit monitoring for “things that should have finished but didn’t.”
- Track completion, not just initiation – the absence of `video_generate_complete` data is a critical signal.
- Persist intermediate results – this allowed recovery without extra Fal.ai charges.
- Billing anomalies are monitoring signals – set up alerts on unexpected spend patterns.
Try RepoClip
RepoClip generates AI‑powered promotional videos from GitHub repositories.
Paste any public repo URL and get a video in minutes — free, no credit card required.