We Built a Full-Stack AI Music Agent with Next.js — Here's What We Learned

Published: February 14, 2026 at 05:59 PM EST
8 min read
Source: Dev.to

The Stack

  • Framework: Next.js 16 (App Router)
  • Auth: Clerk
  • Payments: Stripe
  • Audio: Web Audio API + WaveSurfer.js
  • AI: Custom agent orchestrating multiple music AI providers
  • i18n: next-intl (32 languages)
  • State: Zustand + TanStack Query
  • UI: Radix primitives + Tailwind
  • Hosting: Vercel + S3‑compatible object storage

Lesson 1: Streaming AI Responses Requires Rethinking Your Data Flow

When a user says “make me a lo‑fi beat with jazz piano,” the AI agent doesn’t just return text — it generates a song, creates cover art, extracts metadata, and streams progress updates back to the UI, all in a single conversation turn.

The naive approach is to wait for the entire response and then render. But music generation takes 30–120 seconds. You need to stream.

What we learned

  • Server‑Sent Events (SSE) over fetch – not WebSockets. For a conversational AI interface, SSE is simpler and works perfectly with Vercel’s serverless model. WebSockets would require a persistent connection and a separate infrastructure layer.
// Simplified streaming pattern
const response = await fetch('/api/agent', {
  method: 'POST',
  body: JSON.stringify({ message: userInput }),
});

if (!response.body) throw new Error('No response body');
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value);
  // Parse SSE events: text deltas, resource creation, progress updates
  processStreamEvents(chunk);
}
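The `processStreamEvents` helper above is our own name; for reference, here is a stdlib-only sketch of what that parsing step can look like, assuming each event arrives as a `data: <json>` line and events are separated by a blank line:

```typescript
// Minimal SSE chunk parser (a sketch; assumes one `data: <json>` line
// per event, with events separated by a blank line).
type StreamEvent = { type: string; [key: string]: unknown };

export function parseSseChunk(chunk: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const block of chunk.split('\n\n')) {
    for (const line of block.split('\n')) {
      if (!line.startsWith('data: ')) continue; // skip comments/other fields
      const payload = line.slice('data: '.length);
      if (payload === '[DONE]') continue;       // common end-of-stream marker
      try {
        events.push(JSON.parse(payload) as StreamEvent);
      } catch {
        // Partial JSON across a chunk boundary: real code should buffer it.
      }
    }
  }
  return events;
}
```

In production you also have to buffer incomplete lines that straddle chunk boundaries, since `read()` makes no promises about where chunks split.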
  • State management during a stream – when the agent creates a new audio resource mid‑stream you must:

    1. Update the chat message (append text)
    2. Add the new resource to the resource panel
    3. Trigger a waveform render for the new audio
    4. Update the credit balance

    All of this needs to happen smoothly without re‑renders that cause audio playback glitches.
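Our stores use Zustand, but the key idea is framework-agnostic: collect every consequence of a stream event and commit them in a single state update, so subscribers re-render once per event instead of four times. A minimal hand-rolled sketch of that batching pattern (all names are illustrative, not from our codebase):

```typescript
// A tiny store that applies all four updates from one stream event in a
// single commit, so subscribers get one notification (sketch; our real
// app uses Zustand, but the batching idea is identical).
type Resource = { id: string; audioUrl: string };
type AgentState = {
  messageText: string;
  resources: Resource[];
  pendingWaveforms: string[]; // resource ids awaiting a waveform render
  credits: number;
};
type Listener = (state: AgentState) => void;

export function createAgentStore(initial: AgentState) {
  let state = initial;
  const listeners = new Set<Listener>();

  return {
    getState: () => state,
    subscribe(fn: Listener) {
      listeners.add(fn);
      return () => listeners.delete(fn);
    },
    // One event -> one commit -> one notification.
    applyResourceCreated(textDelta: string, resource: Resource, cost: number) {
      state = {
        messageText: state.messageText + textDelta,
        resources: [...state.resources, resource],
        pendingWaveforms: [...state.pendingWaveforms, resource.id],
        credits: state.credits - cost,
      };
      listeners.forEach(fn => fn(state));
    },
  };
}
```

Refs come in for anything the audio element reads on every frame, so playback never depends on a React render completing.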

What we’d do differently: Design your state management around streaming from day one. We started with simple useState and had to refactor to Zustand stores + refs to avoid cascade re‑renders during active streams.

Lesson 2: Browser Audio Processing Is Harder Than You Think

The studio includes a real‑time mastering chain — EQ, compression, stereo width, limiter — all running in the browser via the Web Audio API. Users can tweak mastering settings and hear changes in real time, then export the mastered MP3.

Real‑time vs. offline rendering

Goal: Real‑time playback and offline rendering must produce identical output.

// The mastering pipeline (simplified)
async function renderMasteredBuffer(
  audioUrl: string,
  settings: MasteringSettings
): Promise<AudioBuffer> {
  // Decode the source audio first so we know its length and sample rate
  const arrayBuffer = await fetch(audioUrl).then(r => r.arrayBuffer());
  const decodeCtx = new AudioContext();
  const audioBuffer = await decodeCtx.decodeAudioData(arrayBuffer);

  const offlineCtx = new OfflineAudioContext(
    2,                                           // stereo
    audioBuffer.sampleRate * audioBuffer.duration,
    audioBuffer.sampleRate
  );

  // Build the same effect chain used in real‑time playback
  const source = offlineCtx.createBufferSource();
  source.buffer = audioBuffer;
  const eq = createParametricEQ(offlineCtx, settings.eq);
  const compressor = createCompressor(offlineCtx, settings.compression);
  const limiter = createLimiter(offlineCtx, settings.limiter);

  source.connect(eq).connect(compressor).connect(limiter).connect(offlineCtx.destination);
  source.start(0);

  return offlineCtx.startRendering();
}

Gotcha: OfflineAudioContext and a regular AudioContext can produce subtly different results if filter frequencies or parameter ramps aren’t identical. We extracted all shared constants into a single TypeScript file to guarantee bit‑perfect parity.
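One way to enforce that parity is to keep every parameter both contexts consume in a single module, so neither code path can drift. A sketch of what that shared file can look like (the specific values below are illustrative, not our production settings):

```typescript
// shared/mastering-constants.ts — single source of truth consumed by
// both the real-time AudioContext chain and the OfflineAudioContext
// renderer (sketch; the values are illustrative).
export const MASTERING = {
  sampleRate: 44100,
  eq: {
    lowShelfHz: 100,
    midPeakHz: 1000,
    highShelfHz: 8000,
    defaultQ: 0.707,
  },
  compressor: {
    thresholdDb: -18,
    ratio: 3,
    attackSec: 0.003,
    releaseSec: 0.25,
  },
  limiter: {
    ceilingDb: -0.3,
  },
  // Parameter ramp times must match too, or the two renders diverge.
  paramRampSec: 0.05,
} as const;
```

Both `createParametricEQ` variants then read frequencies, Q values, and ramp times from this object instead of hard-coding them locally.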

MP3 encoding in the browser

We use lamejs (a JavaScript LAME port) to encode AudioBuffers to MP3 client‑side, avoiding a round‑trip to the server. However, lamejs is CPU‑intensive — encoding a 3‑minute song can block the main thread for 2–3 seconds.

Fix: Process in chunks and yield back to the event loop.

function floatTo16BitPcm(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

async function encodeToMp3(audioBuffer: AudioBuffer): Promise<Blob> {
  const mp3encoder = new lamejs.Mp3Encoder(2, audioBuffer.sampleRate, 192);
  const chunks: Int8Array[] = [];
  const blockSize = 1152; // samples per MP3 frame

  // lamejs expects 16-bit PCM, so convert the Float32 channel data
  const left = floatTo16BitPcm(audioBuffer.getChannelData(0));
  const right = floatTo16BitPcm(audioBuffer.getChannelData(1));

  for (let i = 0; i < left.length; i += blockSize) {
    const mp3buf = mp3encoder.encodeBuffer(
      left.subarray(i, i + blockSize),
      right.subarray(i, i + blockSize)
    );
    if (mp3buf.length > 0) chunks.push(mp3buf);

    // Yield to prevent UI freeze
    if (i % (blockSize * 100) === 0) {
      await new Promise(resolve => setTimeout(resolve, 0));
    }
  }

  const end = mp3encoder.flush();
  if (end.length > 0) chunks.push(end);

  return new Blob(chunks, { type: 'audio/mp3' });
}

Lesson 3: File Uploads on Vercel Have a Hidden Limit

Vercel serverless functions impose a 4.5 MB body size limit. That sounds fine until you realize a single mastered audio file is easily 5–10 MB.

Our first approach was client → Next.js API route → object storage. This broke immediately for any real audio file.

The solution: direct client‑to‑storage uploads with pre‑signed URLs


  1. Client requests a signed upload URL from an API route.
  2. Client uploads the file directly to the storage service (S3, Cloudflare R2, etc.).
  3. Client sends the resulting public URL back to the API (tiny JSON payload).
  4. API updates the database with the file metadata.

All steps stay well under the 4.5 MB limit; the heavy file transfer bypasses Vercel entirely.

// Upload flow that bypasses Vercel's body limit
export async function uploadFileToStorageFromClient({
  file,
  filename,
  key,
}: {
  file: Blob;
  filename: string;
  key: string;
}): Promise<{ url: string }> {
  // Step 1: Get signed URL (tiny request)
  const tokenResp = await fetch('/api/upload/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ key, filename, contentType: file.type }),
  });
  const { uploadUrl, publicUrl } = await tokenResp.json();

  // Step 2: Upload directly to object storage (no Vercel in the middle)
  await fetch(uploadUrl, {
    method: 'PUT',
    body: file,
    headers: { 'Content-Type': file.type },
  });

  return { url: publicUrl };
}

This pattern is essential for any media‑heavy app on Vercel.
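The token route itself is mostly your storage SDK's presigner (for S3-compatible stores, typically `getSignedUrl` with a `PutObjectCommand` from the AWS SDK). The part worth sketching is the validation you wrap around it, since the client controls the request body. All names and limits below are illustrative, not from the original app:

```typescript
// Validation a signed-upload token route should perform before
// presigning anything (sketch; the allowed types and key pattern
// are illustrative).
const ALLOWED_TYPES = new Set(['audio/mpeg', 'audio/wav', 'image/png']);
const KEY_PATTERN = /^[a-zA-Z0-9/_-]+\.[a-z0-9]+$/;

export function validateUploadRequest(req: {
  key: string;
  contentType: string;
}): { ok: true } | { ok: false; error: string } {
  if (!ALLOWED_TYPES.has(req.contentType)) {
    return { ok: false, error: `unsupported content type: ${req.contentType}` };
  }
  // Reject path traversal and anything outside a safe character set
  if (req.key.includes('..') || !KEY_PATTERN.test(req.key)) {
    return { ok: false, error: 'invalid storage key' };
  }
  return { ok: true };
}
```

Scoping the signed URL to a single key, content type, and short expiry means a leaked URL can only overwrite the one object it was issued for.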

Lesson 4 – i18n at Scale Is a Product Decision, Not a Technical One

Gliss supports 32 languages, not the usual three or five. Here is the i18n setup:

// routing.ts
import { defineRouting } from 'next-intl/routing';

export const routing = defineRouting({
  locales: SUPPORTED_LOCALE_CODES, // 32 locales
  defaultLocale: 'en',
  localePrefix: 'as-needed', // No /en prefix for English
});

Setting localePrefix: 'as-needed' eliminated a ~790 ms redirect from / to /en, a measurable Lighthouse win.
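The 'as-needed' behavior boils down to a simple mapping. This is a sketch of what next-intl does for us, not its actual implementation:

```typescript
// What localePrefix: 'as-needed' means for URL construction (sketch of
// the observable behavior, not next-intl's internals).
export function localizedPath(
  path: string,
  locale: string,
  defaultLocale = 'en'
): string {
  // The default locale gets no prefix, so / serves English directly
  // instead of redirecting to /en.
  return locale === defaultLocale ? path : `/${locale}${path}`;
}
```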

Practical lessons

  • Use AI for the initial translation pass, then have native speakers review. Pure AI translation makes embarrassing mistakes with music terminology.
  • Keep English terms for industry jargon (e.g., “mastering,” “stems,” “BPM,” “MIDI”). Musicians worldwide use these terms.
  • RTL languages (Arabic, Hebrew, Urdu, Persian) need layout testing, not just translation. Flex layouts can break; test thoroughly.
  • Don’t translate dynamically. Load all translations at build time. next-intl’s server components avoid shipping translation bundles to the client unnecessarily.

Lesson 5 – Content Security Policy Will Break Everything You Love

Adding a proper CSP header inevitably starts a day of “whack‑a‑mole.” Every external script, font, analytics pixel, and auth widget needs explicit permission:

value: [
  "default-src 'self'",
  "script-src 'self' 'unsafe-eval' 'unsafe-inline' https://your-auth-provider.com https://*.yourdomain.com",
  "connect-src 'self' https://*.yourdomain.com https: blob: data: wss:",
  "style-src 'self' 'unsafe-inline' https://fonts.googleapis.com",
  "font-src 'self' data: https://fonts.gstatic.com",
  "media-src 'self' https: blob: data:",
  "worker-src 'self' blob:",
].join('; ')

The blob: and data: entries in media-src are crucial for audio apps — the Web Audio API creates blob URLs for playback, and OfflineAudioContext renders to data URIs.

Do it anyway. CSP is non‑negotiable for production apps handling payments and user data.
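For context, a directive array like the one above is typically attached to every route via the headers() hook in next.config. A minimal sketch of that standard Next.js mechanism:

```javascript
// next.config.mjs (sketch) — attach the CSP value to every route
const csp = [
  "default-src 'self'",
  "media-src 'self' https: blob: data:",
  // ...the rest of the directives shown above
].join('; ');

/** @type {import('next').NextConfig} */
const nextConfig = {
  async headers() {
    return [
      {
        source: '/(.*)',
        headers: [{ key: 'Content-Security-Policy', value: csp }],
      },
    ];
  },
};

export default nextConfig;
```

Start with the policy in Content-Security-Policy-Report-Only mode so violations are logged instead of breaking the app while you enumerate every third-party origin.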

Lesson 6 – Optimizing Bundle Size With Next.js

Our initial bundle shipped the entirety of react‑icons, which is massive. Enabling Next.js’s optimizePackageImports gave us a big win:

experimental: {
  optimizePackageImports: [
    'react-icons/si',
    'react-icons/fa6',
    'react-icons/md',
    'react-icons/lu',
    'lucide-react',
    '@clerk/nextjs',
  ],
},

This tells Next.js to tree‑shake these packages more aggressively. For react-icons alone it cut ~200 KB from the bundle.

Other wins

  • inlineCss: true – eliminates the separate CSS request, reducing time‑to‑first‑paint.
  • Lazy‑load heavy viewers (MIDI viewer, waveform renderer) with next/dynamic.

What We’d Do Differently

  • Start with a streaming architecture. Retrofitting streaming into a request‑response mental model is painful.
  • Use S3‑compatible direct uploads from day 1. Don’t route binary files through your API layer.
  • Set up CSP on day 1. Adding it later means debugging every third‑party integration you’ve already embedded.
  • Invest in i18n infrastructure early. Adding a 32nd language is easy when your pipeline is automated; adding a 2nd language with hard‑coded strings everywhere is a nightmare.
  • Build your audio pipeline with OfflineAudioContext first, then port to real‑time. Getting offline rendering right guarantees your real‑time version will be correct.

Try It

If you want to see all of this in action, check out Gliss. You can generate a song from a text description, master it in your browser, and export — no account required for your first few creations.

The music‑AI space is moving incredibly fast. If you’re building anything with audio in the browser, we hope these lessons save you the debugging time we spent.

What’s the hardest technical challenge you’ve hit building with audio in the browser? We’d love to hear about it in the comments.
