Connecting AI Voice Agents to SIP & PSTN Using NextGenSwitch

Published: February 6, 2026 at 12:14 AM EST

Source: Dev.to

Bridging AI Voice Agents with Real Phone Calls

Building an AI voice agent is relatively easy today.
Connecting that agent to real phone calls (SIP, PBX, PSTN) is not. Most AI voice systems are designed to work with WebSockets and raw audio streams, while production telephony still relies on SIP, RTP, and PSTN infrastructure. This mismatch is where many voice‑AI projects struggle to move beyond demos.

This post explains how NextGenSwitch bridges that gap—allowing any AI voice system to interact with real phone callers using a Twilio‑style streaming interface, without exposing SIP or RTP complexity to AI developers.

The Core Problem

AI voice systems typically expect:

WebSocket → PCM audio → AI pipeline → PCM audio

Telephony systems operate very differently:

PSTN → SIP Trunk → PBX → RTP (μ-law / A-law)

Key challenges

SIP and RTP are stateful and codec‑sensitive
AI systems expect clean, ordered audio frames
Handling barge‑in, latency, and scaling is non‑trivial
Most AI frameworks are not PBX‑aware

The Role of NextGenSwitch

NextGenSwitch acts as a telephony abstraction layer between traditional phone systems and modern AI services.

It provides:

SIP & PSTN termination
Integration with PBX systems (Asterisk / FreeSWITCH)
A Twilio‑style Programmable Voice API
Real‑time WebSocket audio streaming
Codec and sample‑rate normalization

Your AI service never has to interact directly with SIP or RTP.

High‑Level Architecture

Caller
|
[PSTN / SIP Trunk]
|
[Asterisk / FreeSWITCH]
|
[NextGenSwitch]
| (WebSocket audio stream)
[AI Voice Service]

The AI voice service can be:

A custom WebSocket server
A cloud‑based AI endpoint
An on‑prem STT + LLM + TTS stack
Any framework capable of handling real‑time audio

Twilio‑Style XML Call Control

When a call reaches NextGenSwitch, it fetches XML instructions—similar to Twilio’s TwiML.

Minimal XML (only the stream URL is required)

This instruction:

Answers the call
Opens a bidirectional WebSocket
Starts real‑time audio streaming

Optional Parameters (examples only)

Parameters are not mandatory; they are passed as metadata to your AI service.

Parameter values appear under customParameters in the JSON start event and can be used for routing, prompt selection, or CRM lookups.
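As a rough sketch, a webhook could generate both XML shapes (stream URL only, and stream URL plus optional parameters) with Python's standard library. The element and attribute names below (`Response`, `Connect`, `Stream`, `Parameter`) are borrowed from Twilio's TwiML and are an assumption, not confirmed NextGenSwitch syntax. The function name `build_stream_xml` and the `wss://ai.example.com/stream` URL are placeholders.

```python
import xml.etree.ElementTree as ET


def build_stream_xml(stream_url, parameters=None):
    """Return call-control XML that connects a call to a WebSocket stream.

    ASSUMPTION: element names follow TwiML (<Response><Connect><Stream>);
    check the NextGenSwitch documentation for the exact schema.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # The stream URL is the only required piece of information.
    stream = ET.SubElement(connect, "Stream", {"url": stream_url})
    # Optional metadata, forwarded to the AI service as custom parameters.
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": str(value)})
    return ET.tostring(response, encoding="unicode")


# Minimal form: only the stream URL (hypothetical endpoint).
minimal_xml = build_stream_xml("wss://ai.example.com/stream")

# With optional metadata, reusing the keys shown in the start event.
xml_with_params = build_stream_xml(
    "wss://ai.example.com/stream",
    {"agent": "support-bot", "tenant_id": "company-01"},
)
```

Building the XML with an XML library rather than string formatting keeps attribute values correctly escaped when parameters come from user or CRM data.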

WebSocket Streaming Protocol (JSON)

NextGenSwitch uses a Twilio Media Streams‑style JSON protocol. Your AI service only needs to handle a small set of events.

`start` Event

Sent once when the stream begins.

{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {
      "agent": "support-bot",
      "tenant_id": "company-01"
    }
  }
}

Save the streamId—it must be included in outbound audio messages.
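One way to capture this on the AI side is to parse the start event into a small per-call record. The sketch below assumes the field names shown in the example above. The `CallSession` class and `parse_start_event` function are illustrative names, not part of any NextGenSwitch SDK.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Per-call state kept by the AI service (hypothetical helper)."""
    stream_id: str
    call_id: str
    caller: str
    callee: str
    params: dict = field(default_factory=dict)


def parse_start_event(raw):
    """Parse a start event (JSON text) into a CallSession."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError("expected a start event")
    start = msg.get("start", {})
    return CallSession(
        stream_id=msg["streamId"],  # needed later for outbound media
        call_id=start.get("callId", ""),
        caller=start.get("from", ""),
        callee=start.get("to", ""),
        params=dict(start.get("customParameters") or {}),
    )


example = """{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {"agent": "support-bot", "tenant_id": "company-01"}
  }
}"""
session = parse_start_event(example)
```

The `params` dictionary is where routing or prompt selection would typically read from, for example `session.params.get("agent")`.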

`media` Event (Inbound Audio)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Audio characteristics

Codec: G.711 μ-law
Sample rate: 8 kHz
Payload: base64‑encoded audio frames
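Most STT engines expect linear PCM rather than μ-law, so the payload usually has to be decoded first. The sketch below base64-decodes a payload and expands G.711 μ-law bytes to 16-bit little-endian PCM with the standard expansion formula. It assumes mono audio, and the function names are illustrative.

```python
import base64
import struct


def ulaw_to_pcm16(byte):
    """Expand one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# Precompute all 256 values once; decoding then becomes a table lookup.
_ULAW_TABLE = [ulaw_to_pcm16(b) for b in range(256)]


def decode_media_payload(payload_b64):
    """Turn a media payload into 16-bit little-endian PCM bytes (8 kHz mono)."""
    ulaw = base64.b64decode(payload_b64)
    samples = [_ULAW_TABLE[b] for b in ulaw]
    return struct.pack("<%dh" % len(samples), *samples)
```

At 8 kHz with one byte per sample, one second of inbound audio is 8000 payload bytes before base64 encoding, and the decoded PCM is twice that size.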

`media` Event (Outbound Audio)

Your AI service responds using the same structure:

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

NextGenSwitch converts this audio back into telephony format and sends it to the caller.
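A possible way to build such a message from TTS output is sketched below. It assumes the outbound payload uses the same encoding as inbound audio (G.711 μ-law, 8 kHz, mono), which this post does not state explicitly, and that the TTS audio has already been resampled to 8 kHz 16-bit PCM. The function names are illustrative.

```python
import base64
import json
import struct

_BIAS = 0x84    # standard G.711 mu-law bias
_CLIP = 32635   # largest magnitude that can be encoded after adding the bias


def pcm16_to_ulaw(sample):
    """Compress one signed 16-bit PCM sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_message(stream_id, pcm16_le):
    """Wrap 16-bit little-endian PCM (8 kHz mono) in an outbound media event."""
    count = len(pcm16_le) // 2
    samples = struct.unpack("<%dh" % count, pcm16_le[: count * 2])
    ulaw = bytes(pcm16_to_ulaw(s) for s in samples)
    return json.dumps({
        "event": "media",
        "streamId": stream_id,  # the value saved from the start event
        "media": {"payload": base64.b64encode(ulaw).decode("ascii")},
    })
```

Sending audio in short chunks rather than one large message generally keeps latency low and makes it easier to stop playback when the caller interrupts. The right chunk size for NextGenSwitch is not specified here.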

`stop` Event

{
  "event": "stop",
  "streamId": "NGS_STREAM_123456",
  "stop": {
    "reason": "hangup"
  }
}
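The start, media, and stop events can be handled by a small per-call dispatcher. The sketch below is transport-agnostic: it takes one JSON text frame and returns the frames to send back, so it can sit behind any WebSocket library. `EchoSession` is a hypothetical test harness that echoes caller audio, on the assumption that outbound payloads may use the same encoding as inbound ones. A real agent would run its STT, LLM, and TTS pipeline in its place.

```python
import json


class EchoSession:
    """Tracks one call and echoes caller audio back (for testing only)."""

    def __init__(self):
        self.stream_id = None
        self.closed = False
        self.stop_reason = None

    def handle(self, raw):
        """Process one inbound JSON frame; return a list of JSON frames to send."""
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "start":
            # Remember the stream ID; every outbound message must carry it.
            self.stream_id = msg["streamId"]
            return []
        if event == "media":
            if self.stream_id is None:
                return []  # media before start: nothing to reply to yet
            reply = {
                "event": "media",
                "streamId": self.stream_id,
                "media": {"payload": msg["media"]["payload"]},
            }
            return [json.dumps(reply)]
        if event == "stop":
            # The call is over; release per-call resources here.
            self.closed = True
            self.stop_reason = msg.get("stop", {}).get("reason")
            return []
        return []  # ignore unknown event types
```

Keeping the handler free of network code makes it easy to unit test with the JSON samples from this post before wiring it to a live call.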

AI Stack: Fully Flexible

NextGenSwitch does not require any specific AI framework. You can use:

Any STT engine
Any LLM
Any TTS engine
Any programming language

Reference implementations (e.g., Pipecat) are optional.

Why This Architecture Works

No SIP or RTP handling in AI code
Twilio‑style, developer‑friendly interface
Real‑time, low‑latency audio streaming
Vendor‑neutral AI integration
Production‑ready PSTN scalability

Common Use Cases

AI receptionist
AI call‑center agents
Voice‑based order processing
Appointment booking
IVR replacement
Multilingual voice bots

Key Takeaways

Only the stream element, with its URL, is mandatory
XML parameters are optional metadata
Streaming protocol follows Twilio‑style JSON
Telephony audio uses μ‑law @ 8 kHz
AI logic is completely decoupled from PBX logic

Learn More

Programmable Voice Stream API:
AI streaming examples:
