Connecting AI Voice Agents to SIP & PSTN Using NextGenSwitch

Published: (February 6, 2026 at 12:14 AM EST)
Source: Dev.to

Bridging AI Voice Agents with Real Phone Calls

Building an AI voice agent is relatively easy today.
Connecting that agent to real phone calls (SIP, PBX, PSTN) is not. Most AI voice systems are designed to work with WebSockets and raw audio streams, while production telephony still relies on SIP, RTP, and PSTN infrastructure. This mismatch is where many voice‑AI projects struggle to move beyond demos.

This post explains how NextGenSwitch bridges that gap—allowing any AI voice system to interact with real phone callers using a Twilio‑style streaming interface, without exposing SIP or RTP complexity to AI developers.


The Core Problem

AI voice systems typically expect:

WebSocket → PCM audio → AI pipeline → PCM audio

Telephony systems operate very differently:

PSTN → SIP Trunk → PBX → RTP (μ-law / A-law)

Key challenges

  • SIP and RTP are stateful and codec‑sensitive
  • AI systems expect clean, ordered audio frames
  • Handling barge‑in, latency, and scaling is non‑trivial
  • Most AI frameworks are not PBX‑aware

The Role of NextGenSwitch

NextGenSwitch acts as a telephony abstraction layer between traditional phone systems and modern AI services.

It provides:

  • SIP & PSTN termination
  • Integration with PBX systems (Asterisk / FreeSWITCH)
  • A Twilio‑style Programmable Voice API
  • Real‑time WebSocket audio streaming
  • Codec and sample‑rate normalization

Your AI service never has to interact directly with SIP or RTP.


High‑Level Architecture

Caller
|
[PSTN / SIP Trunk]
|
[Asterisk / FreeSWITCH]
|
[NextGenSwitch]
| 
|
[AI Voice Service]

The AI voice service can be:

  • A custom WebSocket server
  • A cloud‑based AI endpoint
  • An on‑prem STT + LLM + TTS stack
  • Any framework capable of handling real‑time audio

Twilio‑Style XML Call Control

When a call reaches NextGenSwitch, it fetches XML instructions—similar to Twilio’s TwiML.

Minimal XML (only the stream URL is required)


  
    
  

This instruction:

  • Answers the call
  • Opens a bidirectional WebSocket
  • Starts real‑time audio streaming
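As a rough sketch, a webhook could generate such an instruction document in Python as shown here. The element names (`Response`, `Connect`, `Stream`) and the `url` attribute are borrowed from Twilio's TwiML and are assumptions, not confirmed NextGenSwitch syntax; the WebSocket URL is a placeholder.

```python
import xml.etree.ElementTree as ET


def build_stream_xml(ws_url: str) -> str:
    """Return a TwiML-style document that connects the call to a WebSocket.

    Element and attribute names follow Twilio's TwiML and are assumed here;
    the actual NextGenSwitch schema may differ.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": ws_url})
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(build_stream_xml("wss://ai.example.com/voice"))
```

Building the document with an XML library rather than string concatenation keeps attribute values correctly escaped if the URL ever carries query parameters.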

Optional Parameters (examples only)

Parameters are not mandatory; they are passed as metadata to your AI service.


  
    
      
      
      
    
  

These values appear under customParameters in the JSON start event and can be used for routing, prompt selection, or CRM lookups.
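One way to use this metadata is to pick a system prompt per agent when the stream starts. The sketch below reads the `customParameters` object from the start event described later in this post; the prompt table, the `sales-bot` key, and the fallback values are hypothetical.

```python
import json
from typing import Tuple

# Hypothetical prompt table; keys mirror the "agent" value in the start event.
AGENT_PROMPTS = {
    "support-bot": "You are a patient customer-support agent.",
    "sales-bot": "You are a concise sales assistant.",
}
DEFAULT_PROMPT = "You are a helpful voice assistant."


def select_prompt(start_message: str) -> Tuple[str, str]:
    """Return (tenant_id, system_prompt) for a raw JSON start event."""
    event = json.loads(start_message)
    # Parameters are optional, so every lookup falls back to a default.
    params = event.get("start", {}).get("customParameters", {})
    tenant_id = params.get("tenant_id", "unknown")
    prompt = AGENT_PROMPTS.get(params.get("agent"), DEFAULT_PROMPT)
    return tenant_id, prompt
```

Because the parameters are optional, the function never assumes a key is present and always returns a usable prompt.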


WebSocket Streaming Protocol (JSON)

NextGenSwitch uses a Twilio Media Streams‑style JSON protocol. Your AI service only needs to handle a small set of events.

start Event

Sent once when the stream begins.

{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {
      "agent": "support-bot",
      "tenant_id": "company-01"
    }
  }
}

Save the streamId—it must be included in outbound audio messages.
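A minimal sketch of capturing the start event in Python, assuming the message arrives as a JSON text frame shaped like the example above. The `StreamSession` class and `parse_start` function are illustrative names, not part of any NextGenSwitch SDK.

```python
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StreamSession:
    """Per-call state kept for the lifetime of one WebSocket stream."""
    stream_id: str
    call_id: str
    caller: str
    callee: str
    params: Dict[str, str] = field(default_factory=dict)


def parse_start(raw: str) -> StreamSession:
    """Parse a raw start event and keep the fields needed later in the call."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError(f"expected a start event, got {msg.get('event')!r}")
    start = msg["start"]
    return StreamSession(
        stream_id=msg["streamId"],  # echoed back in every outbound media message
        call_id=start["callId"],
        caller=start["from"],
        callee=start["to"],
        params=start.get("customParameters", {}),
    )
```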

media Event (Inbound Audio)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Audio characteristics

  • Codec: G.711 μ-law
  • Sample rate: 8 kHz
  • Payload: base64‑encoded audio frames
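Most STT engines expect linear PCM rather than μ-law, so the payload usually needs decoding first. The sketch below is a pure-Python G.711 μ-law decoder that turns a base64 payload into 16-bit little-endian PCM at the same 8 kHz rate; the function names and the choice of output format are assumptions, and resampling to a higher rate is left out.

```python
import base64
import struct


def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# Precompute all 256 values once; decoding then becomes a table lookup.
ULAW_TABLE = [ulaw_to_linear(b) for b in range(256)]


def decode_media_payload(payload_b64: str) -> bytes:
    """Convert a base64 mu-law payload to 16-bit little-endian PCM bytes."""
    ulaw = base64.b64decode(payload_b64)
    samples = [ULAW_TABLE[b] for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each μ-law byte becomes two PCM bytes, so the decoded buffer is twice the size of the decoded payload.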

media Event (Outbound Audio)

Your AI service responds using the same structure:

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

NextGenSwitch converts this audio back into telephony format and sends it to the caller.
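A sketch of the outbound direction in Python, assuming the outbound payload uses the same μ-law, 8 kHz encoding as inbound audio (the post does not state this explicitly) and that the TTS output has already been resampled to 8 kHz 16-bit little-endian PCM. The function names are illustrative.

```python
import base64
import json
import struct

BIAS = 0x84    # standard G.711 mu-law bias
CLIP = 32635   # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Compress one signed 16-bit PCM sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: the position of the highest set bit between 14 and 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_message(stream_id: str, pcm16: bytes) -> str:
    """Wrap 16-bit little-endian PCM in an outbound media event."""
    count = len(pcm16) // 2  # ignore a trailing odd byte, if any
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    ulaw = bytes(linear_to_ulaw(s) for s in samples)
    return json.dumps({
        "event": "media",
        "streamId": stream_id,  # the value saved from the start event
        "media": {"payload": base64.b64encode(ulaw).decode("ascii")},
    })
```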

stop Event

Sent when the stream ends, for example when the caller hangs up.

{
  "event": "stop",
  "streamId": "NGS_STREAM_123456",
  "stop": {
    "reason": "hangup"
  }
}
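Taken together, the three event types can be handled by one small dispatcher. The sketch below is transport-agnostic: it assumes each WebSocket text frame is passed in as a string, and `CallState` and `handle_message` are illustrative names. A real service would forward media payloads to its STT pipeline rather than just counting them.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallState:
    stream_id: Optional[str] = None
    frames_received: int = 0
    stop_reason: Optional[str] = None

    @property
    def active(self) -> bool:
        """True between the start event and the stop event."""
        return self.stream_id is not None and self.stop_reason is None


def handle_message(state: CallState, raw: str) -> CallState:
    """Update the call state from one raw JSON message."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        state.stream_id = msg["streamId"]
    elif event == "media":
        state.frames_received += 1  # placeholder for handing audio to STT
    elif event == "stop":
        state.stop_reason = msg.get("stop", {}).get("reason", "unknown")
    # Unknown event types are ignored so new event types do not break the handler.
    return state
```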

AI Stack: Fully Flexible

NextGenSwitch does not require any specific AI framework. You can use:

  • Any STT engine
  • Any LLM
  • Any TTS engine
  • Any programming language

Reference implementations (e.g., Pipecat) are optional, not required.


Why This Architecture Works

  • No SIP or RTP handling in AI code
  • Twilio‑style, developer‑friendly interface
  • Real‑time, low‑latency audio streaming
  • Vendor‑neutral AI integration
  • Production‑ready PSTN scalability

Common Use Cases

  • AI receptionist
  • AI call‑center agents
  • Voice‑based order processing
  • Appointment booking
  • IVR replacement
  • Multilingual voice bots

Key Takeaways

  • Only the stream element (with its WebSocket URL) is mandatory
  • XML parameters are optional metadata
  • Streaming protocol follows Twilio‑style JSON
  • Telephony audio uses μ‑law @ 8 kHz
  • AI logic is completely decoupled from PBX logic

Learn More

  • Programmable Voice Stream API:
  • AI streaming examples: