How I Built SilentEar — A Real-Time AI Accessibility Agent for Deaf Users with Gemini Live API
Source: Dev.to
Why I Built This
My son was born profoundly deaf. As he began learning Pakistan Sign Language (PSL) at school, I started basic training alongside him. Through this journey I saw how daily life can be isolating and even dangerous for Deaf individuals and their families:
- Feeling irrelevant or isolated in group settings
- Struggling to communicate smoothly with hearing people
- Missing critical environmental alerts (fire alarms, door knocks, baby cries, etc.)
While AI has advanced dramatically, most accessibility tools still rely on simple speech‑to‑text transcription. Transcripts miss the context that makes a sound urgent. Inspired by my son’s experience and the power of the Gemini Live API, I set out to build an agent that listens, interprets, and delivers life‑saving cues in formats Deaf users actually need (haptic feedback, screen flashes, visual sign language).
What SilentEar Does
SilentEar continuously monitors ambient audio, extracts meaning, and alerts the user in real time.
| Core Capability | How It Works |
|---|---|
| Environmental sound detection | Gemini Live API streams bidirectional audio; function calling (trigger_alert) fires custom alerts (dog bark, doorbell, siren, name‑call, etc.). |
| Context‑aware transcription | Gemini 3 Flash refines noisy speech into clean sentences and adds scene intelligence. |
| Visual sign language support | SignMoji – a library of sign‑language videos that appear with alerts. Users can add custom SignMojis via video upload, URL, or web search. |
| Two‑way communication | AI‑powered Voice Deck provides text‑to‑speech with smart, context‑aware phrase prediction (Gemini 3 Flash). |
| Caregiver dashboard | Trusted contacts can view live alerts, device status, and history remotely. |
Architecture Overview
Frontend
- Framework: React 19 + TypeScript (PWA)
- Styling: Tailwind CSS
- Audio processing: Web Audio API + local FFT for ultra‑low‑latency alarm detection
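
The post does not show the local detector, so here is a minimal sketch of how a narrow-band alarm tone could be spotted on-device before any audio reaches the cloud. It uses the Goertzel algorithm (a single-bin DFT) rather than a full FFT to stay short. The function names, the 3 kHz target, and the 0.5 threshold are all assumptions for illustration.

```typescript
// Hypothetical sketch: none of these names come from the SilentEar codebase.

/** Squared magnitude of the DFT bin nearest `targetHz`, via the Goertzel algorithm. */
function goertzelPower(samples: Float32Array, sampleRate: number, targetHz: number): number {
  const omega = (2 * Math.PI * targetHz) / sampleRate;
  const coeff = 2 * Math.cos(omega);
  let s1 = 0; // s[n-1]
  let s2 = 0; // s[n-2]
  for (let i = 0; i < samples.length; i++) {
    const s0 = samples[i] + coeff * s1 - s2;
    s2 = s1;
    s1 = s0;
  }
  return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

/** True when roughly `minRatio` or more of the frame's energy sits at `targetHz`. */
function hasTone(
  samples: Float32Array,
  sampleRate: number,
  targetHz: number,
  minRatio = 0.5
): boolean {
  let energy = 0;
  for (let i = 0; i < samples.length; i++) energy += samples[i] * samples[i];
  if (energy === 0) return false; // silence

  // By Parseval's theorem, 2|X_k|^2 / (N * energy) is the share of the
  // frame's energy carried by frequency bin k (close to 1 for a pure tone).
  const fraction = (2 * goertzelPower(samples, sampleRate, targetHz)) / (samples.length * energy);
  return fraction >= minRatio;
}
```

In a browser this would typically run on each frame delivered by an AudioWorklet or AnalyserNode, for example `hasTone(frame, 16000, 3000)`. A real smoke-alarm detector would likely check several frequencies and require the tone to persist across frames.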
Backend
- Runtime: Node.js + Express on Google Cloud Run
- Streaming: WebSocket proxy that forwards PCM audio (16 kHz) to Gemini Live API
- AI integration: @google/genai SDK – live audio streaming, function calling (trigger_alert), and REST calls to Gemini 3 Flash for transcript refinement and scene analysis
Data & Media
| Service | Role |
|---|---|
| Supabase (PostgreSQL + Realtime + Storage) | User profiles, custom SignMoji libraries, trigger definitions, caregiver sync |
| Cloud Firestore | Alert history, device status, trigger configurations |
| Google Cloud Run | Hosts Express + WebSocket backend, runs server‑side REST endpoints for Gemini 3 Flash processing |
Audio Flow
Device Microphone → PCM Audio (16 kHz) → WebSocket → Cloud Run → Gemini Live API
                                                                        ↓
                          Haptic + Visual Alerts ← Function Call (trigger_alert)

Gemini Live Function Calling
Instead of naïve keyword matching, SilentEar gives Gemini a trigger_alert tool that knows the user’s custom categories. When the model hears a matching sound or phrase, it calls the tool, instantly notifying the device.
    import { FunctionDeclaration, Type } from '@google/genai';

    // Tool the model can call when it hears something that matches a user-defined alert.
    const triggerTool: FunctionDeclaration = {
      name: 'trigger_alert',
      description: 'Call this when an environmental sound or keyword matches alert categories.',
      parameters: {
        type: Type.OBJECT,
        properties: {
          alert_id: {
            type: Type.STRING,
            description: 'The ID of the alert to trigger.'
          },
          context: {
            type: Type.STRING,
            description: 'Short summary of what was heard.'
          }
        },
        required: ['alert_id']
      }
    };

Result: because the model reasons about context instead of matching keywords, Gemini can distinguish a dog barking on TV from a real dog at the door, which reduces false alarms.
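
Once the model calls trigger_alert, something has to turn that call into an on-device alert and acknowledge it to the model. The post does not show this step, so the following is only a sketch under assumptions. The ToolCall and ToolResult types are local stand-ins loosely modelled on a function call and its response, and Trigger, AlertEvent, and handleToolCall are hypothetical names.

```typescript
// Hypothetical sketch: the types below are local stand-ins, not SDK imports.

interface ToolCall {
  id?: string;
  name?: string;
  args?: Record<string, unknown>;
}

interface Trigger {
  id: string;
  label: string;
  vibrationPattern: number[]; // milliseconds, alternating vibrate/pause
  color: string;
}

interface AlertEvent {
  trigger: Trigger;
  context: string;
}

interface ToolResult {
  id: string | undefined;
  name: string;
  response: { ok: boolean; error?: string };
}

/** Map a model tool call to a user-configured alert plus a reply for the model. */
function handleToolCall(
  call: ToolCall,
  triggers: Trigger[]
): { alert: AlertEvent | null; result: ToolResult } {
  const name = call.name ?? 'unknown';
  const fail = (error: string) => ({
    alert: null,
    result: { id: call.id, name, response: { ok: false, error } },
  });

  if (call.name !== 'trigger_alert') return fail('unsupported tool');

  // Model output is untrusted: validate argument types before using them.
  const alertId = call.args?.['alert_id'];
  if (typeof alertId !== 'string') return fail('missing alert_id');

  const trigger = triggers.filter((t) => t.id === alertId)[0];
  if (!trigger) return fail('unknown alert_id');

  const rawContext = call.args?.['context'];
  const context = typeof rawContext === 'string' ? rawContext : '';
  return {
    alert: { trigger, context },
    result: { id: call.id, name, response: { ok: true } },
  };
}
```

The returned alert would drive the haptic and visual output on the device, and the result would be sent back to the model as the tool response. Rejecting unknown alert IDs keeps a hallucinated category from firing an alert.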
Gemini 3 Flash Enhancements
| Feature | Benefit |
|---|---|
| Scene Analysis | Periodic summaries (“Two people are talking nearby. Someone mentioned your name.”) |
| Transcript Refinement | Turns choppy fragments into clean, readable sentences |
| Trigger Auto‑Discovery | Analyzes ambient patterns and suggests new alert categories for the user |
All of these run as lightweight REST endpoints on Cloud Run, keeping the mobile client fast and responsive.
Full Stack Diagram (Simplified)
+----------------+      WebSocket      +----------------+   Gemini Live API
| Mobile Device  | ──────────────────► |   Cloud Run    | ──────────────────► |
|  (React PWA)   |                     |   Express WS   |                     |
+----------------+                     +----------------+                     |
        │                                      │                              |
        │                                      ▼                              |
        │                              +----------------+                     |
        │                              | Gemini 3 Flash |                     |
        │                              +----------------+                     |
        │                                      │                              |
        ▼                                      ▼                              ▼
Haptic / Visual Alerts                Refined Transcripts              Scene Summaries

Getting Started (Quick‑Start)
Clone the repo

    git clone https://github.com/your-username/silent-ear.git
    cd silent-ear

Set up environment variables (.env.local)

    GOOGLE_API_KEY=your_google_api_key
    SUPABASE_URL=...
    SUPABASE_ANON_KEY=...
    FIRESTORE_PROJECT_ID=...

Run locally

    # Frontend
    npm install && npm run dev
    # Backend
    cd backend && npm install && npm start

Deploy (optional) – push the backend to Cloud Run and the frontend to Firebase Hosting or any static‑site host.
Closing Thoughts
SilentEar shows how context‑aware AI can move beyond transcription to truly interpret the world for Deaf users. By leveraging Gemini Live’s streaming + function calling and Gemini 3 Flash’s scene intelligence, we deliver timely, multimodal alerts that keep users safe and connected.
If you’re interested in collaborating, testing, or extending the platform, feel free to open an issue or reach out directly.
| Service | Role |
|---|---|
| Gemini Live API | Real‑time bidirectional audio streaming with tool calling |
| Gemini 3 Flash | Scene intelligence, NLP post‑processing |
| Cloud Build | Automated CI/CD pipeline (Docker build → deploy) |
Automated Deployment
Deployment is fully automated via a single cloudbuild.yaml file:
    steps:
      - name: 'gcr.io/cloud-builders/docker'
        args: ['build', '-t', 'gcr.io/$PROJECT_ID/silentear-backend', '.']
      - name: 'gcr.io/cloud-builders/docker'
        args: ['push', 'gcr.io/$PROJECT_ID/silentear-backend']
      - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
        entrypoint: gcloud
        args: [
          'run', 'deploy', 'silentear-backend',
          '--image=gcr.io/$PROJECT_ID/silentear-backend',
          # …additional flags…
        ]

A single gcloud builds submit command builds the Docker image and deploys it to Cloud Run, with no manual steps.
SilentEar – Not Just a Demo
SilentEar is a production‑ready app built for real Deaf users, featuring:
- Customizable Triggers – Users define their own alert words (doorbell, fire, baby, their name) with unique vibration patterns and colors.
- Sign Language Videos – Alerts can include ASL, BSL, or PSL sign‑language video demonstrations.
- SignMoji – A companion sign‑language library where users can record, search, or link sign videos with AI‑generated icons, synced across devices.
- Voice Deck – A text‑to‑speech tool with AI‑powered phrase suggestions, letting Deaf users “speak” through their device.
- Caregiver Dashboard – Family members monitor alerts in real time via Supabase real‑time subscriptions.
- Offline Mode – Falls back to the browser Speech Recognition API when the cloud isn’t available.
- Multi‑Language – Supports 10 languages for transcript processing.
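
In offline mode there is no model available to call trigger_alert, so the fallback presumably has to compare recognized speech against the user's alert words locally. The post does not say how this is done. One simple way, sketched here with hypothetical names (KeywordTrigger, normalize, matchTriggers), is whole-word matching on a normalized transcript:

```typescript
// Hypothetical sketch: names and shapes are illustrative, not SilentEar's actual code.

interface KeywordTrigger {
  id: string;
  keywords: string[]; // user-defined alert words or short phrases
}

/** Lowercase and collapse whitespace and common punctuation so "Fire!" equals "fire". */
function normalize(text: string): string {
  return text.toLowerCase().replace(/[\s.,!?;:"'()]+/g, ' ').trim();
}

/** IDs of every trigger with a keyword that appears as a whole word or phrase. */
function matchTriggers(transcript: string, triggers: KeywordTrigger[]): string[] {
  // Pad with spaces so a keyword only matches on word boundaries.
  const haystack = ' ' + normalize(transcript) + ' ';
  return triggers
    .filter((t) =>
      t.keywords.some((k) => {
        const needle = normalize(k);
        return needle.length > 0 && haystack.indexOf(' ' + needle + ' ') !== -1;
      })
    )
    .map((t) => t.id);
}
```

The transcript argument would come from the browser's speech recognition results. Matching on word boundaries keeps "fireplace" from triggering a "fire" alert, although unlike the Gemini path this approach cannot judge context.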
“I’m especially proud of how seamless the SignMoji integration feels. Allowing users to instantly search the web, record their own sign‑language videos, and sync them securely into their trigger system makes the platform deeply personal and culturally meaningful. Achieving ultra‑low latency alerts through Gemini Live function calling also feels transformative in real‑world testing.”
Technical Highlights & Learnings
- Web Audio API & Real‑Time Streaming – Gained deep experience with the Web Audio API and the constraints of real‑time streaming in modern browsers.
- Accessibility‑First Development – Learned the nuance of Deaf culture: transcription alone is insufficient; combining environmental intelligence, visual signals, haptics, and sign language is essential for true inclusion.
Challenges Overcome
- WebSocket Session Management on Cloud Run – Ensured stable, long‑lived connections despite Cloud Run’s request‑based scaling.
- Audio Format Compatibility – The browser captures audio as Float32 PCM, while the Gemini Live API expects raw 16‑bit PCM input (16 kHz, little‑endian). I implemented a real‑time PCM encoder that converts and chunks the audio for streaming.
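
The conversion described in the audio format item can be sketched as follows. This is not the project's actual encoder. The function names and the 100 ms chunk size are assumptions, and resampling from the browser's native rate (often 44.1 or 48 kHz) down to 16 kHz is left out.

```typescript
// Hypothetical sketch: function names and chunk size are assumptions.

/** Convert Web Audio samples in [-1, 1] to signed 16-bit PCM. */
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative and positive halves scale differently: the int16 range is [-32768, 32767].
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

/** Split PCM into fixed-size views (no copying); the last chunk may be shorter. */
function chunkPCM(pcm: Int16Array, samplesPerChunk: number): Int16Array[] {
  if (samplesPerChunk <= 0) throw new RangeError('samplesPerChunk must be positive');
  const chunks: Int16Array[] = [];
  for (let start = 0; start < pcm.length; start += samplesPerChunk) {
    chunks.push(pcm.subarray(start, start + samplesPerChunk));
  }
  return chunks;
}

// At 16 kHz, 100 ms of audio is 1600 samples.
const SAMPLES_PER_CHUNK = 1600;
```

Each chunk from `chunkPCM(floatTo16BitPCM(frame), SAMPLES_PER_CHUNK)` could then be sent over the WebSocket as a binary message. Int16Array uses the platform's byte order, which is little-endian on mainstream hardware.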