Don't Build Just Another Chatbot: Architecting a 'Duolingo-Style' AI Companion with Rive
Source: Dev.to
We are drowning in “AI Wrappers.” If you are building an AI language tutor, a role‑play app, or a mental‑health companion, you have a problem: text interfaces are boring.
The apps winning the race right now (like Duolingo’s Lily or character.ai) aren’t just outputting tokens; they are rendering performance.
As a Rive animator who specializes in AI interactions, I’ve seen the backend of many of these projects. The difference between a “toy” app and a “product” usually comes down to one thing: the lip‑sync architecture.
In this post I’ll break down the technical setup required to build a reactive, lip‑syncing AI character using Rive, moving beyond simple volume‑bouncing to phoneme‑accurate speech.
The Architecture: Puppet vs. Puppeteer
To build a character that feels alive, separate concerns:
- The Puppet (Rive) – a state machine that handles morphing shapes based on numeric inputs.
- The Puppeteer (Your Code) – React/Flutter/Swift logic that parses audio and sends signals to the puppet.
Level 1: The “Muppet” Method (Amplitude)
The fast way. If you need an MVP tomorrow, start here: analyze the Root Mean Square (RMS) of the audio amplitude.
Rive setup: a 1‑D blend state. Input 0 = mouth closed, input 100 = mouth wide open.
```js
// Example (pseudo-code): drive the blend input directly from loudness
riveInput.value = normalizedVolume;
```
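A slightly fuller sketch of that loop, using the browser's Web Audio API; the analyser node, the mouthInput handle, and the scaling factor are assumptions standing in for your own audio pipeline and artboard.

```js
// Minimal amplitude loop. Assumes `analyser` is a Web Audio AnalyserNode fed
// by the TTS playback, and `mouthInput` is the 0-100 Rive blend input above.
const samples = new Float32Array(analyser.fftSize);

function tick() {
  analyser.getFloatTimeDomainData(samples);

  // Root Mean Square of the current frame = rough loudness
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);

  // Speech RMS rarely exceeds ~0.3, so scale up and clamp to the blend range
  mouthInput.value = Math.min(100, rms * 300);

  requestAnimationFrame(tick);
}
requestAnimationFrame(tick);
```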
Problem: it looks like a Muppet. The character opens its mouth equally wide for "OO" and "EE" sounds, so the motion has no nuance.
Level 2: The “Viseme” Method (Phonetic Mapping)
The Duolingo way. Stop using volume; use visemes—the visual equivalents of phonemes. Many TTS providers (Azure Speech SDK, AWS Polly) return viseme events—integers that describe mouth shape at a specific timestamp.
The Rive State Machine
Instead of a single “Mouth Open” blend, build a state machine with ~12‑15 discrete mouth shapes, e.g.:
| Viseme | Description |
|---|---|
| Sil | Silence / idle |
| PP | Lips pressed – P, B, M |
| FF | Teeth on lip – F, V |
| TH | Tongue out – TH |
| DD | Tongue behind teeth – T, D, S |
| kk | Open back – K, G |
| aa | Wide – A |
| O | Round – O |
| … | (and so on) |
Map these to a Number Input called viseme_id.
The Code Logic
In your frontend (React Native, Flutter, etc.), listen for viseme events and push them to Rive:
```js
ttsService.on('visemeReceived', (visemeID) => {
  // 1. Get the Rive input
  const mouthInput = riveArtboard.findInput('viseme_id');

  // 2. Map the TTS provider's ID to your Rive ID
  //    (Azure defines 22 viseme IDs, 0-21; Rive might only need 12)
  const mappedID = mapAzureToRive(visemeID);

  // 3. Update the state
  mouthInput.value = mappedID;
});
```
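In practice, mapAzureToRive can be a plain lookup table. Here is a sketch that assumes the Rive artboard exposes the shapes from the table above as IDs 0–7; the Azure ID groupings in the comments are illustrative, so verify them against your provider's viseme documentation before shipping.

```js
// Illustrative IDs for the Rive mouth shapes listed in the table above
const RIVE_VISEMES = { SIL: 0, PP: 1, FF: 2, TH: 3, DD: 4, KK: 5, AA: 6, O: 7 };

// Collapse the provider's larger viseme set onto the artboard's smaller one.
// The Azure groupings below are examples only; check the official viseme table.
const AZURE_TO_RIVE = {
  0: RIVE_VISEMES.SIL,  // silence
  21: RIVE_VISEMES.PP,  // p, b, m
  18: RIVE_VISEMES.FF,  // f, v
  17: RIVE_VISEMES.TH,  // th
  19: RIVE_VISEMES.DD,  // t, d, n
  20: RIVE_VISEMES.KK,  // k, g
  2: RIVE_VISEMES.AA,   // open "ah"
  8: RIVE_VISEMES.O,    // round "oh"
};

function mapAzureToRive(visemeID) {
  // Anything we don't model falls back to the neutral / idle mouth
  return AZURE_TO_RIVE[visemeID] ?? RIVE_VISEMES.SIL;
}
```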
The Secret: Layered Micro‑Behaviors
Lip sync is only ~50% of the illusion. If the character stares unblinkingly while talking, it falls into the uncanny valley.
Solution: Use layered state machines in Rive so multiple timelines play simultaneously without conflict.
- Layer 1 – Mouth (controlled by code).
- Layer 2 – Eyes (self‑contained loop). A “Randomize” listener inside Rive triggers a blink or eye‑dart every 2–5 seconds automatically.
- Layer 3 – Emotions (boolean inputs such as isBored, isHappy, isThinking), toggled from your code (see the wiring sketch below).
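With the official @rive-app/canvas runtime, the mouth input and the emotion booleans live on the same loaded state machine and can be fetched once at load time. A minimal wiring sketch; the file path, state machine name, and input names are assumptions:

```js
import { Rive } from '@rive-app/canvas';

// Assumed file, state machine, and input names -- swap in your own.
const avatar = new Rive({
  src: 'companion.riv',
  canvas: document.getElementById('avatar-canvas'),
  stateMachines: 'Character',
  autoplay: true,
  onLoad: () => {
    const inputs = avatar.stateMachineInputs('Character');
    const byName = (name) => inputs.find((input) => input.name === name);

    const visemeInput = byName('viseme_id');  // Layer 1: written to by the viseme listener above
    const isThinking  = byName('isThinking'); // Layer 3: emotion booleans
    const isHappy     = byName('isHappy');

    // Layer 2 (blinks / eye darts) needs no code: it loops inside the .riv file.
    isHappy.value = true;
  },
});
```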
Handling “The Pause” (Latency)
The biggest UX killer in AI voice chat is the 2–3 seconds of silence while the LLM generates an answer. The character must not freeze.
- User stops talking → the app sets isThinking = true.
- Rive animation – the character looks up, taps a finger, or (for a sarcastic persona) rolls its eyes.
- Audio stream starts → set isThinking = false; viseme data resumes flowing (see the sketch below).
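Putting it together, the toggle can simply bracket the LLM and TTS round trip. A sketch, where llm.generate and tts.speak are placeholders for whatever clients you use and isThinking is the boolean input fetched earlier:

```js
// Hypothetical llm / tts clients; the point is only when the boolean flips.
async function respond(userUtterance) {
  isThinking.value = true;                        // character enters its "thinking" loop
  try {
    const replyText = await llm.generate(userUtterance);
    const playback  = await tts.speak(replyText); // starts audio + viseme events
    isThinking.value = false;                     // visemes take over from here
    await playback.finished;
  } finally {
    isThinking.value = false;                     // never leave the character stuck thinking
  }
}
```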