How we shaved AI voice latency to <800ms (and what we'd do differently)
A founder-engineer's post-mortem from the WebCallHub.ai team. The breakdown of every millisecond. Why we use Deepgram + Whisper as primary/fallback. Multi-model LLM routing. The optimizations that didn't work.
Why <800ms matters
Below ~1 second, AI voice starts to feel like a conversation. Above ~1.5 seconds, it feels like talking to someone on a bad WiFi connection: you wait, then they wait, then it goes wrong. Research at MIT and Bell Labs in the 1970s established that the gap between turns in human conversation averages about 200ms. Anything above 800ms feels noticeably "robotic."
So 800ms isn't a marketing number. It's the threshold at which an AI agent can pass the "feels human" test.
The latency budget
Here's our actual end-to-end breakdown for a single utterance (visitor speaks → AI responds with first audio byte):
| Stage | Budget | Notes |
|---|---|---|
| Network RTT (visitor → our edge) | ~80ms | Geo-distributed TURN |
| VAD + end-of-speech detection | ~50ms | WebRTC native VAD + custom silence detector |
| STT (streaming, partial → final) | ~180ms | Deepgram Nova-2 / Whisper-large fallback |
| LLM time-to-first-token (TTFT) | ~250ms | Claude Sonnet usually; varies per model |
| TTS first-byte (streaming) | ~120ms | ElevenLabs Turbo (sentence-level) |
| Audio buffer + playback start | ~80ms | jitter buffer minimum |
| Total to first audio byte | ~760ms | Median of 1000 test calls |
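As a quick sanity check, the stage budgets in the table can be summed and compared against the 800ms target. The numbers below are copied from the table; the key names and helper functions are labels invented for this sketch, not identifiers from the production system.

```python
# Stage budgets in milliseconds, copied from the table above.
# Key names are illustrative labels, not real identifiers.
LATENCY_BUDGET_MS = {
    "network_rtt": 80,
    "vad_end_of_speech": 50,
    "stt_streaming": 180,
    "llm_ttft": 250,
    "tts_first_byte": 120,
    "audio_buffer": 80,
}

TARGET_MS = 800


def total_budget_ms(budget: dict[str, int]) -> int:
    """Sum the per-stage budgets into an end-to-end figure."""
    return sum(budget.values())


def headroom_ms(budget: dict[str, int], target_ms: int = TARGET_MS) -> int:
    """Return how many milliseconds remain before the target is exceeded."""
    return target_ms - total_budget_ms(budget)


def largest_stage(budget: dict[str, int]) -> str:
    """Name the stage that consumes the most of the budget."""
    return max(budget, key=budget.get)
```

The stages sum to 760ms, which leaves 40ms of headroom under the target, and LLM time-to-first-token is the largest single line item.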
STT: why we use Deepgram + Whisper (both)
Deepgram is the primary. Their Nova-2 streaming model gives partial results in <200ms, with interim transcripts updating every 50-100ms. We don't wait for the final result — we feed partials into the LLM as soon as we have a coherent sentence boundary.
The tradeoff: Deepgram is fast but their Finnish/Polish accuracy is noticeably lower than Whisper-large. So we route per-language:
# Pseudo-code from our STT router
def route_stt(audio_stream, detected_language):
    if detected_language in DEEPGRAM_STRONG_LANGUAGES:
        return DeepgramStream(audio_stream, model="nova-2")
    elif detected_language in WHISPER_BETTER_LANGUAGES:
        # Self-hosted Whisper on Modal GPUs
        return WhisperStream(audio_stream, model="large-v3")
    else:
        return DeepgramStream(audio_stream, model="nova-2-general")
Self-hosting Whisper isn't free (~$0.30/hr per concurrent stream on Modal A100s), but Finnish accuracy is night-and-day vs Deepgram for our Nordic customers.
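A self-contained version of the routing idea might look like the sketch below. It returns a description of the chosen backend rather than a live stream object. The `SttBackend` record, the function name, and the contents of the Deepgram language set are assumptions made for illustration; only Finnish and Polish are named in this post as Whisper-favoured languages.

```python
from dataclasses import dataclass

# Hypothetical language sets. Only Finnish ("fi") and Polish ("pl") are
# named in the post; the Deepgram list here is an assumption.
DEEPGRAM_STRONG_LANGUAGES = frozenset({"en", "es", "de", "fr"})
WHISPER_BETTER_LANGUAGES = frozenset({"fi", "pl"})


@dataclass(frozen=True)
class SttBackend:
    """Describes which provider and model should handle a stream."""
    provider: str
    model: str


def choose_stt_backend(detected_language: str) -> SttBackend:
    """Pick an STT backend from a language code such as "en" or "fi-FI"."""
    # Normalise "fi-FI" or "FI" to the bare lowercase language code.
    lang = detected_language.split("-")[0].lower()
    if lang in DEEPGRAM_STRONG_LANGUAGES:
        return SttBackend("deepgram", "nova-2")
    if lang in WHISPER_BETTER_LANGUAGES:
        return SttBackend("whisper", "large-v3")
    # Unlisted language: fall back to the general-purpose Deepgram model.
    return SttBackend("deepgram", "nova-2-general")
```

Keeping the decision separate from stream construction makes the routing table easy to unit-test without opening any audio connections.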
The end-of-speech problem
The hardest part isn't transcribing speech — it's knowing when the user stopped talking. Get it wrong and the AI either (a) interrupts the user mid-sentence, or (b) waits awkwardly for ~2 seconds.
Our approach: dual-signal end-of-speech detection.
- VAD signal: WebRTC native VAD reports voice-off events at ~200ms granularity. We start a 400ms silence timer.
- LLM signal: We also feed the partial STT into a small classifier (fine-tuned BERT, ~40M params, runs in 20ms) that predicts "is this utterance complete?" based on syntactic completeness, intonation markers from STT, etc.
If either signal fires high-confidence "user done," we cut the silence timer and start LLM inference immediately. This shaves about 200-300ms off the conversational gap without introducing false-positive interruptions.
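One reading of this rule, sketched as a pure function. The 400ms silence timer comes from the description above. The 0.9 confidence threshold, the `TurnState` fields, and the choice to let the classifier shortcut apply only after the VAD has reported voice-off are all assumptions for this example, not tuned production values.

```python
from dataclasses import dataclass

SILENCE_TIMEOUT_MS = 400      # silence timer from the description above
COMPLETE_THRESHOLD = 0.9      # assumed cut-off for "high confidence"


@dataclass
class TurnState:
    voice_active: bool        # latest VAD reading: is the user speaking?
    silence_ms: int           # time elapsed since the VAD voice-off event
    p_complete: float         # classifier probability the utterance is done


def should_start_llm(state: TurnState) -> bool:
    """Return True when the agent should stop listening and respond."""
    if state.voice_active:
        # Never cut in while the VAD still hears speech.
        return False
    if state.p_complete >= COMPLETE_THRESHOLD:
        # Classifier is confident: skip the rest of the silence timer.
        return True
    # Otherwise wait out the full silence timer.
    return state.silence_ms >= SILENCE_TIMEOUT_MS
```

With these values, a confident classifier 100ms into the silence saves the remaining 300ms of the timer, which is in line with the 200-300ms saving quoted above.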
LLM: multi-model routing
We don't use one LLM. We use three, routed per request:
| Model | TTFT | Use case |
|---|---|---|
| Claude Sonnet (Anthropic) | ~250ms | Nuanced conversations, complex reasoning, safety-sensitive |
| GPT-4o (OpenAI) | ~280ms | Tool use, function calling, structured output |
| Llama-3.1-70B (fine-tuned, self-hosted) | ~150ms | High-volume tier-1 support, FAQ answering |
The routing logic is deceptively complex. Simplified:
def route_llm(transcript, agent_config, conversation_history):
    # If this is a tool-calling conversation, use GPT-4o
    if agent_config.has_active_tools and detect_tool_intent(transcript):
        return OpenAIClient(model="gpt-4o", stream=True)
    # If it's a high-volume known-intent query (FAQ-bot territory), use Llama
    if intent_classifier.predict(transcript) in FAQ_INTENTS:
        return LlamaClient(model="llama-3.1-70b-finetuned", stream=True)
    # Default: Claude for everything else
    return AnthropicClient(model="claude-sonnet-3.5", stream=True)
Llama on our own GPUs is the cheapest and fastest option, and it handles ~60% of calls (FAQ-shaped queries). Claude is the default for everything that is neither a tool call nor an FAQ.
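To make the precedence explicit (tool calls first, then FAQ intents, then the default), here is a runnable sketch that returns a model label instead of a client object. The two classifiers are passed in as plain callables. `toy_tool_intent` and `toy_intent` are keyword-matching stand-ins invented for this example, and the intent names are hypothetical.

```python
from typing import Callable, Collection


def choose_llm(
    transcript: str,
    has_active_tools: bool,
    detect_tool_intent: Callable[[str], bool],
    predict_intent: Callable[[str], str],
    faq_intents: Collection[str],
) -> str:
    """Return a label for the model that should handle this turn."""
    # 1. Tool-calling turns go to GPT-4o.
    if has_active_tools and detect_tool_intent(transcript):
        return "gpt-4o"
    # 2. Known FAQ intents go to the self-hosted Llama.
    if predict_intent(transcript) in faq_intents:
        return "llama-3.1-70b-finetuned"
    # 3. Everything else goes to Claude.
    return "claude-sonnet"


# Toy stand-ins for the real classifiers (keyword matching only).
def toy_tool_intent(text: str) -> bool:
    return "book" in text.lower()


def toy_intent(text: str) -> str:
    return "opening_hours" if "hours" in text.lower() else "other"


FAQ_INTENTS = {"opening_hours"}
```

Because the tool check runs first, an FAQ-shaped question that also triggers tool intent is routed to GPT-4o rather than Llama.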
TTS: sentence-level streaming
The naive approach: wait for the full LLM response, send it to TTS, get audio back, play it. That adds ~500ms of waiting (the LLM finishing its response).
Our approach: stream LLM tokens → segment into sentences → send each sentence to ElevenLabs as soon as a sentence boundary lands → start playback on the visitor's browser as soon as the first audio chunk arrives.
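import asyncio
async def stream_to_user(llm_stream):
    sentence_buffer = ""
    pending = []  # hold task references so they are not garbage-collected
    async for token in llm_stream:
        sentence_buffer += token
        if has_sentence_boundary(sentence_buffer):
            # Start TTS for this sentence without blocking the token loop
            pending.append(asyncio.create_task(tts_and_play(sentence_buffer)))
            sentence_buffer = ""
    # Flush trailing text that never reached a sentence boundary
    if sentence_buffer.strip():
        pending.append(asyncio.create_task(tts_and_play(sentence_buffer)))
    await asyncio.gather(*pending)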
This means the user hears the first sentence while the LLM is still generating the second. It is a critical optimization: it saves 300-500ms of perceived latency.
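A fuller sketch of sentence-level streaming follows, under two assumptions that are not spelled out above: sentences must be played in order, and LLM tokens usually carry their leading space (" How"), so a boundary is only visible once the next token arrives. `split_sentences`, `speak_in_order`, and `synthesize_and_play` are hypothetical names. The regex is a deliberately naive splitter that will mis-split abbreviations such as "Dr. Smith".

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable

# Split on whitespace that follows ., ! or ?. A decimal like "3.5" is
# left intact because no whitespace follows the dot.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(buffer: str) -> tuple[list[str], str]:
    """Return (complete sentences, unfinished remainder)."""
    parts = _BOUNDARY.split(buffer)
    return parts[:-1], parts[-1]


async def speak_in_order(
    llm_stream: AsyncIterator[str],
    synthesize_and_play: Callable[[str], Awaitable[None]],
) -> None:
    """Queue sentences as they complete; a single consumer keeps order."""
    queue: asyncio.Queue = asyncio.Queue()

    async def player() -> None:
        while True:
            sentence = await queue.get()
            if sentence is None:  # sentinel: the stream has ended
                return
            await synthesize_and_play(sentence)

    player_task = asyncio.create_task(player())
    buffer = ""
    async for token in llm_stream:
        buffer += token
        sentences, buffer = split_sentences(buffer)
        for sentence in sentences:
            queue.put_nowait(sentence.strip())
    # Flush whatever is left when the LLM stops generating.
    if buffer.strip():
        queue.put_nowait(buffer.strip())
    queue.put_nowait(None)
    await player_task
```

The single consumer task trades a little parallelism for guaranteed ordering: sentence two is never spoken before sentence one, even if its synthesis would have finished sooner.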
Network: geo-distributed TURN
WebRTC requires NAT traversal. Most setups put a single TURN server in one region. If the visitor is in Singapore and your TURN is in Frankfurt, that's +200ms round-trip just for the relay.
We run coturn in 6 regions (us-east, us-west, eu-west, eu-north, ap-south, ap-southeast) behind anycast routing, so the visitor's browser connects to whichever TURN server is closest. Median visitor RTT to TURN is 30ms; the full network leg, visitor to our edge, accounts for the ~80ms line in the latency budget.
What didn't work
1. Whisper as primary
We tried Whisper-large as the primary STT for 2 months. The accuracy was great, but the latency was inconsistent (200ms to 800ms depending on GPU load). For real-time voice, predictable beats best. Switched back to Deepgram primary.
2. Caching LLM responses
We tried caching common Q&A responses (e.g., "what are your hours?"). Cache hit gave 0ms LLM time — incredible. But the caching layer introduced its own ~50ms lookup latency, and cache hits were <10% in practice (people phrase questions in too many ways). Removed it.
3. Smaller LLMs for everything
We thought we could route 80% of traffic to a fine-tuned 7B model for speed. The 7B's answers were noticeably worse — repetitive, less context-aware. Customers noticed. Backed off to 70B for most cases.
4. Skipping the VAD
We tried using LLM-only end-of-speech detection (skip the VAD signal entirely). Saved 50ms when it worked. But interruption false-positives went up 5x — the AI would start talking over the user. VAD stays.
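The arithmetic behind item 2 is worth spelling out. The sketch below assumes every request pays the ~50ms lookup, a hit costs nothing further, and a miss goes on to the ~250ms Claude TTFT from the budget table. These are simplifications made for the example, and the function names are hypothetical.

```python
def expected_llm_stage_ms(hit_rate: float, lookup_ms: float, ttft_ms: float) -> float:
    """Expected LLM-stage latency when every request pays the lookup."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits cost only the lookup; misses cost the lookup plus the TTFT.
    return lookup_ms + (1.0 - hit_rate) * ttft_ms


def break_even_hit_rate(lookup_ms: float, ttft_ms: float) -> float:
    """Hit rate at which the cache neither helps nor hurts."""
    return lookup_ms / ttft_ms
```

Under these assumptions, a 10% hit rate gives an expected 275ms against 250ms with no cache at all, and the break-even hit rate is 20%. With real hit rates below 10%, the cache was a net loss on average.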
What we'd do differently if starting today
- Build on a unified voice AI SDK from day one instead of stitching Deepgram + LLM + ElevenLabs together ourselves. Projects like LiveKit and Pipecat now do this and would have saved us 6 months of plumbing.
- Spend more on the interruption-handling problem. We got to "acceptable" but not "great." It's still the most reported pain point.
- Bet earlier on multilingual TTS. We waited too long on Finnish/Polish voice quality. Should've sourced custom voice training earlier.
- Build observability before optimization. We optimized blind for the first 3 months. Adding Honeycomb-style per-call latency traces transformed our debugging.
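On the observability point, a per-call latency trace does not need much machinery to get started. The sketch below times each pipeline stage with a context manager. `CallTrace` and the stage names are hypothetical, and a real deployment would export these timings to a tracing backend rather than hold them in memory.

```python
import time
from contextlib import contextmanager
from typing import Iterator


class CallTrace:
    """Collects per-stage wall-clock timings for one call."""

    def __init__(self, call_id: str) -> None:
        self.call_id = call_id
        self.stages_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # Accumulate so a stage entered twice reports its total time.
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self) -> float:
        """Sum of all recorded stage timings."""
        return sum(self.stages_ms.values())

    def slowest_stage(self) -> str:
        """Name of the stage with the largest recorded time."""
        return max(self.stages_ms, key=self.stages_ms.get)
```

Usage is a matter of wrapping each stage, for example `with trace.stage("stt"): ...`, and then logging `trace.stages_ms` at the end of the call.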
What's next
We're targeting <500ms by Q3 2026. The biggest remaining win: streaming LLM-to-TTS-to-WebRTC as one pipeline rather than three sequential stages. Early experiments show 250ms achievable with custom inference infra.
If you've worked on real-time voice AI and want to compare notes, DM me on Twitter or email [email protected]. There are very few of us building browser-first AI voice and the community would benefit from more sharing.
Try the <800ms experience
60-second live demo. No signup. Just click and talk.