Engineering May 22, 2026 · 12 min read

How we shaved AI voice latency to <800ms (and what we'd do differently)

A founder-engineer's post-mortem from the WebCallHub.ai team. The breakdown of every millisecond. Why we use Deepgram + Whisper as primary/fallback. Multi-model LLM routing. The optimizations that didn't work.

Why <800ms matters

Below ~1 second, AI voice starts to feel like a conversation. Above ~1.5 seconds, it feels like talking to someone on a bad WiFi connection: you wait, then they wait, then it goes wrong. Research at MIT and Bell Labs in the 1970s established that the gap between turns in human conversation averages about 200ms. Anything above 800ms feels noticeably "robotic."

So 800ms isn't a marketing number. It's the threshold at which an AI agent can pass the "feels human" test.

The latency budget

Here's our actual end-to-end breakdown for a single utterance (visitor speaks → AI responds with first audio byte):

Stage	Budget	Notes
Network RTT (visitor → our edge)	~80ms	Geo-distributed TURN
VAD + end-of-speech detection	~50ms	WebRTC native VAD + custom silence detector
STT (streaming, partial → final)	~180ms	Deepgram Nova-2 / Whisper-large fallback
LLM time-to-first-token (TTFT)	~250ms	Claude Sonnet usually; varies per model
TTS first-byte (streaming)	~120ms	ElevenLabs Turbo (sentence-level)
Audio buffer + playback start	~80ms	jitter buffer minimum
Total to first audio byte	~760ms	Median of 1000 test calls
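
As a quick sanity check, the stage budgets in the table can be summed in a few lines of Python. This is only an illustrative sketch: the stage keys and the `TARGET_MS` constant are names made up for this example, not identifiers from the production system, and a sum of per-stage figures will only approximate a measured end-to-end median.

```python
# Per-stage budgets in milliseconds, copied from the table above.
LATENCY_BUDGET_MS = {
    "network_rtt": 80,
    "vad_end_of_speech": 50,
    "stt_streaming": 180,
    "llm_ttft": 250,
    "tts_first_byte": 120,
    "audio_buffer": 80,
}
TARGET_MS = 800  # the headline target; the constant name is illustrative


def total_budget_ms(budget: dict[str, int]) -> int:
    """Sum the per-stage budgets."""
    return sum(budget.values())


def largest_stage(budget: dict[str, int]) -> str:
    """Return the stage that consumes the most time."""
    return max(budget, key=budget.get)


if __name__ == "__main__":
    total = total_budget_ms(LATENCY_BUDGET_MS)
    print(f"total={total}ms, headroom={TARGET_MS - total}ms, "
          f"largest={largest_stage(LATENCY_BUDGET_MS)}")
```

The stages sum to 760ms, which leaves 40ms of headroom under the 800ms target, and the LLM's time-to-first-token is the single largest line item.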

STT: why we use Deepgram + Whisper (both)

Deepgram is the primary. Their Nova-2 streaming model gives partial results in <200ms, with interim transcripts updating every 50-100ms. We don't wait for the final result — we feed partials into the LLM as soon as we have a coherent sentence boundary.

The tradeoff: Deepgram is fast but their Finnish/Polish accuracy is noticeably lower than Whisper-large. So we route per-language:

# Pseudo-code from our STT router
def route_stt(audio_stream, detected_language):
    if detected_language in DEEPGRAM_STRONG_LANGUAGES:
        return DeepgramStream(audio_stream, model="nova-2")
    elif detected_language in WHISPER_BETTER_LANGUAGES:
        # Self-hosted Whisper on Modal GPUs
        return WhisperStream(audio_stream, model="large-v3")
    else:
        return DeepgramStream(audio_stream, model="nova-2-general")

Self-hosting Whisper isn't free (~$0.30/hr per concurrent stream on Modal A100s), but Finnish accuracy is night-and-day vs Deepgram for our Nordic customers.

The end-of-speech problem

The hardest part isn't transcribing speech — it's knowing when the user stopped talking. Get it wrong and the AI either (a) interrupts the user mid-sentence, or (b) waits awkwardly for ~2 seconds.

Our approach: dual-signal end-of-speech detection.

VAD signal: WebRTC native VAD reports voice-off events at ~200ms granularity. We start a 400ms silence timer.
LLM signal: We also feed the partial STT into a small classifier (fine-tuned BERT, ~40M params, runs in 20ms) that predicts "is this utterance complete?" based on syntactic completeness, intonation markers from STT, etc.

If either signal fires high-confidence "user done," we cut the silence timer and start LLM inference immediately. This shaves about 200-300ms off the conversational gap without introducing false-positive interruptions.
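
One way to read the dual-signal logic is sketched below. It is a simplification under stated assumptions, not our production code: `TurnState`, `user_is_done`, and the 0.9 confidence threshold are hypothetical, and the sketch assumes the classifier can only shorten the silence timer once the VAD has already reported voice-off. The 400ms timer is the value quoted above.

```python
from dataclasses import dataclass

SILENCE_TIMER_MS = 400       # from the text: timer started on VAD voice-off
CLASSIFIER_THRESHOLD = 0.9   # assumed cut-off for "high confidence"


@dataclass
class TurnState:
    silence_ms: int       # time since the VAD reported voice-off (0 = still speaking)
    completeness: float   # classifier's P(utterance is complete), in [0, 1]


def user_is_done(state: TurnState) -> bool:
    """Decide whether to start LLM inference now."""
    if state.silence_ms <= 0:
        # The user is still speaking: never cut in, whatever the classifier says.
        return False
    if state.completeness >= CLASSIFIER_THRESHOLD:
        # The utterance looks complete: skip the rest of the silence timer.
        return True
    # Otherwise wait out the full silence timer.
    return state.silence_ms >= SILENCE_TIMER_MS
```

With these assumed values, `user_is_done(TurnState(silence_ms=200, completeness=0.95))` returns `True` 200ms before the timer would have expired, which is where a saving in the 200-300ms range would come from.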

LLM: multi-model routing

We don't use one LLM. We use three, routed per request:

Model	TTFT	Use case
Claude Sonnet (Anthropic)	~250ms	Nuanced conversations, complex reasoning, safety-sensitive
GPT-4o (OpenAI)	~280ms	Tool use, function calling, structured output
Llama-3.1-70B (fine-tuned, self-hosted)	~150ms	High-volume tier-1 support, FAQ answering

The routing logic is deceptively complex. Simplified:

def route_llm(transcript, agent_config, conversation_history):
    # If this is a tool-calling conversation, use GPT-4o
    if agent_config.has_active_tools and detect_tool_intent(transcript):
        return OpenAIClient(model="gpt-4o", stream=True)

    # If it's a high-volume known-intent query (FAQ-bot territory), use Llama
    if intent_classifier.predict(transcript) in FAQ_INTENTS:
        return LlamaClient(model="llama-3.1-70b-finetuned", stream=True)

    # Default: Claude for everything else
    return AnthropicClient(model="claude-sonnet-3.5", stream=True)

Llama on our own GPUs is the cheapest and fastest option, and it handles ~60% of calls (the FAQ-shaped queries). Claude is the default for everything else that doesn't involve tool calls.

TTS: sentence-level streaming

The naive approach: wait for the full LLM response, send it to TTS, get audio back, play it. That adds ~500ms of waiting (the LLM finishing its response).

Our approach: stream LLM tokens → segment into sentences → send each sentence to ElevenLabs as soon as a sentence boundary lands → start playback on the visitor's browser as soon as the first audio chunk arrives.

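import asyncio

async def stream_to_user(llm_stream):
    sentence_buffer = ""
    tasks = []  # hold references so pending tasks aren't garbage-collected
    async for token in llm_stream:
        sentence_buffer += token
        if has_sentence_boundary(sentence_buffer):
            # Start TTS for this sentence without blocking the token loop;
            # tts_and_play is assumed to queue playback in submission order
            tasks.append(asyncio.create_task(tts_and_play(sentence_buffer)))
            sentence_buffer = ""
    if sentence_buffer.strip():
        # Flush trailing text that never reached a sentence boundary
        tasks.append(asyncio.create_task(tts_and_play(sentence_buffer)))
    await asyncio.gather(*tasks)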

This means the user hears the first sentence while the LLM is still generating the second. It is a critical optimization: it saves 300-500ms of perceived latency.
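
The `has_sentence_boundary` helper is not shown above. A minimal sketch of what it could look like follows, together with a synchronous `segment_sentences` generator that makes the segmentation easy to test in isolation. Both the regex and the `segment_sentences` name are assumptions for illustration. A punctuation-only check like this one will mis-split on abbreviations ("Dr.") and on decimals that arrive across two tokens, so a production version would likely need more context.

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation, optionally followed by a closing quote or
# bracket, at the end of the buffer (trailing whitespace allowed).
_BOUNDARY = re.compile(r"[.!?][\"')\]]?\s*$")


def has_sentence_boundary(buffer: str) -> bool:
    """Return True if the buffered text currently ends a sentence."""
    return bool(_BOUNDARY.search(buffer))


def segment_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it completes, then flush any remainder."""
    buffer = ""
    for token in tokens:
        buffer += token
        if has_sentence_boundary(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    demo_tokens = ["We", " open", " at", " nine", ".", " Any", "thing", " else", "?"]
    for sentence in segment_sentences(demo_tokens):
        print(sentence)
```

Each yielded sentence is what would be handed to TTS, so the first one can be on its way to the speaker while later tokens are still arriving.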

Network: geo-distributed TURN

WebRTC requires NAT traversal. Most setups put a single TURN server in one region. If the visitor is in Singapore and your TURN is in Frankfurt, that's +200ms round-trip just for the relay.

We use Coturn deployed in 6 regions (us-east, us-west, eu-west, eu-north, ap-south, ap-southeast) with anycast routing, so the visitor's browser connects to whichever TURN server is closest. Median visitor RTT to TURN is 30ms; the full network leg is budgeted at ~80ms in the latency table above.
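
In our setup, anycast makes the "closest server" decision at the network layer. For readers without anycast, the same idea can be approximated by probing each region and picking the lowest round-trip time. The sketch below assumes the probe results have already been collected; `pick_turn_region` and the sample measurements are hypothetical, and only the region names come from the list above.

```python
# Region names from the deployment described above.
TURN_REGIONS = ("us-east", "us-west", "eu-west", "eu-north", "ap-south", "ap-southeast")


def pick_turn_region(rtt_ms: dict[str, float]) -> str:
    """Return the region with the lowest measured round-trip time."""
    if not rtt_ms:
        raise ValueError("no RTT measurements available")
    return min(rtt_ms, key=rtt_ms.get)


if __name__ == "__main__":
    # Made-up probe results for a visitor in Singapore, for illustration only.
    sample = {"eu-west": 190.0, "us-west": 175.0, "ap-south": 60.0, "ap-southeast": 12.0}
    print(pick_turn_region(sample))
```

The trade-off is that client-side probing costs a little time before the call starts, which anycast avoids.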

What didn't work

1. Whisper as primary

We tried Whisper-large as the primary STT for 2 months. The accuracy was great, but the latency was inconsistent (200ms to 800ms depending on GPU load). For real-time voice, predictable beats best. Switched back to Deepgram primary.

2. Caching LLM responses

We tried caching common Q&A responses (e.g., "what are your hours?"). A cache hit skipped LLM inference entirely, which looked incredible. But the caching layer added its own ~50ms lookup latency to every request, and cache hits were <10% in practice (people phrase questions in too many ways). We removed it.
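
The arithmetic behind that decision is worth spelling out. The sketch below assumes the lookup sits on the critical path before every LLM call, uses 10% as an optimistic hit rate (the real figure was below that), and takes the ~250ms Claude TTFT as the time a hit saves. The function names are illustrative.

```python
def cache_latency_delta_ms(hit_rate_pct: float, lookup_ms: float, llm_ttft_ms: float) -> float:
    """Average change in latency per request from adding the cache.

    Every request pays the lookup; only hits skip the LLM's time-to-first-token.
    A positive result means the cache makes the average request slower.
    """
    saved_on_hits = hit_rate_pct * llm_ttft_ms / 100
    return lookup_ms - saved_on_hits


def break_even_hit_rate_pct(lookup_ms: float, llm_ttft_ms: float) -> float:
    """Hit rate (in percent) at which the cache neither helps nor hurts."""
    return 100 * lookup_ms / llm_ttft_ms


if __name__ == "__main__":
    print(cache_latency_delta_ms(hit_rate_pct=10, lookup_ms=50, llm_ttft_ms=250))
    print(break_even_hit_rate_pct(lookup_ms=50, llm_ttft_ms=250))
```

Under these assumptions the cache adds about 25ms to the average request, and it would need a hit rate of roughly 20% just to break even.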

3. Smaller LLMs for everything

We thought we could route 80% of traffic to a fine-tuned 7B model for speed. The 7B's answers were noticeably worse — repetitive, less context-aware. Customers noticed. Backed off to 70B for most cases.

4. Skipping the VAD

We tried using LLM-only end-of-speech detection (skip the VAD signal entirely). Saved 50ms when it worked. But interruption false-positives went up 5x — the AI would start talking over the user. VAD stays.

What we'd do differently if starting today

Build on a unified voice AI SDK from day one instead of stitching Deepgram + LLM + ElevenLabs together ourselves. Projects like LiveKit and Pipecat now do this and would have saved us 6 months of plumbing.
Spend more on the interruption-handling problem. We got to "acceptable" but not "great." It's still the most reported pain point.
Bet earlier on multilingual TTS. We waited too long on Finnish/Polish voice quality. Should've sourced custom voice training earlier.
Build observability before optimization. We optimized blind for the first 3 months. Adding Honeycomb-style per-call latency traces transformed our debugging.
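
To make the observability point concrete, here is a minimal sketch of per-call stage timing using only the Python standard library. `CallTrace` and the stage names are hypothetical. A real setup would export these spans to a tracing backend rather than keep them in a dict.

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Collects wall-clock durations for the stages of one call turn."""

    def __init__(self, call_id: str):
        self.call_id = call_id
        self.stages_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.stages_ms[name] = (time.perf_counter() - start) * 1000

    def total_ms(self) -> float:
        return sum(self.stages_ms.values())


if __name__ == "__main__":
    trace = CallTrace("demo-call")
    with trace.stage("stt"):
        time.sleep(0.01)  # stand-in for real work
    with trace.stage("llm_ttft"):
        time.sleep(0.02)
    print(trace.stages_ms, trace.total_ms())
```

Wrapping each pipeline stage this way gives a per-call breakdown that can be compared directly against the latency budget.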

What's next

We're targeting <500ms by Q3 2026. The biggest remaining win: streaming LLM-to-TTS-to-WebRTC as one pipeline rather than three sequential stages. Early experiments show 250ms achievable with custom inference infra.

If you've worked on real-time voice AI and want to compare notes, DM me on Twitter or email [email protected]. There are very few of us building browser-first AI voice and the community would benefit from more sharing.
