Voice Agents with Vapi.ai — Latency, Barge-In & Grounding

The Voice Latency Problem

Voice conversations are unforgiving. Humans expect a response within 500–800ms, and anything slower feels broken. A traditional pipeline (transcription → LLM → TTS) can easily take 2–4 seconds without optimization. Vapi solves much of this, but you still need to understand the tradeoffs.

Keep round-trip time under 600ms for a natural feel. That means streaming TTS, streaming LLM output, and intelligent end-of-turn detection.
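The budget can be worked through as simple arithmetic. The stage names and millisecond figures below are illustrative assumptions, not measurements from Vapi; substitute numbers from your own traces.

```python
# Hypothetical per-stage latencies in milliseconds; measure your own pipeline.
BUDGET_MS = 600


def time_to_first_audio(stages: dict[str, int]) -> int:
    """Sum the stage latencies that sit on the critical path."""
    return sum(stages.values())


# Non-streaming: each stage waits for the previous one to finish completely.
batch = {
    "end_of_turn_wait": 700,
    "stt_final": 300,
    "llm_full_reply": 1500,
    "tts_full_audio": 800,
}

# Streaming: each stage only waits for the first usable chunk from upstream.
streaming = {
    "end_of_turn_wait": 200,
    "stt_final": 100,
    "llm_first_token": 200,
    "tts_first_chunk": 80,
}

for name, stages in (("batch", batch), ("streaming", streaming)):
    total = time_to_first_audio(stages)
    print(f"{name}: {total} ms, within budget: {total <= BUDGET_MS}")
```

With these assumed figures, the end-of-turn wait alone exceeds the budget in the batch case, which is why end-of-turn detection deserves as much tuning as the model and voice choices.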

Key Design Patterns

Barge-In Handling

Users interrupt. Always. Your agent needs to detect when the user starts speaking mid-response, immediately stop the TTS stream, and process the new input. Vapi handles this at the infrastructure level, but your dialog flows need to be designed to gracefully handle interruption at any point.
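One way to make a dialog flow interruption-safe is to track what the caller actually heard, so the next turn does not assume the full response was delivered. The sketch below is a simplified, hypothetical model (the `Turn` class is not part of Vapi) that works at sentence granularity; real audio can be cut mid-sentence.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    """Tracks one agent response so an interruption can be handled cleanly."""

    sentences: list[str]
    spoken: list[str] = field(default_factory=list)
    interrupted: bool = False

    def speak_next(self) -> Optional[str]:
        """Return the next sentence for TTS, or None when done or interrupted."""
        if self.interrupted or len(self.spoken) == len(self.sentences):
            return None
        sentence = self.sentences[len(self.spoken)]
        self.spoken.append(sentence)
        return sentence

    def barge_in(self) -> list[str]:
        """Mark the turn as interrupted and return the sentences never spoken."""
        self.interrupted = True
        return self.sentences[len(self.spoken):]


turn = Turn([
    "Your order shipped Monday.",
    "It should arrive Thursday.",
    "Is there anything else?",
])
turn.speak_next()          # first sentence goes out
unsaid = turn.barge_in()   # caller starts talking; stop and keep the remainder
```

Only `turn.spoken` should go into the conversation history passed to the LLM, and `unsaid` can be offered again if it is still relevant after the caller's new input.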

Intent Handling Over Exact Matching

Voice input is messy — background noise, accents, partial sentences. Don't rely on exact string matching. Use intent classification with fuzzy matching and always ask for confirmation before taking irreversible actions.
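As a minimal sketch of fuzzy intent matching, the example below scores a transcript against example phrases using Python's standard `difflib`. The intent names, phrases, and the 0.6 cutoff are assumptions for illustration; a production agent would more likely use a trained classifier or the LLM itself.

```python
import difflib

# Hypothetical intents and example phrases.
INTENT_EXAMPLES = {
    "book_appointment": ["book an appointment", "schedule a visit", "make a booking"],
    "cancel_appointment": ["cancel my appointment", "cancel the booking"],
    "check_hours": ["what are your hours", "when are you open"],
}


def classify(utterance: str, cutoff: float = 0.6):
    """Return (intent, score); intent is None when nothing clears the cutoff."""
    text = utterance.lower().strip()
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = difflib.SequenceMatcher(None, text, example).ratio()
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < cutoff:
        return None, best_score
    return best_intent, best_score


intent, score = classify("cancel my appointmen")  # truncated transcript
unknown, _ = classify("xyzzy plugh")              # nothing close enough
```

A `None` intent is the signal to ask a clarifying question rather than guess.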

Constrain Tool-Calls

Voice agents that can take actions (book appointments, update CRM, send emails) must be extremely conservative. Prefer idempotent, reversible operations wherever possible. Every action that has real-world consequences should require verbal confirmation from the user: "I'm going to book you for Tuesday at 3pm. Does that sound right?"
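The confirmation and idempotency requirements can be sketched as a small wrapper around the action. `BookingTool`, its in-memory store, and the list of affirmative replies are hypothetical stand-ins; a real tool would call your booking backend.

```python
import hashlib

# Replies treated as a clear "yes"; anything else is treated as not confirmed.
AFFIRMATIVE = {"yes", "yeah", "yep", "correct", "that's right", "sounds good"}


class BookingTool:
    """Hypothetical booking tool: requires confirmation and deduplicates retries."""

    def __init__(self):
        self.bookings = {}  # idempotency key -> booking record

    def _key(self, caller_id: str, slot: str) -> str:
        # Same caller and slot always map to the same key, so retries are safe.
        return hashlib.sha256(f"{caller_id}|{slot}".encode()).hexdigest()

    def prompt(self, slot: str) -> str:
        return f"I'm going to book you for {slot}. Does that sound right?"

    def book(self, caller_id: str, slot: str, user_reply: str) -> dict:
        if user_reply.lower().strip(" .!") not in AFFIRMATIVE:
            return {"status": "not_confirmed"}
        key = self._key(caller_id, slot)
        if key in self.bookings:
            return {"status": "already_booked", **self.bookings[key]}
        self.bookings[key] = {"caller_id": caller_id, "slot": slot}
        return {"status": "booked", **self.bookings[key]}

    def cancel(self, caller_id: str, slot: str) -> bool:
        """Reverse a booking; returns True if something was actually cancelled."""
        return self.bookings.pop(self._key(caller_id, slot), None) is not None


tool = BookingTool()
first = tool.book("caller-1", "Tuesday at 3pm", "Yes.")
retry = tool.book("caller-1", "Tuesday at 3pm", "yes")  # e.g. a duplicated tool call
```

Treating any unclear reply as "not confirmed" errs on the side of asking again, which costs a few seconds but avoids acting on a misheard answer.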

Grounding Responses

Ground every factual claim to a trusted source. If your voice agent answers questions about pricing, policies, or availability, it must pull from a live data source — not from the model's training data, which may be outdated.
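A minimal sketch of grounded answering follows: the agent only states a price it can read from a data source, and hands off when the record is missing or stale. The `PRICING_SOURCE` dictionary stands in for a live API or database, and the plan names, prices, and one-day freshness limit are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for a live pricing API or database; all values are invented.
PRICING_SOURCE = {
    "basic": {"price_usd": 29, "updated_at": datetime.now(timezone.utc)},
    "pro": {"price_usd": 99, "updated_at": datetime.now(timezone.utc) - timedelta(days=30)},
}
MAX_AGE = timedelta(days=1)


def answer_price(plan: str) -> str:
    """Answer only from the data source; never fall back to model memory."""
    record = PRICING_SOURCE.get(plan.lower())
    if record is None:
        return "I don't have pricing for that plan. Let me connect you with someone who does."
    if datetime.now(timezone.utc) - record["updated_at"] > MAX_AGE:
        return "I can't confirm the current price right now. Let me connect you with someone who can."
    return f"The {plan.lower()} plan is {record['price_usd']} dollars per month."
```

The same pattern applies to policies and availability: the function's return value is what the agent may say, and the model is instructed not to improvise beyond it.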

Vapi-Specific Configuration

STT model: Use whisper-large-v3 for accuracy, whisper-base for speed
TTS voice: Choose voices with low first-token latency. ElevenLabs voices are high quality but slower
Interruption sensitivity: Tune based on use case — support calls need high sensitivity, interviews need low
Max duration: Always set a max call duration to prevent runaway costs
Fallback: Configure a human transfer number for when the agent can't help
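The settings above can be captured in one place and checked before deployment. The sketch below is provider-neutral: the class and field names are illustrative assumptions and are not Vapi's actual configuration schema, so map them to the real fields from the Vapi documentation.

```python
from dataclasses import dataclass


@dataclass
class VoiceAgentConfig:
    """Illustrative settings only; these are not Vapi's actual field names."""

    stt_model: str
    tts_voice: str
    interruption_sensitivity: float  # 0.0 = hard to interrupt, 1.0 = very easy
    max_duration_seconds: int
    fallback_number: str

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; empty means the config looks sane."""
        issues = []
        if not 0.0 <= self.interruption_sensitivity <= 1.0:
            issues.append("interruption_sensitivity must be between 0 and 1")
        if self.max_duration_seconds <= 0:
            issues.append("max_duration_seconds must be positive")
        if not self.fallback_number.startswith("+"):
            issues.append("fallback_number should be a full international number")
        return issues


# Support line: easy to interrupt, 10-minute cap, human fallback set.
support = VoiceAgentConfig("whisper-large-v3", "example-voice", 0.8, 600, "+15550100")
# Misconfigured example: no duration cap and no fallback number.
broken = VoiceAgentConfig("whisper-base", "example-voice", 0.2, 0, "")
```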

Use Case Patterns

Appointment Booking

Healthcare, beauty, home services. Agent asks for preferred time, checks availability via tool call (e.g. Calendly API), proposes slots, confirms verbally, books and sends SMS confirmation.
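The "proposes slots" step can be sketched as picking the open times closest to what the caller asked for. The availability list here is hard-coded for illustration; in practice it would come from the calendar tool call.

```python
from datetime import datetime


def propose_slots(preferred: datetime, available: list[datetime], limit: int = 2) -> list[datetime]:
    """Return up to `limit` open slots closest to the caller's preferred time."""
    return sorted(available, key=lambda slot: abs(slot - preferred))[:limit]


preferred = datetime(2025, 1, 7, 15, 0)
available = [
    datetime(2025, 1, 7, 9, 0),
    datetime(2025, 1, 7, 14, 0),
    datetime(2025, 1, 7, 16, 30),
    datetime(2025, 1, 8, 15, 0),
]
options = propose_slots(preferred, available)
```

Offering two options rather than the full list keeps the spoken prompt short, which matters more in voice than in a visual interface.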

Inbound Sales Qualification

Ask the 5 qualification questions your sales team always asks. Score the lead. Route hot leads to a human immediately. Log everything to CRM. This can save about 20 minutes of rep time per inbound call.
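Scoring and routing can be as simple as a weighted checklist. The five questions, their weights, and the threshold of 70 below are placeholders; use the criteria your own sales team applies.

```python
# Hypothetical qualification questions and weights (they sum to 100).
WEIGHTS = {
    "has_budget": 30,
    "is_decision_maker": 25,
    "timeline_under_90_days": 20,
    "has_defined_need": 15,
    "fits_target_size": 10,
}
HOT_THRESHOLD = 70


def score_lead(answers: dict[str, bool]) -> int:
    """Add up the weights of every question answered 'yes'."""
    return sum(weight for question, weight in WEIGHTS.items() if answers.get(question))


def route(answers: dict[str, bool]) -> dict:
    """Decide whether to transfer the call now or just log the lead."""
    score = score_lead(answers)
    destination = "transfer_to_rep" if score >= HOT_THRESHOLD else "log_to_crm"
    return {"score": score, "route": destination}


hot = route({"has_budget": True, "is_decision_maker": True,
             "timeline_under_90_days": True, "has_defined_need": True})
cold = route({"has_defined_need": True})
```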

Customer Support Tier-1

Handle the top 10 most common support requests without human involvement. Escalate when confidence is low. Track escalation rate — if it's above 30%, expand the knowledge base.
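The escalation rule and the 30% check can be expressed in a few lines. The 0.7 confidence floor and the outcome labels are assumptions for illustration; only the 30% threshold comes from the guidance above.

```python
def should_escalate(confidence: float, min_confidence: float = 0.7) -> bool:
    """Hand off to a human when the agent's confidence is below the floor."""
    return confidence < min_confidence


def escalation_rate(outcomes: list[str]) -> float:
    """Fraction of calls that ended in a human handoff."""
    if not outcomes:
        return 0.0
    return outcomes.count("escalated") / len(outcomes)


def needs_kb_expansion(outcomes: list[str], threshold: float = 0.30) -> bool:
    """True when the escalation rate is above the threshold."""
    return escalation_rate(outcomes) > threshold


this_week = ["resolved"] * 6 + ["escalated"] * 4
last_week = ["resolved"] * 7 + ["escalated"] * 3
```

Here `this_week` (4 of 10 escalated) trips the check, while `last_week` (exactly 3 of 10) does not, because the rule is "above 30%".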

Production Checklist

✓ P95 latency under 700ms measured end-to-end
✓ Barge-in tested with concurrent speakers
✓ All tool-calls with real-world consequences require verbal confirmation
✓ Human fallback number configured
✓ Max call duration set
✓ Call recordings stored and reviewed weekly
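For the latency item, P95 can be computed from logged per-turn latencies with a nearest-rank percentile. This is a minimal sketch; the sample values are invented, and how you capture end-to-end timings depends on your own logging.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


TARGET_MS = 700

# Invented samples: two slow turns out of twenty are enough to fail the P95 target.
mostly_fast = [500.0] * 19 + [1200.0]
two_slow = [500.0] * 18 + [1200.0] * 2

passes = percentile(mostly_fast, 95) < TARGET_MS
fails = percentile(two_slow, 95) < TARGET_MS
```

Averages hide exactly the slow turns that callers notice, which is why the checklist targets P95 rather than the mean.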
