The Voice Latency Problem
Voice conversations are unforgiving. Humans expect responses within 500–800ms. Any delay longer than that feels broken. Traditional LLM pipelines (transcription → LLM → TTS) can easily take 2–4 seconds without optimization. Vapi solves much of this, but you still need to understand the tradeoffs.
Keep round-trip time under 600ms for natural feel. This means: using streaming TTS, streaming LLM output, and implementing intelligent end-of-turn detection.
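To see why streaming matters for the 600ms target, here is a rough latency-budget sketch. The stage timings are hypothetical numbers chosen for illustration, not measurements from Vapi or any provider, and the model is simplified: it assumes each stage can hand off as soon as it emits its first chunk.

```python
from dataclasses import dataclass


@dataclass
class StageTiming:
    """Timings for one pipeline stage, in milliseconds (illustrative values)."""
    first_output_ms: int  # time until the stage emits its first usable chunk
    full_output_ms: int   # time until the stage has finished completely


def time_to_first_audio(stages: list[StageTiming], streaming: bool) -> int:
    """Estimate the delay before the caller hears anything.

    With streaming, each stage hands off as soon as it has a first chunk,
    so only the first-output times add up. Without streaming, every stage
    must finish before the next one starts.
    """
    if streaming:
        return sum(s.first_output_ms for s in stages)
    return sum(s.full_output_ms for s in stages)


# Hypothetical numbers for end-of-turn detection + STT, LLM, and TTS.
PIPELINE = [
    StageTiming(first_output_ms=200, full_output_ms=400),   # STT + end-of-turn
    StageTiming(first_output_ms=250, full_output_ms=1500),  # LLM
    StageTiming(first_output_ms=120, full_output_ms=900),   # TTS
]

BUDGET_MS = 600
```

With these assumed numbers the streaming path fits inside the budget while the sequential path lands in the multi-second range described above. Replace the timings with your own measurements before drawing conclusions.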
Key Design Patterns
Barge-In Handling
Users interrupt. Always. Your agent needs to detect when the user starts speaking mid-response, immediately stop the TTS stream, and process the new input. Vapi handles this at the infrastructure level, but your dialog flows need to be designed to gracefully handle interruption at any point.
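One way to make a dialog flow interruption-safe is to track what the caller actually heard, so the next turn never assumes the full response was delivered. The sketch below is a minimal illustration of that idea. The `Playback` class and its methods are hypothetical and do not represent Vapi's API.

```python
from dataclasses import dataclass, field


@dataclass
class Playback:
    """Tracks one agent utterance as a list of sentence-sized chunks."""
    chunks: list[str]
    spoken: list[str] = field(default_factory=list)
    interrupted: bool = False

    def play_next(self) -> bool:
        """Mark the next chunk as spoken; return False when done or interrupted."""
        if self.interrupted or len(self.spoken) == len(self.chunks):
            return False
        self.spoken.append(self.chunks[len(self.spoken)])
        return True

    def barge_in(self) -> list[str]:
        """Stop playback and return the chunks the user never heard."""
        self.interrupted = True
        return self.chunks[len(self.spoken):]
```

After a barge-in, the dialog logic can decide whether the unheard chunks still matter (for example, a confirmation question that was cut off) or whether the new user input supersedes them.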
Intent Handling Over Exact Matching
Voice input is messy: background noise, accents, partial sentences. Don't rely on exact string matching. Use intent classification with fuzzy matching and always ask for confirmation before taking irreversible actions.
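As a minimal sketch of fuzzy intent matching, the example below compares a transcript against a few example phrasings using Python's standard `difflib`. The intent names, example phrases, and the 0.6 threshold are assumptions for illustration. A production system would more likely use an LLM or a trained classifier.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical intents with a few example phrasings each.
INTENT_EXAMPLES = {
    "book_appointment": ["book an appointment", "schedule a visit", "i need an appointment"],
    "cancel_appointment": ["cancel my appointment", "i want to cancel"],
    "check_hours": ["what are your hours", "when are you open"],
}


def classify(utterance: str, threshold: float = 0.6) -> tuple[Optional[str], float]:
    """Return the best-matching intent and its similarity score.

    Returns (None, score) when nothing clears the threshold, so the
    caller can ask the user to rephrase instead of guessing.
    """
    text = utterance.lower().strip()
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = SequenceMatcher(None, text, example).ratio()
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < threshold:
        return None, best_score
    return best_intent, best_score
```

A transcript such as "uh I wanna cancel my appointment" still resolves to `cancel_appointment` despite the filler words, while unrelated input returns `None` and should trigger a clarifying question.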
Constrain Tool-Calls
Voice agents that can take actions (book appointments, update CRM, send emails) must be extremely conservative. Use idempotent, reversible operations wherever possible. Every action that has real-world consequences should require verbal confirmation from the user: "I'm going to book you for Tuesday at 3pm. Does that sound right?"
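The confirm-before-acting pattern can be sketched as a two-step tool: first stage the action and read it back, then execute only on an explicit yes. The `BookingTool` class, its method names, and the list of accepted replies are all hypothetical. In practice the yes/no decision would go through intent classification, not a fixed word list.

```python
from dataclasses import dataclass, field


@dataclass
class BookingTool:
    """Hypothetical booking tool that only acts after a verbal yes."""
    pending: dict[str, dict] = field(default_factory=dict)  # awaiting confirmation
    booked: dict[str, dict] = field(default_factory=dict)   # keyed by idempotency key

    def propose(self, call_id: str, slot: str) -> str:
        """Stage the action and return the sentence the agent should say."""
        self.pending[call_id] = {"slot": slot}
        return f"I'm going to book you for {slot}. Does that sound right?"

    def confirm(self, call_id: str, user_reply: str) -> str:
        """Execute the staged action only on an explicit yes."""
        action = self.pending.pop(call_id, None)
        if action is None:
            return "nothing_pending"
        if user_reply.strip().lower() not in {"yes", "yeah", "correct", "that's right"}:
            return "cancelled"
        # The idempotency key makes a repeated booking of the same slot a no-op.
        key = f"{call_id}:{action['slot']}"
        if key not in self.booked:
            self.booked[key] = action
        return "booked"
```

Keying the booking on call and slot means a retried or duplicated confirmation cannot create a second appointment.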
Grounding Responses
Ground every factual claim to a trusted source. If your voice agent answers questions about pricing, policies, or availability, it must pull from a live data source, not from the model's training data, which may be outdated.
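A simple way to enforce grounding is to have the answer function take a lookup callable and refuse to answer when the lookup returns nothing. The sketch below assumes a hypothetical `fetch_price` function backed by your own pricing system. The wording of the responses is illustrative.

```python
from typing import Callable, Optional


def answer_price(plan: str, fetch_price: Callable[[str], Optional[float]]) -> str:
    """Answer a pricing question only from the live source.

    `fetch_price` is a hypothetical lookup (database, API, etc.) that
    returns None when it has no current price for the plan.
    """
    price = fetch_price(plan)
    if price is None:
        # Never fall back to whatever the model "remembers".
        return "I don't have current pricing for that. Let me connect you with someone who does."
    return f"The {plan} plan is currently ${price:.2f} per month."
```

The key design choice is that the fallback path hands off instead of guessing, so stale training data never reaches the caller.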
Vapi-Specific Configuration
- STT model: Use whisper-large-v3 for accuracy, whisper-base for speed
- TTS voice: Choose voices with low first-token latency. ElevenLabs voices are high quality but slower
- Interruption sensitivity: Tune based on use case. Support calls need high sensitivity, interviews need low
- Max duration: Always set a max call duration to prevent runaway costs
- Fallback: Configure a human transfer number for when the agent can't help
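A pre-deploy check can catch missing safety settings from the list above. The sketch below validates a plain dictionary. The key names (`max_duration_seconds`, `fallback_number`, `interruption_sensitivity`) are placeholders chosen for illustration and are not taken from Vapi's schema, so map them to the real fields in Vapi's documentation.

```python
def check_call_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the basics are set.

    All keys here are hypothetical placeholders, not Vapi field names.
    """
    problems = []
    if not config.get("max_duration_seconds"):
        problems.append("max call duration is not set")
    if not config.get("fallback_number"):
        problems.append("no human transfer number configured")
    if config.get("interruption_sensitivity") not in {"low", "medium", "high"}:
        problems.append("interruption sensitivity must be low, medium, or high")
    return problems
```

Running a check like this in CI keeps a misconfigured agent from reaching production with no call cap or no human fallback.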
Use Case Patterns
Appointment Booking
Healthcare, beauty, home services. Agent asks for preferred time, checks availability via tool call (e.g. Calendly API), proposes slots, confirms verbally, books and sends SMS confirmation.
Inbound Sales Qualification
Ask the 5 qualification questions your sales team always asks. Score the lead. Route hot leads to a human immediately. Log everything to CRM. Saves 20 min of rep time per inbound call.
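The score-and-route step can be sketched as a weighted checklist. The five questions, their weights, and the hot-lead threshold of 70 below are hypothetical. Substitute the questions your sales team actually asks.

```python
# Hypothetical qualification questions and weights (sum to 100).
WEIGHTS = {
    "has_budget": 30,
    "is_decision_maker": 25,
    "timeline_under_90_days": 20,
    "has_defined_need": 15,
    "uses_competitor": 10,
}


def score_lead(answers: dict[str, bool]) -> int:
    """Sum the weights of every question answered yes."""
    return sum(weight for q, weight in WEIGHTS.items() if answers.get(q, False))


def route(score: int, hot_threshold: int = 70) -> str:
    """Send hot leads to a human immediately; everything else is logged only."""
    return "transfer_to_human" if score >= hot_threshold else "log_only"
```

Both outcomes should still be written to the CRM. The route only decides whether a rep is pulled in during the call.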
Customer Support Tier-1
Handle the top 10 most common support requests without human involvement. Escalate when confidence is low. Track escalation rate: if it's above 30%, expand the knowledge base.
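The escalation check is simple arithmetic over call outcomes. In the sketch below, the `"escalated"` and `"resolved"` outcome labels and the 0.7 confidence cutoff are assumptions. Only the 30% threshold comes from the text above.

```python
def should_escalate(confidence: float, min_confidence: float = 0.7) -> bool:
    """Hand off to a human when the agent's confidence is below the cutoff."""
    return confidence < min_confidence


def escalation_rate(outcomes: list[str]) -> float:
    """Fraction of calls that ended in escalation (0.0 for no calls)."""
    if not outcomes:
        return 0.0
    return outcomes.count("escalated") / len(outcomes)


def needs_kb_expansion(outcomes: list[str], threshold: float = 0.30) -> bool:
    """True when the escalation rate is above the threshold."""
    return escalation_rate(outcomes) > threshold
```

For example, 4 escalations in 10 calls is a 40% rate and flags the knowledge base for expansion, while 3 in 10 sits exactly at the threshold and does not.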
Production Checklist
- ✓ P95 latency under 700ms measured end-to-end
- ✓ Barge-in tested with concurrent speakers
- ✓ All tool-calls require verbal confirmation
- ✓ Human fallback number configured
- ✓ Max call duration set
- ✓ Call recordings stored and reviewed weekly
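To verify the P95 latency item, you can compute the percentile from logged end-to-end latencies. The sketch below uses the nearest-rank method, which is one of several common percentile definitions. The sample data is made up.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples: list[float], target_ms: float = 700) -> bool:
    """Check the P95 target from the checklist above."""
    return percentile(samples, 95) < target_ms
```

Note that P95 ignores the slowest 5% of calls, so it is worth reviewing the worst outliers separately when listening to recordings.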