When we started building Ring AI, we obsessed over one metric above all others: latency. Here's why response time is the most important factor in voice AI, and how we achieved sub-200ms responses.
The Psychology of Conversation
Humans are remarkably sensitive to timing in conversation. Studies of turn-taking show that the average gap between speakers in natural dialogue is just 200 milliseconds. Stretch much beyond that, and the conversation starts to feel awkward.
This is why traditional voice AI feels robotic. When there's a 1-2 second delay after you speak, your brain registers it as unnatural. You start to wonder:
- Did it hear me?
- Should I repeat myself?
- Is this thing broken?
Breaking Down the Pipeline
A typical voice AI response involves several steps:
- Speech-to-text (100-300ms)
- Language model processing (200-500ms)
- Text-to-speech (100-300ms)
- Network round trips (50-200ms)
Add these up, and you're looking at 450ms to 1.3 seconds—far too slow for natural conversation.
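To see where the budget goes, here's a minimal sketch that just totals the illustrative ranges above; the stage names and numbers come from this list, not from measurements of any particular system.

```python
# Illustrative per-stage latency ranges from the list above, in milliseconds.
PIPELINE_STAGES = {
    "speech_to_text": (100, 300),
    "language_model": (200, 500),
    "text_to_speech": (100, 300),
    "network_round_trips": (50, 200),
}

# Sum the lower and upper bounds of every stage.
best_case = sum(low for low, _ in PIPELINE_STAGES.values())
worst_case = sum(high for _, high in PIPELINE_STAGES.values())

print(f"best case:  {best_case} ms")   # 450 ms
print(f"worst case: {worst_case} ms")  # 1300 ms
```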
Our Approach
We rebuilt every component of this pipeline with latency as the primary constraint:
Streaming Everything
Instead of waiting for complete sentences, we process speech in chunks as small as 20ms. Our models start generating responses before you've finished speaking.
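To make the idea concrete, here's a minimal sketch that re-chunks incoming 16 kHz PCM audio into 20 ms frames and pushes them to a streaming recognizer as they arrive. The `StreamingRecognizer`-style interface (`accept_frame`, `finalize`) is hypothetical, shown only to illustrate the shape of chunked processing, not our production code.

```python
from typing import Iterable, Iterator

SAMPLE_RATE = 16_000        # samples per second
FRAME_MS = 20               # process audio in 20 ms chunks
BYTES_PER_SAMPLE = 2        # 16-bit PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 640 bytes


def frames(pcm_stream: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk an arbitrary PCM byte stream into fixed 20 ms frames."""
    buffer = b""
    for chunk in pcm_stream:
        buffer += chunk
        while len(buffer) >= FRAME_BYTES:
            yield buffer[:FRAME_BYTES]
            buffer = buffer[FRAME_BYTES:]


def transcribe_streaming(pcm_stream: Iterable[bytes], recognizer) -> Iterator[str]:
    """Feed frames as they arrive. Partial transcripts come back before the
    speaker has finished, so downstream stages can start early."""
    for frame in frames(pcm_stream):
        partial = recognizer.accept_frame(frame)  # hypothetical interface
        if partial:
            yield partial                         # incremental transcript
    yield recognizer.finalize()                   # hypothetical interface
```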
Custom TTS
Off-the-shelf text-to-speech adds 200-300ms of latency. We built our own streaming TTS that begins playback within 50ms of receiving text.
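On the consumer side, the key is to hand audio to the output device chunk by chunk rather than waiting for the full utterance. Here's a minimal sketch of that pattern; `tts.synthesize_stream` and `player.play` are hypothetical interfaces standing in for whatever streaming synthesizer and audio sink you use, not our production API.

```python
import time


def speak(text: str, tts, player) -> None:
    """Start playback on the first synthesized chunk instead of buffering
    the whole utterance. Time-to-first-audio is what the listener perceives."""
    start = time.monotonic()
    first_chunk = True
    for audio_chunk in tts.synthesize_stream(text):  # hypothetical streaming TTS
        if first_chunk:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"first audio after {elapsed_ms:.0f} ms")
            first_chunk = False
        player.play(audio_chunk)  # hand chunks to the audio device as they arrive
```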
Edge Deployment
Every millisecond of network latency matters. We deploy our models at the edge, as close to the telephony infrastructure as possible.
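A quick way to see how much geography alone contributes is to time a bare TCP handshake against candidate regions. The sketch below does exactly that; the hostnames are placeholders you'd swap for your own endpoints.

```python
import socket
import time


def tcp_rtt_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Approximate network round-trip time by timing a TCP handshake."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000


# Placeholder endpoints: compare a distant region against one near the telephony provider.
for host in ("example-us-east.invalid", "example-eu-west.invalid"):
    try:
        print(host, f"{tcp_rtt_ms(host):.0f} ms")
    except OSError as exc:
        print(host, "unreachable:", exc)
```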
The Results
Our production system achieves:
- P50 latency: 180ms
- P95 latency: 240ms
- P99 latency: 320ms
This keeps responses within, or very close to, the natural conversational timing window, resulting in interactions that feel genuinely human.
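If you want to track the same numbers for your own stack, P50/P95/P99 are just percentiles over per-response latencies. Here's a minimal nearest-rank sketch; the sample data is made up for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least p%
    of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Made-up per-response latencies in milliseconds, e.g. pulled from request logs.
latencies_ms = [162.0, 171.0, 175.0, 183.0, 190.0, 198.0, 204.0, 215.0, 231.0, 298.0]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p):.0f} ms")
```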
Try It Yourself
The best way to understand the difference is to experience it. Sign up for a free account and make a test call to one of our demo agents.
Next week, we'll dive deeper into our speech recognition pipeline and how we handle interruptions and cross-talk.