240ms was already good. Our competitors were averaging 340–400ms. But in voice, the difference between 240ms and 200ms is the difference between a conversation that flows and one that feels like talking over a slow phone connection. We wanted 200ms. Here's how we got there, and the one finding that surprised us.
The pipeline audit
Every millisecond in voice has an address. Before optimising anything, we instrumented the entire call path to find where time was going. The results:
- Speech-to-text recognition: 48ms
- Context retrieval (fetching customer data): 31ms
- LLM inference: 82ms
- Text-to-speech synthesis: 44ms
- Twilio WebRTC delivery: 35ms
Total: 240ms. Every number had a story.
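A minimal sketch of this kind of per-stage instrumentation, assuming each stage can be wrapped in a timing context. `StageTimer` and the stage keys are hypothetical names for illustration; only the millisecond figures come from the audit above.

```python
import time
from contextlib import contextmanager

# Per-stage averages reported in the audit, in milliseconds.
AUDIT_MS = {
    "speech_to_text": 48,
    "context_retrieval": 31,
    "llm_inference": 82,
    "text_to_speech": 44,
    "webrtc_delivery": 35,
}


class StageTimer:
    """Collects wall-clock durations per pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.samples_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples_ms.setdefault(name, []).append(elapsed_ms)

    def averages(self):
        return {name: sum(v) / len(v) for name, v in self.samples_ms.items()}


def total_ms(budget):
    """Sum of all stage timings in a budget."""
    return sum(budget.values())


def share_of_total(budget):
    """Fraction of the end-to-end time spent in each stage."""
    total = total_ms(budget)
    return {name: ms / total for name, ms in budget.items()}
```

Wrapping a stage looks like `with timer.stage("llm_inference"): ...`. Ranking stages by `share_of_total` is one way to decide where to look first: with the audit figures, LLM inference is the largest single share.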
Where the wins were
LLM inference (82ms → 51ms): We rewrote the inference pipeline to stream tokens directly to the TTS engine instead of waiting for the full response to complete. The first syllable of AIVA's response now starts synthesising before the last token has been generated. Streaming alone accounted for 20ms of the 31ms saved.
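The streaming idea can be sketched as a generator that hands text to the synthesiser at the first safe boundary rather than at the end of the response. This is an illustration under assumptions, not the production pipeline: `chunk_for_tts`, the boundary characters, and the `min_chars` threshold are all hypothetical choices.

```python
from typing import Iterable, Iterator

# Characters treated as safe hand-off points to the synthesiser (assumption).
BOUNDARIES = (".", ",", "?", "!", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable chunks as soon as a boundary is reached.

    The token stream is consumed lazily, so the first chunk is available
    before the model has finished generating the full response.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars and buffer.rstrip().endswith(BOUNDARIES):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded chunk would be passed to the TTS engine immediately. The `min_chars` threshold trades a slightly later first chunk for fewer very short synthesis requests.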
Context retrieval (31ms → 12ms): We moved to a warm cache holding the most recent 500 messages per customer, keyed by session ID. Most lookups within an ongoing conversation hit the cache. Cold calls (the first contact from a number) still take 31ms, but they are a minority of traffic.
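A minimal in-process sketch of such a warm cache, assuming a bounded history per session and least-recently-used eviction across sessions. `WarmContextCache` and `max_sessions` are hypothetical; the 500-message limit and the session-ID key are the only details taken from the text.

```python
from collections import OrderedDict, deque

MAX_MESSAGES = 500  # most recent messages kept per session


class WarmContextCache:
    """Keeps recent message history in memory, keyed by session ID."""

    def __init__(self, max_sessions=10_000):  # max_sessions is an assumption
        self._sessions = OrderedDict()
        self._max_sessions = max_sessions

    def append(self, session_id, message):
        history = self._sessions.get(session_id)
        if history is None:
            # deque(maxlen=...) silently drops the oldest message when full.
            history = deque(maxlen=MAX_MESSAGES)
            self._sessions[session_id] = history
        history.append(message)
        self._sessions.move_to_end(session_id)
        # Evict the least recently used sessions beyond the limit.
        while len(self._sessions) > self._max_sessions:
            self._sessions.popitem(last=False)

    def get(self, session_id):
        """Return the cached history, or None on a cold lookup."""
        history = self._sessions.get(session_id)
        if history is None:
            return None  # caller falls back to the slower fetch
        self._sessions.move_to_end(session_id)
        return list(history)
```

A `None` result corresponds to the cold path, where the caller fetches customer data the slow way and then populates the cache with `append`.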
Regional caching (TTS): Common phrases (greetings, hold messages, clarification requests) are now pre-synthesised and served from regional edge nodes. The "Namaste, how can I help you today?" greeting went from 44ms to 0ms: it's already in cache before the call connects.
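The lookup side of phrase caching can be sketched as follows, assuming audio is stored under a normalised form of the phrase. `PhraseAudioCache`, `normalise`, and the injected `synthesise` callable are hypothetical stand-ins; distribution to edge nodes is out of scope for this sketch.

```python
from typing import Callable, Dict, Iterable, Tuple


def normalise(phrase: str) -> str:
    """Lower-case and collapse whitespace so trivial variants share a key."""
    return " ".join(phrase.lower().split())


class PhraseAudioCache:
    """Serves pre-synthesised audio for common phrases."""

    def __init__(self, synthesise: Callable[[str], bytes]):
        self._synthesise = synthesise
        self._audio: Dict[str, bytes] = {}

    def warm(self, phrases: Iterable[str]) -> None:
        # Synthesise ahead of time, before any call needs the audio.
        for phrase in phrases:
            self._audio[normalise(phrase)] = self._synthesise(phrase)

    def audio_for(self, text: str) -> Tuple[bytes, bool]:
        """Return (audio, was_cached); synthesise on demand on a miss."""
        cached = self._audio.get(normalise(text))
        if cached is not None:
            return cached, True
        return self._synthesise(text), False
```

Calling `warm` with the greeting at deploy time means the first `audio_for` call for that phrase never reaches the synthesiser.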
The Twilio surprise
The Twilio delivery number — 35ms — looked fixed. It's network time. We can't control network physics.
Or so we thought.
Digging into Twilio's WebRTC implementation, we found that their default encoder uses Opus at 20ms frame intervals, but with a 40ms lookahead buffer for the codec's bitrate prediction. This means every audio packet was being held for an additional 30ms before transmission — a deliberate encoder trade-off for call quality at the cost of latency.
Twilio exposes a preferredCodecs override in their media constraints. Switching to Opus with maxptime=10 (10ms frames, no lookahead buffer) dropped the delivery latency from 35ms to 8ms. The trade-off: marginally lower audio quality on degraded connections. In our testing, call quality ratings were identical. The latency improvement was 27ms.
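For readers unfamiliar with `maxptime`: in SDP it is a media-level attribute (`a=maxptime:<ms>`) that caps how much audio a single packet may carry. The sketch below shows what that cap looks like when written into a session description. It is a generic SDP illustration, not Twilio's API and not necessarily how the override above is applied; `set_audio_maxptime` is a hypothetical helper.

```python
def set_audio_maxptime(sdp: str, maxptime_ms: int = 10) -> str:
    """Return the SDP with a=maxptime set on every audio media section.

    Any existing a=maxptime line inside an audio section is replaced.
    Non-audio sections are left untouched.
    """
    out = []
    in_audio = False
    for line in sdp.splitlines():
        if line.startswith("m="):
            # A new media section starts: close the previous audio section.
            if in_audio:
                out.append(f"a=maxptime:{maxptime_ms}")
            in_audio = line.startswith("m=audio")
        elif in_audio and line.startswith("a=maxptime:"):
            continue  # drop the old value; the new one is appended later
        out.append(line)
    if in_audio:
        out.append(f"a=maxptime:{maxptime_ms}")
    # SDP lines are terminated with CRLF.
    return "\r\n".join(out) + "\r\n"
```

The cost side of the trade-off is simple arithmetic: 10ms frames mean 100 packets per second instead of 50 at 20ms, so per-packet header overhead doubles.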
Result
Before: 240ms average, 310ms p95. After: 198ms average, 260ms p95.
We hit 200ms. The calls feel different. Customers don't notice latency consciously until it's bad — but they feel when it's good. Our CSAT across voice calls improved 0.3 points in the month after the optimisation shipped, with no other changes. We'll take it.