How we shaved 40ms off voice latency.

At 240ms, our voice latency was already good. Our competitors were averaging 340–400ms. But in voice, the difference between 240ms and 200ms is the difference between a conversation that flows and one that feels like talking over a slow phone connection. We wanted 200ms. Here's how we got there, and the one finding that surprised us.

The pipeline audit

Every millisecond in voice has an address. Before optimising anything, we instrumented the entire call path to find where time was going. The results:

Speech-to-text recognition: 48ms
Context retrieval (fetching customer data): 31ms
LLM inference: 82ms
Text-to-speech synthesis: 44ms
Twilio WebRTC delivery: 35ms

Total: 240ms. Every number had a story.
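As an illustration of this kind of audit, here is a minimal sketch of per-stage timing in Python. The class name LatencyBudget and the stage keys are hypothetical and not taken from the production code; only the millisecond figures come from the audit above.

```python
import time
from contextlib import contextmanager


class LatencyBudget:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.stages_ms = {}

    @contextmanager
    def stage(self, name):
        # Time whatever runs inside the `with` block and add it to the stage total.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.stages_ms.values())

    def slowest_first(self):
        # Sort stages so the biggest contributor is listed first.
        return sorted(self.stages_ms.items(), key=lambda kv: kv[1], reverse=True)


# The audited figures from the list above, used as a consistency check.
AUDIT_MS = {
    "speech_to_text": 48,
    "context_retrieval": 31,
    "llm_inference": 82,
    "text_to_speech": 44,
    "webrtc_delivery": 35,
}
AUDIT_TOTAL_MS = sum(AUDIT_MS.values())  # 240
```

In use, each stage of a call would be wrapped in `with budget.stage("llm_inference"): ...`, and `slowest_first()` then shows where to look first.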

Where the wins were

LLM inference (82ms → 51ms): We rewrote the inference pipeline to stream tokens directly to the TTS engine instead of waiting for the full response to complete. The first syllable of AIVA's response now starts synthesising before the last token has been generated. Streaming alone accounted for 20ms of the 31ms saving.
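The streaming idea can be sketched as a generator that hands text to the synthesiser as soon as a speakable chunk is ready. This is an illustration under assumptions, not the production pipeline: the function name stream_chunks, the punctuation-based boundary rule and the max_tokens limit are all hypothetical choices.

```python
from typing import Iterable, Iterator

# Assumption: a chunk is "speakable" once it ends in punctuation.
BOUNDARY_CHARS = tuple(".,;:!?")


def stream_chunks(tokens: Iterable[str], max_tokens: int = 8) -> Iterator[str]:
    """Yield text chunks for TTS while the token stream is still being produced."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        ends_phrase = token.rstrip().endswith(BOUNDARY_CHARS)
        if ends_phrase or len(buffer) >= max_tokens:
            chunk = "".join(buffer).strip()
            buffer = []
            if chunk:
                yield chunk
    # Flush whatever is left when the model stops generating.
    tail = "".join(buffer).strip()
    if tail:
        yield tail
```

Because the function is a generator, the first chunk is available to the synthesiser after only the first few tokens have been consumed, which is where the saving comes from.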

Context retrieval (31ms → 12ms): We moved to a warm cache holding the most recent 500 messages per customer, keyed by session ID. Most support calls that continue an existing conversation hit the cache. Cold calls (the first contact from a number) still take 31ms, but they are a minority of traffic.
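A warm cache of this shape can be sketched with a bounded deque per session. The class WarmContextCache and the load_from_store callback are hypothetical names for illustration; the 500-message limit and the session ID key are the only details taken from the description above.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_MESSAGES = 500  # most recent messages kept per session


class WarmContextCache:
    """In-memory cache of recent messages, keyed by session ID (illustrative sketch)."""

    def __init__(self, load_from_store: Callable[[str], List[str]]):
        # load_from_store stands in for the slow path (database or API lookup).
        self._load = load_from_store
        self._sessions: Dict[str, Deque[str]] = {}

    def get(self, session_id: str) -> List[str]:
        cached = self._sessions.get(session_id)
        if cached is None:
            # Cold path: first contact for this session, so pay the full lookup once.
            messages = self._load(session_id)
            cached = deque(messages[-MAX_MESSAGES:], maxlen=MAX_MESSAGES)
            self._sessions[session_id] = cached
        return list(cached)

    def append(self, session_id: str, message: str) -> None:
        if session_id not in self._sessions:
            self.get(session_id)  # warm the session before appending
        # deque(maxlen=...) silently evicts the oldest message when full.
        self._sessions[session_id].append(message)
```

A real deployment would also need eviction of idle sessions and invalidation when customer data changes; both are left out here to keep the sketch short.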

Regional caching (TTS): Common phrases (greetings, hold messages, clarification requests) are now pre-synthesised and served from regional edge nodes. The "Namaste, how can I help you today?" greeting went from 44ms to 0ms, because the audio is already in cache before the call connects.
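The lookup side of phrase caching might look like the sketch below. PhraseCache, normalise and the synthesise callback are hypothetical names, and the distribution of audio to regional edge nodes is not modelled; the sketch only shows that a cached phrase skips synthesis entirely while anything else falls through to the normal TTS path.

```python
from typing import Callable, Dict, Iterable


def normalise(text: str) -> str:
    # Collapse case and whitespace so trivially different strings share one entry.
    return " ".join(text.lower().split())


class PhraseCache:
    """Pre-synthesised audio for common phrases (illustrative sketch)."""

    def __init__(self, synthesise: Callable[[str], bytes], phrases: Iterable[str]):
        self._synthesise = synthesise
        # Pay the synthesis cost once, ahead of time, for each known phrase.
        self._audio: Dict[str, bytes] = {
            normalise(phrase): synthesise(phrase) for phrase in phrases
        }

    def audio_for(self, text: str) -> bytes:
        cached = self._audio.get(normalise(text))
        if cached is not None:
            return cached  # cache hit: no synthesis at call time
        return self._synthesise(text)  # cache miss: normal TTS path
```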

The Twilio surprise

The Twilio delivery number — 35ms — looked fixed. It's network time. We can't control network physics.

Or so we thought.

Digging into Twilio's WebRTC implementation, we found that their default encoder uses Opus at 20ms frame intervals, but with a 40ms lookahead buffer for the codec's bitrate prediction. This means every audio packet was being held for an additional 30ms before transmission — a deliberate encoder trade-off for call quality at the cost of latency.

Twilio exposes a preferredCodecs override in their media constraints. Switching to Opus with maxptime=10 (10ms frames, no lookahead buffer) dropped the delivery latency from 35ms to 8ms. The trade-off: marginally lower audio quality on degraded connections. In our testing, call quality ratings were identical. The latency improvement was 27ms.
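To make the maxptime setting concrete: in SDP, the maximum packet duration for an audio stream is carried by the a=maxptime attribute of the audio media section. The sketch below rewrites an SDP blob to set that attribute to 10ms. It is a generic illustration, not Twilio's API: the function name set_audio_maxptime is hypothetical, and whether a given SDK applies the setting through SDP rewriting or through its own options is an assumption to verify against that SDK's documentation.

```python
def set_audio_maxptime(sdp: str, maxptime_ms: int = 10) -> str:
    """Return a copy of `sdp` with a=maxptime set on every audio media section."""
    lines = [line for line in sdp.replace("\r\n", "\n").split("\n") if line]

    # Split into the session-level block plus one block per m= line.
    sections = [[]]
    for line in lines:
        if line.startswith("m="):
            sections.append([])
        sections[-1].append(line)

    out = []
    for section in sections:
        if section and section[0].startswith("m=audio"):
            # Drop any existing value, then append ours at the end of the section,
            # so the attribute stays after the c= and b= lines.
            section = [l for l in section if not l.startswith("a=maxptime:")]
            section.append(f"a=maxptime:{maxptime_ms}")
        out.extend(section)

    # SDP lines are terminated with CRLF.
    return "\r\n".join(out) + "\r\n"
```

One cost worth keeping in mind: 10ms frames mean 100 packets per second instead of 50 at 20ms, so per-packet header overhead doubles.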

Result

Before: 240ms average, 310ms p95. After: 198ms average, 260ms p95.

We hit 200ms. The calls feel different. Customers don't notice latency consciously until it's bad — but they feel when it's good. Our CSAT across voice calls improved 0.3 points in the month after the optimisation shipped, with no other changes. We'll take it.

Engineering, Voice, ML

Written by

Rohan Mehta

Engineering

