Engineering · April 8, 2026 · 6 min read

How we shaved 40ms off voice latency.

Inference pipeline rewrite, regional caching, and the surprising thing we found out about Twilio's WebRTC encoder.

Rohan Mehta
Engineering

240ms was already good. Our competitors were averaging 340–400ms. But in voice, the difference between 240ms and 200ms is the difference between a conversation that flows and one that feels like talking over a slow phone connection. We wanted 200ms. Here's how we got there, and the one finding that surprised us.

The pipeline audit

Every millisecond in voice has an address. Before optimising anything, we instrumented the entire call path to find where time was going. The results:

  • Speech-to-text recognition: 48ms
  • Context retrieval (fetching customer data): 31ms
  • LLM inference: 82ms
  • Text-to-speech synthesis: 44ms
  • Twilio WebRTC delivery: 35ms

Total: 240ms. Every number had a story.
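A per-stage audit like this can be collected by wrapping each stage in a timer. The following is a minimal Python sketch, not AIVA's actual instrumentation: `StageTimer` and the stage keys are hypothetical names chosen for illustration. The figures in `AUDIT_MS` are the ones listed above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples_ms[name].append(elapsed_ms)

    def mean_ms(self, name):
        values = self.samples_ms[name]
        return sum(values) / len(values)


# The audit figures from the post, summed to confirm the stated total.
AUDIT_MS = {
    "stt": 48,
    "context_retrieval": 31,
    "llm_inference": 82,
    "tts": 44,
    "webrtc_delivery": 35,
}
TOTAL_MS = sum(AUDIT_MS.values())  # 240

# Usage: wrap each stage of a call in `with timer.stage(...)`.
timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.001)  # stand-in for real work
```

Averaging many samples per stage, rather than timing a single call, is what makes it reasonable to treat each stage's number as stable enough to optimise against.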

Where the wins were

LLM inference (82ms → 51ms): We rewrote the inference pipeline to stream tokens directly to the TTS engine instead of waiting for the full response to complete. The first syllable of AIVA's response now starts synthesising before the last token has been generated. Streaming alone accounted for 20ms of the 31ms saving.
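The streaming idea can be sketched as a generator that hands text to the TTS engine in small speakable chunks as tokens arrive. This is an illustrative Python sketch under assumptions: `chunk_for_tts`, `fake_llm_stream`, the boundary characters, and the 40-character limit are all hypothetical, and the post does not say how AIVA actually chunks its output.

```python
from typing import Iterable, Iterator

# Characters that end a speakable chunk (an assumption; real systems tune this).
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 40) -> Iterator[str]:
    """Yield speakable text chunks as soon as they are ready.

    A chunk is flushed at a punctuation boundary or once the buffer reaches
    max_chars, so synthesis can begin before generation has finished.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(BOUNDARIES) or len(buffer) >= max_chars:
            text = buffer.strip()
            if text:
                yield text
            buffer = ""
    # Flush whatever is left once the token stream ends.
    text = buffer.strip()
    if text:
        yield text


def fake_llm_stream() -> Iterator[str]:
    # Stand-in for a streaming LLM client (hypothetical).
    yield from ["Sure", ",", " I", " can", " help", ".", " One", " moment"]


chunks = list(chunk_for_tts(fake_llm_stream()))
# chunks == ["Sure,", "I can help.", "One moment"]
```

The first chunk is available after two tokens here, which is the point: time to first audio depends on the first chunk, not on the full response.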

Context retrieval (31ms → 12ms): We moved to a warm cache holding the most recent 500 messages per customer, keyed by session ID. Most lookups within an ongoing conversation hit the cache. Cold calls (the first contact from a number) still take 31ms, but they are a minority of traffic.
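A warm cache of this shape can be sketched with a bounded deque per session. This is a minimal in-process Python sketch, not AIVA's implementation: `WarmContextCache` and the `max_sessions` eviction bound are assumptions, and a production version would more likely sit in a shared store. The 500-message limit comes from the description above.

```python
from collections import OrderedDict, deque

MAX_MESSAGES = 500  # most recent messages kept, per the description above


class WarmContextCache:
    """Per-session message history with least-recently-used session eviction."""

    def __init__(self, max_sessions=10_000):
        self._sessions = OrderedDict()
        self._max_sessions = max_sessions  # assumed bound, not from the post

    def append(self, session_id, message):
        history = self._sessions.get(session_id)
        if history is None:
            # deque(maxlen=...) silently drops the oldest message when full.
            history = deque(maxlen=MAX_MESSAGES)
            self._sessions[session_id] = history
            if len(self._sessions) > self._max_sessions:
                self._sessions.popitem(last=False)  # evict least recently used
        self._sessions.move_to_end(session_id)
        history.append(message)

    def get(self, session_id):
        history = self._sessions.get(session_id)
        if history is None:
            return None  # cold: the caller falls back to the slower store
        self._sessions.move_to_end(session_id)
        return list(history)
```

A `None` return corresponds to the cold-call path: the caller pays the full retrieval cost once and then populates the cache for the rest of the conversation.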

Regional caching (TTS): Common phrases (greetings, hold messages, clarification requests) are now pre-synthesised and served from regional edge nodes. For the "Namaste, how can I help you today?" greeting, synthesis time went from 44ms to 0ms, because the audio is already in cache before the call connects.
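Pre-synthesis amounts to a lookup table from normalised phrase text to ready-made audio, with a fallback to live synthesis on a miss. The Python sketch below is illustrative only: `PhraseCache`, `phrase_key`, the `synthesize` callable, and the voice label are hypothetical, and it ignores the edge-distribution side entirely.

```python
import hashlib


def phrase_key(text, voice):
    """Stable cache key: voice plus case- and whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(f"{voice}|{normalised}".encode("utf-8")).hexdigest()


class PhraseCache:
    def __init__(self, synthesize):
        # `synthesize(text, voice) -> bytes` is a stand-in for a real TTS call.
        self._synthesize = synthesize
        self._audio = {}
        self.misses = 0

    def warm(self, phrases, voice):
        """Synthesise common phrases ahead of time, before any call arrives."""
        for text in phrases:
            self._audio[phrase_key(text, voice)] = self._synthesize(text, voice)

    def get_audio(self, text, voice):
        key = phrase_key(text, voice)
        if key not in self._audio:
            self.misses += 1
            return self._synthesize(text, voice)  # uncommon phrase: full cost
        return self._audio[key]
```

Only phrases that were warmed are served from cache; everything else still goes through synthesis, which matches the post's framing that the 0ms figure applies to common phrases such as the greeting.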

The Twilio surprise

The Twilio delivery number — 35ms — looked fixed. It's network time. We can't control network physics.

Or so we thought.

Digging into Twilio's WebRTC implementation, we found that their default encoder uses Opus at 20ms frame intervals, but with a 40ms lookahead buffer for the codec's bitrate prediction. This means every audio packet was being held for an additional 30ms before transmission — a deliberate encoder trade-off for call quality at the cost of latency.

Twilio exposes a preferredCodecs override in their media constraints. Switching to Opus with maxptime=10 (10ms frames, no lookahead buffer) dropped the delivery latency from 35ms to 8ms. The trade-off: marginally lower audio quality on degraded connections. In our testing, call quality ratings were identical. The latency improvement was 27ms.
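For readers who want to see what a packet-time limit looks like on the wire, here is a generic SDP sketch in Python. It is not Twilio's API and does not reproduce the `preferredCodecs` override mentioned above; `set_audio_ptime` is a hypothetical helper. It relies on one fact from the Opus RTP payload specification (RFC 7587): `ptime` and `maxptime` are carried as `a=ptime` and `a=maxptime` attributes on the audio media section. Whether a given SDK or peer honours them is an assumption to verify.

```python
def set_audio_ptime(sdp: str, ptime_ms: int = 10) -> str:
    """Return SDP with a=ptime / a=maxptime set on every audio m-section."""
    newline = "\r\n" if "\r\n" in sdp else "\n"
    attrs = [f"a=ptime:{ptime_ms}", f"a=maxptime:{ptime_ms}"]
    out = []
    in_audio = False
    for line in sdp.splitlines():
        if line.startswith("m="):
            # A new media section starts: close the previous audio section first.
            if in_audio:
                out.extend(attrs)
            in_audio = line.startswith("m=audio")
        elif in_audio and line.startswith(("a=ptime:", "a=maxptime:")):
            continue  # drop existing values; ours are re-added at section end
        out.append(line)
    if in_audio:
        out.extend(attrs)
    return newline.join(out) + newline


example = (
    "v=0\r\n"
    "m=audio 9 UDP/TLS/RTP/SAVPF 111\r\n"
    "a=rtpmap:111 opus/48000/2\r\n"
    "a=ptime:20\r\n"
    "m=video 9 UDP/TLS/RTP/SAVPF 96\r\n"
    "a=rtpmap:96 VP8/90000\r\n"
)
patched = set_audio_ptime(example, 10)
```

In `patched`, the audio section carries `a=ptime:10` and `a=maxptime:10` and the video section is untouched. Smaller frames mean more packets per second, so the header overhead rises as latency falls.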

Result

Before: 240ms average, 310ms p95. After: 198ms average, 260ms p95.

We hit 200ms. The calls feel different. Customers don't notice latency consciously until it's bad — but they feel when it's good. Our CSAT across voice calls improved 0.3 points in the month after the optimisation shipped, with no other changes. We'll take it.
