The original AIVA voice pipeline was a masterpiece of short-term thinking. Four months of moving fast had produced something that worked in demos, held together in production, and was completely unmaintainable. It was four separate vendor APIs duct-taped together with 3,000 lines of orchestration code that nobody fully understood, least of all the person who'd written most of it (me).
This is the story of the rewrite: what broke, what we built, and the one lesson I wish we'd learned earlier.
What was wrong
The architecture was called "the chain" internally. A voice call arrived at Twilio, got transcribed by API-A, the transcript got sent to API-B for language understanding, the response got sent to API-C for synthesis, and the audio got routed back through Twilio. Five network hops between a human speaking and AIVA responding.
Each hop had its own error surface. If API-B returned a 503 at 2am, the call failed silently — the customer heard hold music until the Twilio timeout fired. We had no visibility into which hop had failed. Our on-call routine was to check four different vendor dashboards in sequence, which took long enough that the customer had usually hung up by the time we found the problem.
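For illustration, here is a minimal sketch of the kind of hop attribution the chain lacked. It assumes each stage can be modelled as a plain function; `HopError`, `run_chain`, and the hop names are hypothetical, not taken from the original orchestration code.

```python
from typing import Any, Callable, List, Tuple


class HopError(Exception):
    """Raised when one stage of the chain fails; records which stage it was."""

    def __init__(self, hop: str, cause: Exception):
        super().__init__(f"hop '{hop}' failed: {cause!r}")
        self.hop = hop
        self.cause = cause


def run_chain(payload: Any, hops: List[Tuple[str, Callable[[Any], Any]]]) -> Any:
    """Run each hop in order, tagging any failure with the hop's name."""
    for name, fn in hops:
        try:
            payload = fn(payload)
        except Exception as exc:
            # Wrap instead of swallowing, so on-call sees the failing hop at once.
            raise HopError(name, exc) from exc
    return payload
```

With something like this in place, a 503 from the language-understanding vendor surfaces as one error that names the hop, rather than a silent failure spread across four dashboards.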
The latency was also stuck. We'd optimised everything we could access — our own code, our caching layers, our database queries — and the p95 was still 380ms. The ceiling was the five-hop architecture itself. You can't make five network round-trips in 200ms, regardless of how fast each one is.
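The arithmetic behind that ceiling is simple enough to sketch. The 200ms figure and the five hops come from the paragraph above; the per-hop latencies in the example are invented for illustration and are not measurements from our system.

```python
from typing import Sequence


def per_hop_budget_ms(target_ms: float, hops: int) -> float:
    """Average time each sequential hop may take if the whole chain must fit in target_ms."""
    if hops <= 0:
        raise ValueError("hops must be positive")
    return target_ms / hops


def chain_latency_ms(hop_latencies_ms: Sequence[float]) -> float:
    """Sequential hops add up: the chain is never faster than the sum of its hops."""
    return sum(hop_latencies_ms)


# Five hops inside a 200ms target leaves 40ms per hop on average,
# and that has to cover both the network round-trip and the vendor's processing.
budget = per_hop_budget_ms(200, 5)

# Hypothetical per-hop latencies, for illustration only.
example_hops_ms = [60, 90, 70, 80, 60]
total = chain_latency_ms(example_hops_ms)
```

Tuning any single hop only shrinks one term of the sum. Removing hops is the only change that moves the floor.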
How we approached the rewrite
We ran the rewrite in parallel with the production system for six weeks before cutting over. This sounds obvious. It took us three weeks to decide to do it because the parallel approach felt slower — we wanted to just build the new thing and switch over. The six-week parallel run found eleven bugs that would have caused production incidents on day one. We're glad we did it.
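One common shape for a parallel run is shadow traffic: production keeps answering every request while the new system sees the same input and its output is only compared, never served. The sketch below assumes both pipelines can be called as plain functions; `shadow_run` and `ShadowReport` are hypothetical names, and this is not the actual harness.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class ShadowReport:
    total: int = 0
    mismatches: List[dict] = field(default_factory=list)


def shadow_run(
    requests: Iterable[Any],
    old_pipeline: Callable[[Any], Any],
    new_pipeline: Callable[[Any], Any],
    same: Callable[[Any, Any], bool] = lambda a, b: a == b,
) -> Tuple[List[Any], ShadowReport]:
    """Serve every request from old_pipeline; run new_pipeline on the side and log differences."""
    report = ShadowReport()
    served_results = []
    for request in requests:
        served = old_pipeline(request)  # production answer, always the one returned
        served_results.append(served)
        report.total += 1
        try:
            candidate = new_pipeline(request)  # shadow answer, never returned to the caller
        except Exception as exc:
            # A crash in the new pipeline is recorded, not propagated.
            report.mismatches.append({"request": request, "old": served, "error": repr(exc)})
            continue
        if not same(served, candidate):
            report.mismatches.append({"request": request, "old": served, "new": candidate})
    return served_results, report
```

For audio or generated text, strict equality is usually too tight, which is why the comparison is a parameter: a real harness would pass a tolerance-based `same` function.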
The new architecture is a unified inference service. Speech recognition, language understanding, response generation, and synthesis all run in a single process, on hardware we control, in the same region as the Twilio media server. One network hop — from Twilio to us and back.
The latency improvement was immediate and dramatic. The first day the new pipeline was live, our p95 dropped from 380ms to 260ms. In the two months since, we've optimised it further to 220ms p95. The theoretical minimum with our current hardware is around 180ms. We're working toward it.
The most important thing we got right: we didn't rewrite the API contract. External callers — the webhook system, the dashboard — saw no change. The rewrite was internal infrastructure. Nobody outside the engineering team needed to do anything when it shipped.
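The idea of swapping internals behind a fixed contract can be sketched as follows. This assumes the contract reduces to "audio in, audio out"; `VoicePipeline`, `ChainPipeline`, `UnifiedPipeline`, and `handle_call` are hypothetical stand-ins, not our real interfaces.

```python
from typing import Protocol


class VoicePipeline(Protocol):
    """The contract callers depend on: audio in, audio out."""

    def respond(self, audio: bytes) -> bytes: ...


class ChainPipeline:
    """Stand-in for the old multi-vendor chain."""

    def respond(self, audio: bytes) -> bytes:
        return b"reply-from-chain"


class UnifiedPipeline:
    """Stand-in for the new single-process service."""

    def respond(self, audio: bytes) -> bytes:
        return b"reply-from-unified"


def handle_call(pipeline: VoicePipeline, audio: bytes) -> dict:
    """Caller-facing handler: the response shape is the same whichever pipeline sits behind it."""
    reply = pipeline.respond(audio)
    return {"status": "ok", "audio": reply}
```

Because `handle_call` depends only on the `respond` method, cutting over is a change to which object gets passed in, and nothing the webhook system or dashboard can observe in the response shape.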
The one lesson
We spent the first year building on top of vendor APIs because it felt safer. Vendors handle reliability, uptime, billing, scaling — you don't have to. This is true, and for early-stage companies, it's the right call.
But vendor APIs have a ceiling. When your requirements exceed what the vendor optimised for — and ours did, at around month 10 — you hit it. At that point, you have two choices: stay on the ceiling and compete with one hand tied behind your back, or do the hard work of owning the layer.
We stayed on the ceiling for six months longer than we should have because the rewrite felt scary. The six months cost us more in competitive disadvantage than the rewrite cost us in engineering time.
The lesson: vendor abstraction is debt, not safety. It defers cost; it doesn't eliminate it. Understand what you're deferring and when you'll have to pay.