On March 15th at 2:47 AM, our system processed its one millionth call. I know the exact time because I was awake, watching our dashboards, convinced something would break. Nothing did. But getting there? That's a story worth telling.
This isn't a victory lap. It's an honest account of what it takes to build voice AI infrastructure that actually works at scale, including the spectacular failures that taught us more than our successes ever could.
The Day Everything Fell Apart
Let me start with our worst day: October 3rd, 2023.
We'd just signed our biggest customer: a national retail chain expecting 50,000 calls during their holiday promotion. We were confident. Our load tests looked great. We'd provisioned extra capacity.
At 9:03 AM on launch day, our average latency spiked from 180ms to 2.4 seconds. Conversations became impossible. Customers were hanging up. Our customer's support lines were melting down. My phone wouldn't stop buzzing.
What went wrong? Our architecture had a hidden bottleneck we'd never seen in testing.
"Load tests lie. They tell you how your system handles synthetic traffic. Real traffic is messier, more correlated, and always finds the weakness you didn't know existed."
A very expensive lesson: it cost us $47,000 in credits and very nearly cost us the customer.
The Architecture That Finally Worked
After October 3rd, we rebuilt from first principles. Here's what our production system looks like today:
- Edge Layer: Global PoPs for <50ms latency to any caller; WebRTC termination and initial audio processing.
- Processing Layer: Distributed STT/NLU processing on auto-scaling Kubernetes clusters, with regional failover.
- Intelligence Layer: LLM inference with response caching, context management, and decision routing.
- Integration Layer: CRM connectors, action execution, and human handoff orchestration.
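If you prefer code to diagrams, here's a minimal sketch of that call path as a staged pipeline. The layer names mirror the architecture above; everything inside the handlers (the context fields, the placeholder lookups) is illustrative, not our production code.

```python
# A simplified sketch of the four-layer call path described above.
# Handler internals are placeholders, not production logic.
import asyncio
from dataclasses import dataclass, field


@dataclass
class CallContext:
    call_id: str
    audio_frames: list = field(default_factory=list)  # filled by the edge layer
    transcript: str = ""                               # filled by STT/NLU
    intent: str = ""                                   # filled by the intelligence layer
    actions: list = field(default_factory=list)        # executed by the integration layer


async def edge_layer(ctx: CallContext) -> CallContext:
    # WebRTC termination + initial audio processing at the nearest PoP.
    ctx.audio_frames.append(b"\x00" * 320)  # stand-in for a 20 ms audio frame
    return ctx


async def processing_layer(ctx: CallContext) -> CallContext:
    # Distributed STT/NLU; in production this fans out to a regional cluster.
    ctx.transcript = "where is my order"
    return ctx


async def intelligence_layer(ctx: CallContext) -> CallContext:
    # LLM inference with response caching and decision routing.
    ctx.intent = "order_status"
    return ctx


async def integration_layer(ctx: CallContext) -> CallContext:
    # CRM lookups, action execution, human handoff if needed.
    ctx.actions.append({"type": "crm_lookup", "intent": ctx.intent})
    return ctx


async def handle_call(call_id: str) -> CallContext:
    ctx = CallContext(call_id=call_id)
    for layer in (edge_layer, processing_layer, intelligence_layer, integration_layer):
        ctx = await layer(ctx)
    return ctx


if __name__ == "__main__":
    print(asyncio.run(handle_call("demo-call-1")).intent)
```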
The Five Lessons That Changed Everything
Lesson 1: Latency Is a Feature, Not a Metric
In web applications, the difference between 200ms and 400ms response time is barely noticeable. In voice AI, it's the difference between natural conversation and awkward silence.
We obsess over P99 latency, not averages. A system with 150ms average but 800ms P99 will frustrate 1 in 100 users consistently. They'll never trust it.
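To see why the average hides that 1-in-100 experience, here's a toy example with a synthetic latency distribution. The numbers are invented for illustration, not pulled from our telemetry.

```python
# Why we alert on P99, not the average: a toy latency distribution.
import random


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for monitoring."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]


random.seed(7)
# Most calls land around 150 ms; a small tail gets stuck behind a slow dependency.
latencies_ms = (
    [random.gauss(150, 20) for _ in range(985)]
    + [random.gauss(800, 100) for _ in range(15)]
)

avg = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)

print(f"average: {avg:.0f} ms")  # looks perfectly healthy
print(f"P99:     {p99:.0f} ms")  # what your unluckiest callers actually get
```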
Lesson 2: Cache Everything That Doesn't Change Mid-Conversation
Here's a secret: about 40% of what our AI "thinks about" during a call doesn't actually require real-time computation.
Customer history? Cached. Product information? Cached. Common response patterns? Cached. We only hit our LLM for genuine reasoning tasks.
This caching strategy reduced our compute costs by 34% and improved average latency by 28%.
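The pattern itself is simple. Here's a minimal sketch of a call-scoped TTL cache in that spirit; the TTLs, lookup functions, and LLM call are placeholders, not our production stack.

```python
# A minimal sketch of "cache everything that doesn't change mid-conversation".
import time
from typing import Any, Callable


class TTLCache:
    def __init__(self) -> None:
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any], ttl_s: float) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]                       # cache hit: no recompute, no network
        value = loader()                        # slow path: CRM, catalog, or model
        self._store[key] = (now + ttl_s, value)
        return value


def fetch_crm_history(customer_id: str) -> list[str]:
    return [f"previous order for {customer_id}"]   # placeholder CRM lookup


def fetch_product_info() -> dict[str, str]:
    return {"sku-1": "wireless headphones"}        # placeholder catalog fetch


def call_llm(utterance: str, history: list[str], catalog: dict[str, str]) -> str:
    # Placeholder for the only step that genuinely needs real-time reasoning.
    return f"Handling '{utterance}' with {len(history)} history items in context."


cache = TTLCache()


def handle_turn(customer_id: str, utterance: str) -> str:
    # Static-for-the-call context comes out of the cache...
    history = cache.get_or_load(
        f"history:{customer_id}", lambda: fetch_crm_history(customer_id), ttl_s=300
    )
    catalog = cache.get_or_load("catalog:active", fetch_product_info, ttl_s=3600)
    # ...and only the genuine reasoning step pays for an LLM call.
    return call_llm(utterance, history, catalog)


if __name__ == "__main__":
    print(handle_turn("cust-42", "where is my order"))
```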
Lesson 3: Graceful Degradation Isn't Optional
Systems fail. Networks have bad days. Cloud providers have outages. The question isn't if something will break; it's what happens when it does.
Our degradation hierarchy:
- Primary region down: Auto-failover to secondary in <3 seconds
- LLM latency spike: Fall back to cached responses for common intents
- Integration failure: Queue actions, complete call, retry async
- Complete outage: Graceful handoff to human queue with context preserved
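Here's a minimal sketch of how a fallback chain like this looks in code, covering the LLM-latency-spike and human-handoff steps; the timeout, intents, and handoff queue are illustrative stand-ins.

```python
# A sketch of graceful degradation: try the primary path, degrade step by step,
# and never strand the caller.
import queue

CACHED_RESPONSES = {"order_status": "Your order is on its way."}
handoff_queue = queue.Queue()


def llm_response(intent: str, context: dict, timeout_s: float) -> str:
    # Placeholder for the real inference call; it "times out" here so the
    # example exercises the degradation path.
    raise TimeoutError(f"no response within {timeout_s}s")


def respond(intent: str, context: dict) -> str:
    try:
        return llm_response(intent, context, timeout_s=1.5)
    except TimeoutError:
        # LLM latency spike: fall back to cached responses for common intents.
        if intent in CACHED_RESPONSES:
            return CACHED_RESPONSES[intent]
        # Nothing sensible cached: hand off to a human with context preserved.
        handoff_queue.put({"intent": intent, "context": context})
        return "Let me connect you with someone who can help."


if __name__ == "__main__":
    print(respond("order_status", {"call_id": "demo"}))
    print(respond("warranty_claim", {"call_id": "demo"}))
```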
Lesson 4: Observability Is Your Immune System
We track 847 distinct metrics across our infrastructure. That sounds excessive until you realize that our October 3rd incident would have been caught by metric #312 (connection pool saturation rate) if we'd been watching it.
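For the curious, here's a sketch of the kind of check that metric implies: watch the saturation rate directly instead of waiting for latency to tell you. The pool sizes and thresholds are hypothetical.

```python
# Watch connection pool saturation, not just latency.
def pool_saturation(in_use: int, pool_size: int) -> float:
    return in_use / pool_size if pool_size else 1.0


def check_pool(in_use: int, pool_size: int,
               warn_at: float = 0.80, page_at: float = 0.95) -> str:
    s = pool_saturation(in_use, pool_size)
    if s >= page_at:
        return f"PAGE: connection pool {s:.0%} saturated"
    if s >= warn_at:
        return f"WARN: connection pool {s:.0%} saturated"
    return "ok"


if __name__ == "__main__":
    # 47 of 50 connections in use is 94% saturated: past warning, close to paging.
    print(check_pool(in_use=47, pool_size=50))
```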
Lesson 5: The Best Architecture Is One You Can Change
Our system looks nothing like it did 18 months ago. That's not technical debt; it's evolution.
Want to Join Our Engineering Team?
We're hiring engineers who love solving hard problems at scale.
View Open Roles →