
Building Scalable Voice AI: What We Learned from 1 Million Calls

The engineering playbook behind processing 1M+ AI-powered conversations, including the failures that taught us the most.

David Kim

VP of Engineering

Nov 28, 2024 · 14 min read

On March 15th at 2:47 AM, our system processed its one millionth call. I know the exact time because I was awake, watching our dashboards, convinced something would break. Nothing did. But getting there? That's a story worth telling.

This isn't a victory lap. It's an honest account of what it takes to build voice AI infrastructure that actually works at scale, including the spectacular failures that taught us more than our successes ever could.

Why This Matters: Voice AI has a latency budget of about 300ms before conversations feel awkward. That's 5-10x more demanding than typical web applications. Scale problems hit earlier and harder.
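
To make that budget concrete, here's a rough allocation of a 300ms conversational turn across a typical voice pipeline. The stage names and millisecond figures below are illustrative assumptions, not measurements from our system:

```python
# Illustrative latency budget for one conversational turn.
# Stage names and millisecond figures are assumptions, not measurements.
LATENCY_BUDGET_MS = {
    "network_ingress": 40,   # caller to nearest edge PoP
    "speech_to_text": 80,    # streaming STT finalization
    "llm_inference": 120,    # reasoning / response generation
    "text_to_speech": 40,    # first audio chunk synthesized
    "network_egress": 20,    # edge PoP back to caller
}

total = sum(LATENCY_BUDGET_MS.values())
assert total <= 300, f"over budget: {total}ms"
print(f"total turn latency: {total}ms of a 300ms budget")
```

Notice there's no slack. Every milliseconds you spend in one stage is a millisecond you have to claw back from another.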

The Day Everything Fell Apart

Let me start with our worst day: October 3rd, 2023.

We'd just signed our biggest customer, a national retail chain expecting 50,000 calls during their holiday promotion. We were confident. Our load tests looked great. We'd provisioned extra capacity.

At 9:03 AM on launch day, our average latency spiked from 180ms to 2.4 seconds. Conversations became impossible. Customers were hanging up. Our customer's support lines were melting down. My phone wouldn't stop buzzing.

What went wrong? Our architecture had a hidden bottleneck we'd never seen in testing. More on that in Lesson 4.

The Architecture That Finally Worked

After October 3rd, we rebuilt from first principles. Here's what our production system looks like today:

Edge Layer

Global PoPs keep us within 50ms of any caller. WebRTC termination, initial audio processing.

Processing Layer

Distributed STT/NLU processing. Auto-scaling Kubernetes clusters. Regional failover.

Intelligence Layer

LLM inference with response caching. Context management. Decision routing.

Integration Layer

CRM connectors, action execution, human handoff orchestration.
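
To show how a single conversational turn traverses these layers, here's a minimal, single-process sketch. Every name in it is hypothetical, and the awaits stand in for calls to distributed services:

```python
import asyncio
from dataclasses import dataclass

# Hypothetical stand-ins for the real distributed services.
@dataclass
class CallContext:
    caller_id: str
    transcript: list[str]

async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.08)                 # Processing Layer: streaming STT
    return "what's the status of my order"

async def respond(ctx: CallContext, text: str) -> str:
    await asyncio.sleep(0.12)                 # Intelligence Layer: LLM inference
    return "Your order shipped yesterday."

async def execute_actions(ctx: CallContext) -> None:
    await asyncio.sleep(0)                    # Integration Layer: CRM write, etc.

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.04)                 # back out through the Edge Layer
    return text.encode()

async def handle_turn(ctx: CallContext, audio: bytes) -> bytes:
    text = await transcribe(audio)
    ctx.transcript.append(text)
    reply = await respond(ctx, text)
    asyncio.create_task(execute_actions(ctx))  # never block the caller on integrations
    return await synthesize(reply)

if __name__ == "__main__":
    ctx = CallContext(caller_id="demo", transcript=[])
    print(asyncio.run(handle_turn(ctx, b"raw-audio")))
```

The one structural decision that matters here is the create_task: integration work runs off the latency-critical path, which is the subject of Lesson 3.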

The Five Lessons That Changed Everything

Lesson 1: Latency Is a Feature, Not a Metric

In web applications, the difference between 200ms and 400ms response time is barely noticeable. In voice AI, it's the difference between natural conversation and awkward silence.

We obsess over P99 latency, not averages. A system with 150ms average but 800ms P99 will frustrate 1 in 100 users consistently. They'll never trust it.

  • P50 latency: 147ms (median response time)
  • P95 latency: 203ms (95th percentile)
  • P99 latency: 289ms (worst 1% of calls)
  • Uptime: 99.97% (last 12 months)
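
If you're instrumenting this yourself, percentiles are cheap to compute from raw samples. A minimal sketch using Python's standard library; the latency samples are fabricated for illustration:

```python
import random
import statistics

# Fabricated latency samples (ms), for illustration only.
random.seed(42)
samples = [random.lognormvariate(5.0, 0.35) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pcts = statistics.quantiles(samples, n=100)
p50, p95, p99 = pcts[49], pcts[94], pcts[98]
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

Note how far the tail sits from the median even in synthetic data. That gap is what averages hide.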

Lesson 2: Cache Everything That Doesn't Change Mid-Conversation

Here's a secret: about 40% of what our AI "thinks about" during a call doesn't actually require real-time computation.

Customer history? Cached. Product information? Cached. Common response patterns? Cached. We only hit our LLM for genuine reasoning tasks.

This caching strategy reduced our compute costs by 34% and improved average latency by 28%.
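
The pattern itself is simple: a cache-first lookup in front of the LLM, keyed on things that can't change mid-conversation. This sketch uses an in-process TTL cache for brevity; in production you'd want a shared store such as Redis, and every name here is illustrative:

```python
import time
from typing import Callable

class TTLCache:
    """Tiny in-process TTL cache; use a shared store (e.g. Redis) in production."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        hit = self._store.get(key)
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]                       # cache hit: no LLM call
        value = compute()                       # genuine reasoning: hit the LLM
        self._store[key] = (time.monotonic(), value)
        return value

# Customer history and product info rarely change mid-conversation,
# so they get long TTLs; only novel questions reach the model.
product_cache = TTLCache(ttl_seconds=300)
answer = product_cache.get_or_compute(
    "product:sku-123", lambda: "call_llm() result goes here"
)
print(answer)
```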

Lesson 3: Graceful Degradation Isn't Optional

Systems fail. Networks have bad days. Cloud providers have outages. The question isn't if something will break; it's what happens when it does.

Our degradation hierarchy (sketched in code below the list):

  1. Primary region down: auto-failover to secondary in <3 seconds
  2. LLM latency spike: fall back to cached responses for common intents
  3. Integration failure: queue actions, complete the call, retry asynchronously
  4. Complete outage: graceful handoff to human queue with context preserved
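
Here's a minimal sketch of levels 2 and 4 of that hierarchy: a hard deadline on the LLM call, a cached-response fallback, and human handoff as the last resort. All function names and the 250ms deadline are illustrative:

```python
import asyncio

# Illustrative cached response for a common intent.
CACHED_RESPONSES = {"order_status": "Let me pull up your order right away."}

async def call_llm(intent: str) -> str:
    await asyncio.sleep(1.0)  # simulate an LLM latency spike
    return "fresh LLM response"

async def handoff_to_human(intent: str) -> str:
    # Level 4: preserve context and route to the human queue.
    return "Connecting you to a specialist who already has your details."

async def reply_for(intent: str) -> str:
    try:
        # Hard deadline: if the LLM blows the latency budget, degrade.
        return await asyncio.wait_for(call_llm(intent), timeout=0.25)
    except (TimeoutError, asyncio.TimeoutError):
        if intent in CACHED_RESPONSES:
            return CACHED_RESPONSES[intent]    # level 2: cached fallback
        return await handoff_to_human(intent)  # level 4: last resort

print(asyncio.run(reply_for("order_status")))
```

The caller never sees an error. They get a slightly more generic answer, which beats dead air every time.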

Lesson 4: Observability Is Your Immune System

We track 847 distinct metrics across our infrastructure. That sounds excessive until you realize that our October 3rd incident would have been caught by metric #312 (connection pool saturation rate) if we'd been watching it.
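
For flavor, here's the shape of the check that would have caught it. The 85% threshold and the pool numbers are illustrative, not our production values:

```python
# Connection pool saturation: active connections / pool size.
# Saturation near 1.0 means new calls queue for a connection, which
# surfaces downstream as a latency spike exactly like October 3rd's.
def check_connection_pools(pools: dict[str, tuple[int, int]]) -> list[str]:
    alerts = []
    for name, (active, size) in pools.items():
        saturation = active / size
        if saturation >= 0.85:  # alert well before 100%
            alerts.append(f"{name}: {saturation:.0%} saturated ({active}/{size})")
    return alerts

print(check_connection_pools({"stt-upstream": (96, 100), "crm-db": (40, 100)}))
```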

Lesson 5: The Best Architecture Is One You Can Change

Our system looks nothing like it did 18 months ago. That's not technical debt; it's evolution.

Hard-Won Wisdom: Design for replaceability, not permanence. The technology landscape changes too fast to bet everything on any single provider or approach.
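
One concrete way to buy that replaceability is to keep every vendor behind a narrow interface, so call-handling code never names a provider. A sketch with made-up vendor classes:

```python
from typing import Protocol

class SpeechToText(Protocol):
    """The narrow seam: swap vendors without touching call-handling code."""
    def transcribe(self, audio: bytes, language: str) -> str: ...

class VendorASTT:
    def transcribe(self, audio: bytes, language: str) -> str:
        return "transcript from vendor A"

class VendorBSTT:
    def transcribe(self, audio: bytes, language: str) -> str:
        return "transcript from vendor B"

def handle_audio(stt: SpeechToText, audio: bytes) -> str:
    # Call sites depend on the Protocol, never on a concrete vendor.
    return stt.transcribe(audio, language="en")

print(handle_audio(VendorASTT(), b"..."))  # swap in VendorBSTT() at will
```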

The Numbers Today

  • Calls processed: 1.2M+ (and counting)
  • Languages: 12 (real-time support)
  • Concurrent calls: 15K+ (peak capacity tested)
  • Infrastructure cost saved: $2.3M (via optimization work)

For Engineers Building Similar Systems: Start with observability. Build for failure. Cache aggressively. And remember: the system that survives isn't the most sophisticated one; it's the one that fails gracefully.

Want to Join Our Engineering Team?

We're hiring engineers who love solving hard problems at scale.

View Open Roles →
Tags: Engineering · Scalability · Architecture · Performance · Infrastructure
Written by

David Kim

VP of Engineering

David has spent 15 years building systems that scale. Previously led infrastructure at Stripe and AWS. He believes the best systems are invisible.

@davidkim_eng