If my business relied heavily on voice agents through platforms like Retell, ElevenLabs, or Vapi, and I wanted to slash my voice AI costs from around $9/hour to under $2/hour, here's exactly how I'd approach it 👇
First, break down the real costs. These platforms typically charge about $0.15/minute to manage your STT → LLM → TTS pipeline, with TTS alone making up roughly half of that.
Next, I'd rent a GPU server ($500-$1000/month) and set up an open-source TTS model like StyleTTS2, fine-tuning it until I achieve the exact voice I need. Sure, you could run it on your home GPU, but leveraging cloud credits from AWS, Google, or Azure is more professional and scalable.
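To sanity-check the GPU rental, here's a rough break-even sketch using only the numbers above: a $0.15/minute platform rate, TTS at roughly half of that, and a $500-$1000/month server. The function name and the exact 50% share are my own simplifications, and it ignores fine-tuning time and how many concurrent calls one GPU can handle.

```python
# Rough break-even: how many call minutes per month before a rented GPU
# costs less than the TTS share of the platform fee.
PLATFORM_RATE_PER_MIN = 0.15  # from the post: about $0.15/minute
TTS_SHARE = 0.5               # assumption: "roughly half" taken as exactly 50%


def break_even_minutes(gpu_monthly_cost: float, tts_rate_per_min: float) -> float:
    """Monthly call minutes at which the GPU rental equals the platform TTS spend."""
    return gpu_monthly_cost / tts_rate_per_min


tts_rate = PLATFORM_RATE_PER_MIN * TTS_SHARE  # about $0.075/minute

for monthly in (500, 1000):
    minutes = break_even_minutes(monthly, tts_rate)
    print(f"${monthly}/month GPU breaks even at about {minutes:,.0f} call minutes "
          f"({minutes / 60:,.0f} hours) per month")
```

Below that volume the hosted TTS is still the cheaper option, so this only makes sense if you're running a lot of call minutes.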
For the LLM component, I'd simply leverage services like Claude, GPT-4o, or Gemini. The price and performance curve is still evolving quickly, making custom LLM development unnecessary at this stage.
I'd then deploy Deepgram for speech-to-text, fine-tuning it specifically for my use case, and stream data between the autoregressive and non-autoregressive models in the pipeline. This takes some tweaking, but it's manageable with the right tools.
Infrastructure-wise, I'd deploy Deepgram directly onto my servers to cut network round trips, which frees up latency budget for larger LLM prompts. I'd layer in aggressive caching, especially at the TTS stage: repeated inputs yield identical audio outputs, so cached phrases skip synthesis entirely, which improves both performance and cost-efficiency.
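Here's a minimal sketch of what that TTS cache could look like, assuming synthesis is deterministic for a given voice and text. TTSCache and fake_synthesize are hypothetical names; the stand-in function is where a call to your own TTS model would go, and a real deployment would probably want a shared store with eviction rather than a plain dict.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """In-memory cache in front of a TTS function (hypothetical helper)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(voice: str, text: str) -> str:
        # Collapse whitespace so trivially different strings share one entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice}\n{normalized}".encode("utf-8")).hexdigest()

    def speak(self, voice: str, text: str) -> bytes:
        key = self._key(voice, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for synthesis once, then reuse the audio.
        self.misses += 1
        audio = self._synthesize(voice, text)
        self._store[key] = audio
        return audio


def fake_synthesize(voice: str, text: str) -> bytes:
    # Stand-in for a real TTS call; returns placeholder bytes, not audio.
    return f"<audio voice={voice} text={text}>".encode("utf-8")


cache = TTSCache(fake_synthesize)
first = cache.speak("agent-v1", "Thanks for calling, how can I help?")
second = cache.speak("agent-v1", "Thanks for calling,  how can I help?")
print(f"hits={cache.hits} misses={cache.misses}")
```

Greetings, hold messages, and confirmations repeat constantly on phone calls, so those are the phrases most likely to hit the cache.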
Finally, I'd rely on AWS, Azure, or GCP for compute scalability and Twilio for telephony.
With this approach, my costs drop dramatically to roughly $0.03/minute (around $1.80/hour), an 80% reduction.
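The arithmetic is easy to check. A quick sketch, using the $0.15/minute and $0.03/minute figures from this post (the function names are just for illustration):

```python
def per_hour(rate_per_min: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return rate_per_min * 60


def reduction_pct(old_rate: float, new_rate: float) -> float:
    """Percentage saved when moving from old_rate to new_rate."""
    return (old_rate - new_rate) / old_rate * 100


platform_rate = 0.15  # $/minute on a managed platform
self_hosted_rate = 0.03  # $/minute with the setup described here

print(f"Platform:    ${per_hour(platform_rate):.2f}/hour")
print(f"Self-hosted: ${per_hour(self_hosted_rate):.2f}/hour")
print(f"Reduction:   {reduction_pct(platform_rate, self_hosted_rate):.0f}%")
```

Swap in your own per-minute rates to see where you'd land.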
Alternatively, if building this isn't your thing, I'd simply hop onto Primvoices.com and achieve similar savings (70%+) without any hassle.