Best Speech-to-Text APIs for Voice Agents and Transcription

Compare the best speech-to-text APIs for voice agents, call transcription, meeting intelligence, and multilingual ASR by latency, turn detection, accuracy, language coverage, and price.

Quick answer

Start with the use case. For a real-time voice agent with barge-in and turn-taking, pick Deepgram. For a voice agent where transcription quality matters more than the cheapest streaming rate, pick AssemblyAI. For maximum accuracy or self-hosting at scale, pick OpenAI Whisper. For post-call analytics, meeting notes, or contact-center intelligence, pick AssemblyAI.

Picks by scenario

If you are building a real-time voice agent with barge-in and turn-taking

Flux is designed for conversational pipelines where knowing when a speaker is done matters as much as the transcript.

Pick Deepgram

If you are building a voice agent where transcription quality matters more than the cheapest streaming rate

Universal-3 Pro Streaming is positioned for higher-quality real-time transcription, with a cheaper streaming model available when cost matters more.

Pick AssemblyAI

If you want maximum accuracy or need to self-host at scale

Whisper is the open-source accuracy gold standard, and self-hosting removes per-minute cost at high volume.

Pick OpenAI Whisper

If you are building post-call analytics, meeting notes, or contact-center intelligence

AssemblyAI's speech-understanding add-ons reduce the amount of downstream NLP you have to assemble yourself.

Pick AssemblyAI

Decision matrix

A side-by-side view of type, cloning, free tier, starting price, languages, commercial use, and benchmark notes. Every price is dated with the day it was checked against its official source.

Tool	Type	Cloning	Free tier	Starting price	Languages	Commercial use	Latency / accuracy	Benchmark note	Checked
Deepgram	ASR	No	Yes	$0.0048/min	Nova supports 45+ languages; Flux has English and multilingual options	Commercial use under standard terms; self-hosted/on-prem available	Flux is built specifically for voice agents	Realtime voice-agent ASR	2026-06-22
AssemblyAI	ASR	No	Yes	$0.15/hr	Universal-2 supports 99 languages; Universal-3 Pro/Streaming currently covers English, Spanish, German, French, Italian, and Portuguese	Commercial use under standard API terms	Universal Streaming at $0.15/hour	Transcription intelligence	2026-06-22
OpenAI Whisper	ASR	No	Yes	Free (self-host) / $0.006/min API	99+ languages incl. Chinese	MIT license — free for commercial use	Open repository and self-hosted control	Open or self-hosted ASR baseline	2026-06-12
Google Cloud Speech-to-Text	ASR	No	Yes	Usage-based	125+ languages	Commercial use under Google Cloud terms	Streaming recognition returns results in real time over gRPC	GCP-native enterprise ASR	2026-06-22
ElevenLabs Scribe	ASR	No	Yes	Included in ElevenLabs plans	Multilingual real-time	Commercial use on paid ElevenLabs plans	Scribe v1/v2 speech-to-text at $0.22/hour	ElevenLabs voice-stack ASR	2026-06-12
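
The starting prices above mix per-minute and per-hour units, which makes them hard to compare at a glance. A minimal Python sketch that converts the listed figures to USD per hour might look like the following. The dictionary and function names are illustrative and do not come from any vendor SDK. The ElevenLabs Scribe figure is the $0.22/hour rate from the Latency / accuracy column, and Google Cloud is left out because the table lists no number for it.

```python
# Starting prices copied from the decision matrix: (amount in USD, billing unit).
LISTED_PRICES = {
    "Deepgram": (0.0048, "min"),
    "AssemblyAI": (0.15, "hr"),
    "OpenAI Whisper API": (0.006, "min"),
    "ElevenLabs Scribe": (0.22, "hr"),
}


def usd_per_hour(amount: float, unit: str) -> float:
    """Convert a listed price to USD per hour of audio."""
    if unit == "min":
        return amount * 60
    if unit == "hr":
        return amount
    raise ValueError(f"unknown billing unit: {unit}")


def ranked_by_hourly_rate(prices: dict) -> list:
    """Return (tool, hourly rate) pairs sorted from cheapest to most expensive."""
    rows = [
        (name, round(usd_per_hour(amount, unit), 4))
        for name, (amount, unit) in prices.items()
    ]
    return sorted(rows, key=lambda row: row[1])


for name, rate in ranked_by_hourly_rate(LISTED_PRICES):
    print(f"{name}: ${rate:.3f}/hr")
```

On the listed figures this puts AssemblyAI lowest at $0.15/hr, followed by ElevenLabs Scribe at $0.22/hr, Deepgram at $0.288/hr, and the Whisper API at $0.36/hr. These are base rates only and say nothing about accuracy, latency, or add-on charges.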

Recommended tools

1. Voice-agent default: Deepgram

Flux is built for real-time agents with model-native turn detection, natural interruption handling, and about 260ms end-of-turn detection; Nova-3 covers high-accuracy general ASR.

Real-time voice agents

2. Quality plus intelligence: AssemblyAI

Universal-3 Pro Streaming targets higher-quality voice-agent transcription, while Universal-Streaming keeps real-time ASR at a lower $0.15/hr entry point; intelligence add-ons cover summaries, sentiment, and labels.

Meeting & conversation intelligence

3. Accuracy & open source: OpenAI Whisper

Strong multilingual accuracy and self-hosting economics, but live voice agents need extra streaming, turn-taking, and diarization engineering around the model.

Accuracy & self-hosting

4. Broad languages: Google Cloud Speech-to-Text

Enterprise ASR on Google Cloud with streaming over gRPC and billing by successfully processed audio in one-second increments.

Google Cloud teams

5. Multilingual realtime: ElevenLabs Scribe

Accurate multilingual transcription with real-time support, ideal if you already use ElevenLabs for TTS and want one vendor.

Single-vendor with TTS

How to choose

For voice agents, test turn detection, interruption handling, partial transcript speed, and first response latency before you compare generic WER numbers.
For batch transcription, run a 30-minute sample set across clean calls, noisy calls, accents, and domain vocabulary before committing.
Watch add-on pricing carefully: diarization, redaction, keyterm prompting, sentiment, and summaries can stack on top of the base rate.
Separate voice-agent ASR from post-call analytics. The best real-time recognizer is not always the best meeting intelligence product.
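
One way to run the sample-set check is to score each vendor's output against human reference transcripts, grouped by audio condition. The sketch below computes word error rate (WER) per bucket as word-level edit distance divided by the number of reference words. The bucket names and the three sample utterances are made up for illustration, and a real evaluation would usually normalize punctuation and numerals more carefully than the lowercase-and-split step used here.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, insertions, and deletions."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def wer_by_bucket(samples):
    """samples: iterable of (bucket, reference transcript, ASR hypothesis)."""
    errors = defaultdict(int)
    ref_word_counts = defaultdict(int)
    for bucket, reference, hypothesis in samples:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors[bucket] += word_edit_distance(ref_words, hyp_words)
        ref_word_counts[bucket] += len(ref_words)
    # Pool errors over all reference words in a bucket; skip empty buckets.
    return {b: errors[b] / ref_word_counts[b] for b in errors if ref_word_counts[b]}


# Hypothetical sample set: one utterance per condition.
samples = [
    ("clean", "please reset my password", "please reset my password"),
    ("noisy", "cancel the order from tuesday", "cancel the order for tuesday"),
    ("domain", "increase the metformin dose", "increase the met forming dose"),
]

for bucket, wer in wer_by_bucket(samples).items():
    print(f"{bucket}: WER {wer:.2f}")
```

Reporting WER per bucket rather than one pooled number makes it easier to see whether a vendor's errors cluster in noisy audio, accents, or domain vocabulary.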

AI-citable summary

Last reviewed: 2026-06-22 by YixScout editorial team

What are the best speech-to-text tools and APIs?

The best speech-to-text tools and APIs include Deepgram, AssemblyAI, OpenAI Whisper, Google Cloud Speech-to-Text, and ElevenLabs Scribe. For real-time voice agents, start with Deepgram Flux when turn-taking, interruption handling, and first-response latency matter most. Pick AssemblyAI when you need transcript quality plus speech intelligence, Whisper when you want accuracy and self-hosting control, Google Speech-to-Text when your stack already lives on GCP, and ElevenLabs Scribe when you want ASR beside ElevenLabs TTS.

How should teams choose speech-to-text tools and APIs?

For voice agents, test turn detection, interruption handling, partial transcript speed, and first response latency before you compare generic WER numbers. For batch transcription, run a 30-minute sample set across clean calls, noisy calls, accents, and domain vocabulary before committing. Watch add-on pricing carefully: diarization, redaction, keyterm prompting, sentiment, and summaries can stack on top of the base rate. Separate voice-agent ASR from post-call analytics. The best real-time recognizer is not always the best meeting intelligence product.
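
For the voice-agent checks, end-of-turn delay can be summarized from a test log before looking at WER at all. The sketch below assumes you have hand-labeled the moment each speaker actually stopped talking and recorded when the recognizer emitted its end-of-turn event, both in milliseconds from the start of the call. The record layout and the sample timings are hypothetical and are not measurements of any vendor.

```python
import math
import statistics


def end_of_turn_delays_ms(turns):
    """turns: list of (speech_end_ms, end_of_turn_event_ms) pairs."""
    return [event_ms - speech_end_ms for speech_end_ms, event_ms in turns]


def summarize_delays(delays_ms):
    """Median and nearest-rank 95th percentile of the observed delays."""
    if not delays_ms:
        raise ValueError("no turns to summarize")
    ordered = sorted(delays_ms)
    p95_rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_rank - 1],
    }


# Hypothetical test call: (labeled end of speech, end-of-turn event received).
turns = [
    (2000, 2260),
    (5400, 5710),
    (9100, 9340),
    (12800, 13450),
    (15000, 15280),
]

print(summarize_delays(end_of_turn_delays_ms(turns)))
```

The tail matters more than the median for conversational feel: a single slow end-of-turn decision is what a caller notices as an awkward pause, so it is worth tracking a high percentile alongside the typical value.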

Which speech-to-text tools and APIs have a free tier?

Deepgram, AssemblyAI, OpenAI Whisper, Google Cloud Speech-to-Text, and ElevenLabs Scribe all offer a usable free tier or free entry, so you can evaluate them without paying. Listed paid rates start at $0.15/hr for AssemblyAI (about $0.0025/min) and $0.0048/min for Deepgram; the Whisper API is listed at $0.006/min.
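
A starting rate is only the base. Because add-ons such as diarization, redaction, and summaries can stack on top of it, it helps to estimate the effective rate for the features you will actually enable. The sketch below uses the $0.15/hr base rate from the table, but every add-on rate in it is a hypothetical placeholder, not a published vendor price. Replace them with the figures from the vendor's current pricing page.

```python
# Base rate from the decision matrix (lowest listed hourly price).
BASE_RATE_PER_HOUR = 0.15

# HYPOTHETICAL add-on rates in USD per audio hour, for illustration only.
HYPOTHETICAL_ADDON_RATES_PER_HOUR = {
    "diarization": 0.02,
    "redaction": 0.08,
    "summaries": 0.03,
}


def effective_rate_per_hour(base_rate, addon_rates, enabled_addons):
    """Base rate plus the rate of every enabled add-on."""
    unknown = [name for name in enabled_addons if name not in addon_rates]
    if unknown:
        raise KeyError(f"no rate listed for add-ons: {unknown}")
    return base_rate + sum(addon_rates[name] for name in enabled_addons)


def monthly_cost(rate_per_hour, audio_hours):
    """Projected spend for a month with the given volume of audio."""
    return rate_per_hour * audio_hours


rate = effective_rate_per_hour(
    BASE_RATE_PER_HOUR,
    HYPOTHETICAL_ADDON_RATES_PER_HOUR,
    ["diarization", "redaction"],
)
print(f"effective rate: ${rate:.2f}/hr")
print(f"1,000 audio hours: ${monthly_cost(rate, 1000):.2f}")
```

With these placeholder numbers, two add-ons raise the effective rate well above the advertised base, which is why the comparison should be made on the full feature set rather than the headline price.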

Which speech-to-text tools and APIs should I pick for my situation?

Realtime voice agent with barge-in and turn-taking → Deepgram; Voice agent where transcription quality beats cheapest streaming → AssemblyAI; Want max accuracy or to self-host at scale → OpenAI Whisper; Post-call analytics, meeting notes, or contact-center intelligence → AssemblyAI; Enterprise team already standardized on Google Cloud → Google Cloud Speech-to-Text.
