Best Low-Latency TTS API: Cartesia, ElevenLabs, OpenAI TTS, and Azure

AI Audio | 2026-06-22 | YixScout editorial team | Last reviewed: 2026-06-25 by YixScout editorial team

The best low-latency TTS API for a realtime voice agent is usually Cartesia Sonic when first-audio speed and interruptible conversation are the main constraints. If you also need a broad voice library and cloning workflow, test ElevenLabs Flash v2.5 beside it; if your stack is already OpenAI-first, start with OpenAI TTS or the live audio path; if governance and multilingual enterprise coverage matter more than raw first-byte speed, compare Azure AI Speech. Fish Audio and Chatterbox belong in the second pass when pay-as-you-go economics, creator cloning, or open deployment matter.

Quick answer: measure time to first audio, finish latency, streaming behavior, and client playback delay separately. A model's advertised latency is only one piece of the voice-agent experience; LLM generation, region, codec, websocket/WebRTC transport, buffering, and concurrency can dominate what the user hears.
Benchmark snapshot: Vendor claim data from Cartesia and ElevenLabs is useful for shortlisting low-latency TTS APIs, while OpenAI and Azure require stack-level testing because transport, region, codec, and client buffering often dominate perceived latency. Treat vendor claims as a shortlist filter, then run a same-region P50/P90 local test before production. Local test pending.
Figure: Low-latency TTS API decision map (original diagram, checked 2026-06-25). Select a TTS API by first audio, streaming, cloning needs, ecosystem fit, open deployment, and governance constraints before benchmarking it in your full voice-agent stack.

For low-latency TTS, first-byte or first-audio latency matters more than total file generation time because the user can begin hearing speech before the full response is complete. Microsoft Azure's latency guide makes the same distinction: first-byte latency is usually lower than finish latency, and streaming is critical because playback can begin when the first audio chunk arrives.
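
The distinction between first-audio and finish latency can be sketched as a small timing wrapper around any chunked audio stream. This is a minimal illustration, not a vendor client: `time_audio_stream`, `StreamTiming`, and `fake_stream` are hypothetical names, and the stand-in stream simulates network delay with `time.sleep`. In a real test, the iterable would wrap the vendor's streaming response, and the clock should start before the request is sent.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    first_audio_s: Optional[float]  # None if the stream produced no audio
    finish_s: float                 # time until the last chunk was consumed
    total_bytes: int


def time_audio_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a chunked audio stream and record both latencies."""
    start = clock()
    first_audio = None
    total = 0
    for chunk in chunks:
        # The first non-empty chunk marks the moment playback could begin.
        if chunk and first_audio is None:
            first_audio = clock() - start
        total += len(chunk)
    return StreamTiming(first_audio, clock() - start, total)


def fake_stream(delays_s: Iterable[float], chunk: bytes = b"\x00" * 320) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor stream: wait, then yield a chunk."""
    for delay in delays_s:
        time.sleep(delay)
        yield chunk


timing = time_audio_stream(fake_stream([0.02, 0.01, 0.01]))
print(f"first audio: {timing.first_audio_s * 1000:.0f} ms, "
      f"finish: {timing.finish_s * 1000:.0f} ms, bytes: {timing.total_bytes}")
```

Client playback delay is not captured here; it would need a separate measurement at the audio output device.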

Cartesia is the clearest first test when the requirement is specifically a low-latency TTS API. Cartesia's current docs say Sonic 3.5 can stream the first byte of audio in 90ms and position Sonic for realtime conversational experiences, dubbing, narration, and AI avatars. That makes it the most direct match for interruptible assistants, phone agents, and avatar products where silence after the LLM answer feels broken.

ElevenLabs is the better first test when latency is important but voice quality, cloning, and creator-facing voice workflows are just as important. Its current docs list Flash v2.5 as the low-latency model for realtime applications, with about 75ms latency, 32 supported languages, and a 40,000 character limit. Its API pricing page lists Flash/Turbo TTS at $0.05 per 1K characters, so cost modeling should use character volume as well as plan allowances.
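
A rough cost model for per-character billing might look like the sketch below. The $0.05 per 1K characters rate comes from the pricing cited above; the function name, the `included_chars` plan allowance, and the monthly volume are illustrative assumptions, so substitute the figures from your own plan.

```python
def monthly_tts_cost_usd(
    characters: int,
    usd_per_1k_chars: float,
    included_chars: int = 0,
) -> float:
    """Estimate monthly spend for per-character TTS billing.

    included_chars models a plan allowance that is consumed before
    metered billing starts (an assumption; check the vendor's plan terms).
    """
    if characters < 0 or included_chars < 0:
        raise ValueError("character counts must be non-negative")
    billable = max(0, characters - included_chars)
    return billable / 1000 * usd_per_1k_chars


# Hypothetical volume: 2 million characters per month at the cited Flash/Turbo rate.
no_allowance = monthly_tts_cost_usd(2_000_000, 0.05)
with_allowance = monthly_tts_cost_usd(2_000_000, 0.05, included_chars=500_000)
print(f"without allowance: ${no_allowance:.2f}")
print(f"with 500K included: ${with_allowance:.2f}")
```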

Fish Audio is not the lowest-latency specialist in this shortlist, but it is useful when creator voice cloning and simple usage-based billing are the main constraints. Its current developer pricing says API access is pay-as-you-go with no subscription fees or monthly minimums, and both current TTS model lines are priced per UTF-8 byte rather than per character plan bundle. That makes Fish Audio worth modeling when long-form volume and voice cloning economics are the question.
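
Per-byte pricing matters because UTF-8 byte counts diverge from character counts outside ASCII. The sketch below compares the two for a few sample strings; the helper names and the `usd_per_million_bytes` parameter are hypothetical, and the rate passed in is a placeholder rather than Fish Audio's actual price.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def cost_by_bytes_usd(text: str, usd_per_million_bytes: float) -> float:
    """Estimate cost under per-byte billing (rate is a caller-supplied placeholder)."""
    return utf8_byte_count(text) / 1_000_000 * usd_per_million_bytes


samples = {
    "English": "Hello",
    "French": "café",
    "Japanese": "こんにちは",
}
for label, text in samples.items():
    print(f"{label}: {len(text)} characters, {utf8_byte_count(text)} bytes")
```

Scripts such as Japanese or Chinese typically use three bytes per character in UTF-8, so a per-character quote and a per-byte quote are not directly comparable for multilingual workloads.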

Chatterbox is the open or self-hosted path rather than a hosted API default. Resemble AI describes Chatterbox as open source and MIT licensed, with zero-shot voice cloning, emotion control, realtime voice synthesis, and on-premise deployment. Treat it as an engineering option for teams that can operate models, manage GPU capacity, and accept more implementation responsibility in exchange for control.

OpenAI TTS fits when the product already uses OpenAI models and the team wants the simplest integration path. OpenAI's speech docs recommend `gpt-4o-mini-tts` for intelligent realtime applications, note that `tts-1` has lower latency than `tts-1-hd` at lower quality, and document realtime audio streaming with chunk transfer encoding. For a full speech-to-speech agent with barge-in and natural turn taking, OpenAI points builders toward the live audio API path rather than a TTS-only pipeline.

Azure AI Speech is not the page's lowest-latency specialist pick, but it belongs in the shortlist for enterprise voice products and for readers comparing Azure AI Speech text-to-speech pricing and its free tier. Microsoft documents standard neural voices across 100+ languages and locales, Free F0 Neural Text to Speech at 0.5 million characters per month, custom voice options, Speech SDK and REST access, SSML controls, and per-character billing. Choose Azure when regional deployment, compliance expectations, custom branded voice governance, and existing Azure infrastructure outweigh shaving another few milliseconds from first audio.

A practical benchmark should send the same short prompt, same paragraph-length prompt, and same interruption-prone dialog turn to every vendor from the same region. Record time to first audio, time to usable playback, finish latency, audio duration, sample rate and codec, websocket or HTTP behavior, retry behavior, and cost per 1,000 characters or per minute. Then repeat under concurrency, because a provider that feels fast in a single request may behave differently at call-center volume.
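
One way to aggregate such a benchmark is to run the same measurement under a fixed concurrency level and report P50 and P90. The sketch below uses a nearest-rank percentile and a thread pool; `percentile`, `run_benchmark`, and `measure_once` are hypothetical names, and the stand-in measurement returns simulated latencies rather than calling any vendor.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering pct percent of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_benchmark(
    measure_once: Callable[[], float],
    requests: int,
    concurrency: int,
) -> Dict[str, float]:
    """Call measure_once `requests` times with up to `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(lambda _: measure_once(), range(requests)))
    return {
        "n": len(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
    }


def simulated_first_audio_s() -> float:
    """Stand-in measurement; replace with a real time-to-first-audio call."""
    return random.uniform(0.08, 0.30)


print(run_benchmark(simulated_first_audio_s, requests=50, concurrency=8))
```

Running the same harness at concurrency 1 and at the expected production level makes the single-request versus call-center difference visible in the same units.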

Decision rule: pick Cartesia for realtime voice-agent latency, ElevenLabs for expressive cloned voices with low latency, OpenAI TTS for OpenAI-native product stacks, Azure AI Speech for multilingual enterprise deployments, Fish Audio for usage-based creator voice workflows, and Chatterbox for open or self-hosted experiments. Keep `/topics/best-tts` as the hub, then use `/alternatives/cartesia` and `/compare/elevenlabs-vs-cartesia` before a production pilot.
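
The decision rule can be written down as a simple lookup, which is handy when several product constraints apply at once. This is only a sketch of the rule as stated in this guide: the priority keys and the `shortlist` function are hypothetical labels, and the output is a testing order, not a benchmark result.

```python
from typing import Iterable, List

# Mapping taken from the decision rule above; key names are illustrative.
FIRST_TEST_BY_PRIORITY = {
    "realtime_latency": "Cartesia",
    "expressive_cloning": "ElevenLabs",
    "openai_stack": "OpenAI TTS",
    "enterprise_multilingual": "Azure AI Speech",
    "usage_based_creator": "Fish Audio",
    "self_hosted": "Chatterbox",
}


def shortlist(priorities: Iterable[str]) -> List[str]:
    """Return vendors to test, in priority order, without duplicates."""
    picks: List[str] = []
    for priority in priorities:
        vendor = FIRST_TEST_BY_PRIORITY.get(priority)
        if vendor is None:
            raise KeyError(f"unknown priority: {priority!r}")
        if vendor not in picks:
            picks.append(vendor)
    return picks


print(shortlist(["realtime_latency", "expressive_cloning"]))
```
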
Related resource guides: if the same voice agent also needs listening, pair this Best Low-Latency TTS API guide with Best Speech-to-Text APIs at `/resources/columns/best-speech-to-text-apis`, then connect the audio loop through `/topics/best-asr`, `/tools/deepgram`, `/tools/assemblyai`, and `/compare/deepgram-vs-assemblyai`.

FAQ answer block: a low-latency TTS API is a speech synthesis service that can begin streaming audio quickly enough for conversational interfaces. For voice agents, prioritize first-audio latency, streaming playback, interruption handling, codec support, region placement, and stable performance under concurrency; for narration or audiobooks, prioritize long-form quality, editing controls, rights, and total cost instead.

Sources checked 2026-06-25: Cartesia overview and pricing, ElevenLabs TTS docs and API pricing, OpenAI text-to-speech and voice-agent docs, Azure text-to-speech overview, REST API, latency guide, Free F0 allowance, Fish Audio pricing, Chatterbox model page, and Speech pricing mechanics. Refresh due 2026-07-25.
