By Alex · Updated Apr 12, 2026

Transcription models turn speech into text through an API or a self-hosted checkpoint — the piece you build on when your product needs to listen. The easy choices used to be “OpenAI Whisper” and “pick a vendor,” but neither still holds: Whisper is no longer the accuracy leader on any major benchmark, and a wave of 2025 releases reshuffled the field. We compared 52 models across accuracy, speed, price, language coverage, and deployment mode to pick the 11 worth your attention.
This is a roundup of transcription models — the APIs and open-weight checkpoints developers integrate. If you want a ready-to-use app (Otter, Descript, MacWhisper, Superwhisper), see our separate AI transcription tools roundup.
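The accuracy figures quoted throughout are word error rates (WER): the word-level edit distance between a model's transcript and a human reference, divided by the reference length. If you want to score candidates on your own audio rather than trust the benchmarks, the metric is small enough to implement directly — a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row: distance from an empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # delete a reference word
                      d[j - 1] + 1,     # insert a hypothesis word
                      prev + (r != h))  # substitute, or match for free
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)
```

Normalize casing and punctuation the same way across models before comparing, since benchmark WER differences of a point or two can otherwise disappear into formatting noise.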

Best Transcription Models

# | Model | Best For | Architecture | Access | Price per 1K min
1 | AssemblyAI Universal-3 Pro | Best for most builders | Dedicated ASR | Managed API | $3.50
2 | Deepgram Nova-3 | Best for voice agents | Dedicated ASR | Managed API | ~$4.30
3 | ElevenLabs Scribe v2 | Best for accuracy-critical audio | Dedicated ASR | Managed API | $6.67
4 | Whisper Large v3 Turbo (on Groq) | Best for price-sensitive builds | Dedicated ASR | Managed API + open weights | $0.67
5 | Google Gemini family | Best for mixed audio tasks | Multimodal LLM | Managed API | $0.19 – $18.40
6 | Mistral Voxtral Small | Best open-weight accuracy | Speech-augmented LLM | Managed API + open weights | $4.00 (API) / self-host
7 | NVIDIA Parakeet TDT 0.6B | Best for batch throughput | Dedicated ASR | Open weights + NIM | Self-host
8 | OpenAI GPT-4o-transcribe | Best for OpenAI-native stacks | Multimodal LLM | Managed API | $6.00
9 | NVIDIA Canary-Qwen-2.5B | Best small open-weight model | Speech-augmented LLM | Open weights | Self-host
10 | Cohere Transcribe 03-2026 | Best benchmark leader today | Dedicated ASR | Managed API + open weights | Free tier / Vault
11 | Speechmatics Enhanced | Best for non-English and code-switching | Dedicated ASR | Managed API | ~$6.70

1. AssemblyAI Universal-3 Pro: Best for most builders

Universal-3 Pro is the model we’d pick first if someone handed us a new project and said “just make the speech-to-text piece work.” It lands near the top of every benchmark that matters — 3.2% word error rate on Artificial Analysis, #11 on the Hugging Face Open ASR Leaderboard — at half the list price of the accuracy leaders, and AssemblyAI ships separate batch and streaming variants so you don’t have to pick one and hope. What surprised us most was how thoroughly feature-complete the API is: diarization, word-level timestamps, custom vocabulary, PII redaction, prompting with domain hints, and disfluency control are all first-class parameters rather than partner integrations. For a builder evaluating a dozen STT APIs on a weekend, Universal-3 Pro is the one that leaves the fewest gaps to fill in yourself.

Key Features

  • Top-tier batch accuracy with a sibling streaming model (Universal-Streaming) that shares the same API shape, so switching between the two is a one-line change
  • Prompting support — feed product names, acronyms, and speaker context as text at request time; the model weights them into decoding
  • Speaker diarization and word-level timestamps returned in the same response; no separate post-processing pass
  • PII redaction and entity detection built into the API rather than behind a partner integration
  • Universal-3 Pro specifically exposes disfluency control, so you can choose whether “um” and “uh” survive into the final transcript
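Because all of these are plain request parameters, a batch job is one JSON POST. A sketch against AssemblyAI's v2 transcript endpoint — the field names follow their public API docs, but the PII policy list here is illustrative and model routing for Universal-3 Pro specifically may differ, so verify against current documentation:

```python
import json
import urllib.request

ASSEMBLYAI_URL = "https://api.assemblyai.com/v2/transcript"

def build_transcript_request(audio_url: str, vocab: list[str]) -> dict:
    """JSON body for a batch transcription job with the features listed above."""
    return {
        "audio_url": audio_url,
        "speaker_labels": True,   # diarization, returned in the same response
        "word_boost": vocab,      # custom vocabulary / domain hints
        "redact_pii": True,
        "redact_pii_policies": ["medical_condition", "credit_card_number"],  # illustrative
    }

def submit(api_key: str, body: dict) -> dict:
    """POST the job; AssemblyAI returns a transcript id you then poll."""
    req = urllib.request.Request(
        ASSEMBLYAI_URL,
        data=json.dumps(body).encode(),
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same body shape works for both the batch and streaming-sibling workflows conceptually, but streaming goes over a WebSocket rather than this REST route.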

Pros

  • The first-party developer experience is the best in the category. Docs, SDKs, and error messages are written as if someone actually used them before publishing
  • Real-world accuracy on noisy, multi-speaker audio (meetings, customer calls) matches or beats every proprietary competitor except ElevenLabs Scribe v2 — but at a meaningfully lower price point
  • The streaming variant’s end-of-turn detection is sharp enough to use for voice agents without bolting on a separate VAD stage
  • Custom vocabulary and prompting mean domain jargon (medical, legal, financial) doesn’t require a fine-tune to work

Cons

  • There’s a real gap between Universal-3 Pro’s 3.2% benchmark WER and the 2.3% you’d get from ElevenLabs Scribe v2 — if a few extra errors per thousand words is a dealbreaker for your use case, the accuracy ceiling isn’t here
  • Streaming is a separate model from batch, not a mode switch, so you’ll maintain two integrations if you need both
  • AssemblyAI’s free tier is time-limited rather than ongoing, so you can’t leave a prototype running on it indefinitely — if you need perpetually free low-volume access, Groq Whisper Turbo is the better fit

Pricing

Plan | Price | What’s Included
Free trial | Limited credits | Full Universal-3 Pro access for evaluation
Pay-as-you-go | $3.50 per 1K minutes (batch) | Universal-3 Pro, diarization, timestamps, custom vocab, PII redaction
Streaming | Separate tier | Universal-Streaming model for real-time voice-agent use
Enterprise | Custom | SLA, dedicated support, VPC deployment, HIPAA BAA

Deployment / Access

Managed API (HTTPS, WebSocket for streaming) · Python, TypeScript, Go, Ruby, Java SDKs · SOC 2 Type II, HIPAA BAA available on enterprise tier · US and EU regions.

Who It’s For (and Who Should Skip It)

Pick Universal-3 Pro if you’re building a product that needs speech-to-text as a component and you want to stop thinking about the STT piece quickly. Skip it if you need the absolute lowest WER regardless of cost (Scribe v2 is ~1 point more accurate on Artificial Analysis) or if you only need English and want to pay as little as humanly possible (Groq Whisper Turbo runs at $0.67 per 1K min with accuracy that’s within noise for most audio).

Try AssemblyAI

2. Deepgram Nova-3: Best for voice agents

Nova-3 is the model the voice-agent community actually ships to production — and its middling benchmark score is precisely the point. Nova-3 lands at 5.4% word error rate on Artificial Analysis, noticeably behind ElevenLabs Scribe v2 and Voxtral Small. But WER on clean batch audio is not what voice-agent builders are buying. They’re buying 150-millisecond time-to-first-token, WebSocket stability under load, sharp end-of-turn detection that doesn’t cut users off mid-sentence, and a streaming-native product (Flux) that integrates turn-taking directly into the model rather than bolting on a VAD. When you’re running 10,000 concurrent phone calls and every 100 milliseconds of latency costs you a conversation, the model that “just works” on call-audio codecs beats the model that’s two percentage points more accurate on LibriSpeech every time.

Key Features

  • Flux (launched Oct 2025) integrates end-of-turn detection into the model, eliminating the separate VAD stage that competitors require for clean turn-taking
  • Streaming-first architecture with WebSocket connections designed for always-on voice-agent use — reconnection, jitter buffering, and partial-transcript streaming all handled server-side
  • Nova family spans a speed/price spectrum: Nova-3 for best quality, Nova-2 for high-throughput pipelines, Base for absolute speed at the cost of some accuracy
  • Custom vocabulary (“keyterms”) and a purpose-built call-center feature set (channel separation, custom entity detection, PII redaction)
  • The lowest TTFT (time-to-first-token) we’ve seen on any streaming STT — under 200 milliseconds is routine
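For pre-recorded audio, all of these options ride as query parameters on Deepgram's `/v1/listen` route. A sketch of the URL construction — `keyterm` is Nova-3's custom-vocabulary parameter and repeats once per term, per Deepgram's docs, though you should confirm current parameter names before relying on this:

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN = "https://api.deepgram.com/v1/listen"

def listen_url(model: str = "nova-3", keyterms: tuple[str, ...] = ()) -> str:
    """Build a pre-recorded transcription URL with diarization and keyterms."""
    params = [
        ("model", model),
        ("smart_format", "true"),  # punctuation, casing, number formatting
        ("diarize", "true"),       # per-word speaker labels in the response
    ]
    params += [("keyterm", t) for t in keyterms]  # one query param per term
    return f"{DEEPGRAM_LISTEN}?{urlencode(params)}"
```

You'd POST the audio bytes (or a `url` JSON body) to this URL with a `Token`-style Authorization header; the streaming/Flux path uses a WebSocket instead, with a similar parameter surface.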

Pros

  • The voice-agent market’s de facto default for good reason: latency, streaming stability, and real call-audio performance consistently beat everything else even when benchmark WER doesn’t
  • Flux’s integrated turn detection removes an entire class of bugs (premature interruptions, awkward long pauses) that plague VAD-plus-STT stacks
  • Documentation and sample code are obviously written by people who’ve built voice agents themselves — the tutorials walk you through the real failure modes, not just the happy path
  • Stable enough that provider outages are a rare Twitter headline, not a weekly annoyance

Cons

  • On clean batch audio (podcasts, narrated audiobooks, clean meeting recordings), Nova-3’s 5.4% AA-WER is beaten by ElevenLabs Scribe v2 (2.3%), Voxtral Small (2.9%), AssemblyAI Universal-3 Pro (3.2%), and even Whisper v3 on fal.ai (4.2%). If your use case is batch transcription, you’re paying for voice-agent infrastructure you don’t need
  • No open-weight option — Deepgram doesn’t publish weights, so you can’t self-host or fine-tune for privacy-sensitive deployments
  • Deepgram’s pricing is list-price transparent but doesn’t compete at the bottom end for vibe coders — at $4.30 per 1K minutes you’re paying ~6x Groq Whisper Turbo for streaming infrastructure that only pays off if you’re actually running real-time workloads

Pricing

Plan | Price | What’s Included
Free trial | $200 credit | Full Nova-3 access for evaluation
Pay-as-you-go | ~$4.30 per 1K min | Nova-3, diarization, keyterms, PII redaction
Flux (streaming) | Similar per-minute tier | Voice-agent streaming with integrated turn detection
Enterprise | Custom | SLA, dedicated capacity, HIPAA BAA, VPC

Deployment / Access

Managed API (REST + WebSocket) · Python, JavaScript, .NET, Go, Rust SDKs · SOC 2, HIPAA, PCI-DSS · US, EU, and self-hosted enterprise deployment options.

Who It’s For (and Who Should Skip It)

Pick Nova-3 if you’re building a voice agent, customer-support bot, or any product where a user is speaking to your software in real time and waiting for it to respond. Skip it if your job is batch transcription — podcasts, meeting recordings, audiobook captioning — because AssemblyAI Universal-3 Pro gets you more accuracy at less money, and Groq Whisper Turbo gets you far lower cost with WER that’s within noise on most batch audio.

Try Deepgram

3. ElevenLabs Scribe v2: Best for accuracy-critical audio

Scribe v2 tops Artificial Analysis at 2.3% word error rate — which doesn’t sound dramatic until you look at the gap to the rest of the field. The next-best tier (Voxtral Small, Gemini 3 Pro) sits at 2.9%; the voice-agent incumbents start at 3.2% and climb. For audio where every wrong word costs you downstream — legal depositions, medical dictation, financial earnings calls, broadcast captioning — the gap is real and worth paying for. Scribe v2 (batch) also ships with diarization up to 32 speakers in a single file, supports up to 1,000 context-aware “keyterms” for domain vocabulary after an April 2026 upgrade, and returns audio event tags (laughter, applause, non-speech sounds) alongside the transcript. There’s a separate streaming model — Scribe v2 Realtime — that targets voice-agent use cases, but it’s a different model with different trade-offs: no diarization and a server-side VAD default that catches voice-agent builders off guard.

Key Features

  • Lowest word error rate of any proprietary API we tested — 0.6 points clear of the next tier on Artificial Analysis
  • Diarization up to 32 distinct speakers in a single recording in the batch model; most competitors cap at 10 or fewer
  • Keyterm prompting: up to 1,000 context-aware domain vocabulary terms (expanded from 100 in the April 2026 upgrade), each up to 5 words, weighted during decoding without forcing substitution
  • 56 entity categories for PII and regulated-data redaction, including enumerated modes (e.g., [CREDIT_CARD_1]) for downstream reconciliation
  • Scribe v2 Realtime (launched Nov 2025) is a separate streaming model for voice-agent use — no diarization support, but accurate real-time transcription with language auto-detection and mid-conversation language switching
  • Audio event tags alongside the transcript — laughter, applause, footsteps, and other non-speech markers returned as structured entries
  • 90+ languages in the batch model with automatic language detection and Indic-English code-switching that preserves Latin script for English words (fixed in the April 2026 upgrade)
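A batch Scribe request is a multipart upload: the audio file plus a handful of form fields. A sketch of those fields — the names follow ElevenLabs' speech-to-text documentation as we understand it, and the `scribe_v2` model id is the one this roundup discusses, but treat both as assumptions to verify against the current API reference:

```python
SCRIBE_ENDPOINT = "https://api.elevenlabs.io/v1/speech-to-text"

def scribe_form_fields(num_speakers: int, tag_audio_events: bool = True) -> dict:
    """Form fields to send alongside the audio file in a multipart POST."""
    return {
        "model_id": "scribe_v2",           # assumed id -- check the docs
        "diarize": "true",
        # The batch model diarizes up to 32 speakers; clamp the hint.
        "num_speakers": str(max(1, min(num_speakers, 32))),
        # Laughter / applause / non-speech markers in the response.
        "tag_audio_events": "true" if tag_audio_events else "false",
    }
```

Remember the batch/Realtime split: these fields apply to the batch model only — Scribe v2 Realtime is a WebSocket product with no diarization and its own VAD settings.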

Pros

  • For any use case where an extra 1% of word errors has a real cost, Scribe v2 is the clear pick — the gap is not within noise
  • 32-speaker diarization is genuinely differentiated: if you’re transcribing group interviews, town halls, or panel discussions, nothing else handles crowded files as gracefully
  • ElevenLabs has more total capital and infrastructure behind it than any pure-STT vendor, so the API’s reliability has been solid since Scribe v2 launched
  • The streaming variant gives you a credible real-time path without having to swap to a different vendor

Cons

  • Price is near the top of the market at $6.67 per 1K minutes (batch) — almost 10x Groq Whisper Turbo. If your use case can tolerate a couple of percentage points of WER, you’d overpay significantly
  • Scribe v2 Realtime is effectively a different product and worth treating as one. No diarization. And the real-time API has a server-side VAD that, by default, waits for 2.5 seconds of silence before committing a turn — which translates to voice agents that respond 10-15 seconds after the user stopped talking, even in a quiet room. This is fixable (set a shorter silence threshold or disable server VAD and drive turn-taking from the client), but LiveKit builders hit it at launch. Verify your integration before committing
  • Multichannel recordings and diarization can’t be combined. Scribe v2 handles up to 5 channels with per-channel speaker ID, but if you need per-channel diarization across overlapping speakers, you’re picking one
  • Keyterm prompting is a list of phrases, not free-text context. If you want to prompt the model with a full paragraph of domain context (“this is a podcast about machine learning, the host is Lex, the guest is talking about gradient descent…”), AssemblyAI Universal-3 Pro’s natural-language prompting is the more flexible surface

Pricing

Plan | Price | What’s Included
Free | Limited monthly characters | Scribe v2 evaluation, no commercial use
Pay-as-you-go | $6.67 per 1K min | Scribe v2 batch, diarization (up to 32 speakers), timestamps
Scribe v2 Realtime | Metered separately | Streaming variant for voice-agent use
Scale (enterprise) | Custom | SLA, dedicated capacity, priority support

Deployment / Access

Managed API (REST + WebSocket for Realtime) · Python, Node, Java, .NET SDKs · Web playground for quick evaluation · SOC 2 Type II and GDPR-compliant · US and EU regions.

Who It’s For (and Who Should Skip It)

Pick Scribe v2 if word-level accuracy is the thing your downstream consumer cares about most — legal, medical, broadcast captioning, or any pipeline where a human reviews the transcript and hates fixing errors. Skip it if your build is price-sensitive (Groq Whisper Turbo at $0.67 is almost 10x cheaper with WER within noise for most audio) or if you need open weights for on-prem deployment (Voxtral Small is the accuracy-leader alternative you can self-host).

Try ElevenLabs

4. Whisper Large v3 Turbo (via Groq): Best for price-sensitive builds

The Whisper model itself is three years old, and on the Hugging Face Open ASR Leaderboard it sits at position 37 — no longer benchmark-competitive. What makes it the fourth model in this roundup is not the weights, it’s Groq. Running Whisper Large v3 Turbo on Groq costs $0.67 per 1K minutes at 257x real-time speed — which is cheaper and faster than every other hosted option we tested. Accuracy is 4.8% word error rate on Artificial Analysis: noticeably worse than the top tier, still within the range where most applications won’t notice. For vibe coders who want an OpenAI-compatible endpoint, a generous free tier, and “just paste the URL and go,” this is the answer. There’s also a quietly important detail: the same Whisper weights produce wildly different WER depending on which host you pick. Groq and Fireworks land around 4.8%; fal.ai’s Whisper v3 gets 4.2%; Together.ai gets 7.4%; Replicate’s default Whisper endpoint gets 10.2% — a 2.4x gap on identical weights, entirely from how the provider batches, VADs, and configures the inference loop. If you care about Whisper quality, the host matters as much as the model.

Key Features

  • Cheapest hosted Whisper option with a credible quality floor — $0.67 per 1K minutes at 4.8% AA-WER
  • 257x real-time speed, meaning a one-hour recording transcribes in ~14 seconds
  • OpenAI-compatible API — drop-in replacement for OpenAI’s Whisper endpoint, so migration is a one-line change
  • Open weights (MIT license) — if Groq ever raises prices or sunsets the endpoint, you own the escape hatch
  • Massive ecosystem of wrappers and derivatives: whisper.cpp for CPU / Apple Silicon, Distil-Whisper for faster local inference, WhisperX for diarization + alignment, MacWhisper for consumer Mac apps
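The "drop-in replacement" claim is concrete: the OpenAI Python SDK takes a `base_url`, so switching hosts is a table lookup. A sketch — the endpoint and model names below are current as of this writing, but verify them against each provider's docs before shipping:

```python
import os

# (base_url, transcription model id, API-key env var) per provider.
PROVIDERS = {
    "openai": ("https://api.openai.com/v1", "whisper-1", "OPENAI_API_KEY"),
    "groq": ("https://api.groq.com/openai/v1", "whisper-large-v3-turbo", "GROQ_API_KEY"),
}

def transcribe(path: str, provider: str = "groq") -> str:
    """Same SDK, same call -- only the base URL and model id change."""
    from openai import OpenAI  # deferred so the table is importable without the SDK
    base_url, model, key_env = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model=model, file=f).text
```

This is also the escape hatch in code form: if Groq's pricing or availability changes, adding a Fireworks or fal.ai row to `PROVIDERS` (or pointing `base_url` at a local whisper.cpp server) is the whole migration.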

Pros

  • The “just works, cheap and fast” answer for vibe coders and side-project builders. Free tier is generous enough to prototype meaningfully
  • The OpenAI-compatible endpoint means you can develop against OpenAI’s Whisper API, then flip a URL and pay 10x less for 10x faster inference on the exact same model
  • The open-weight escape hatch is genuine value. Groq is a specific bet on a specific provider, but if that bet goes wrong you can run Whisper locally or move to Fireworks or fal.ai without rewriting your integration
  • The community ecosystem around Whisper is unmatched — any edge case you hit has probably already been solved in a GitHub issue somewhere

Cons

  • Whisper hallucinates on silence. This is the model’s most famous failure mode: on audio with long pauses, non-speech segments, or unusual recording conditions, the decoder can invent text that isn’t in the audio. For medical, legal, or any high-stakes transcription, pair it with a voice-activity-detection preprocessing step or use a different model — Scribe v2, Universal-3 Pro, and GPT-4o-transcribe all mitigate this better
  • Whisper Turbo is explicitly not trained for translation — if you need speech-in-one-language-to-English-text, use Whisper Large v3 (non-Turbo) via Groq’s separate endpoint, or Whisper-1 via OpenAI directly
  • No streaming variant. Groq’s Whisper endpoint is batch-only; if you need real-time streaming, move to Deepgram, AssemblyAI Universal-Streaming, or Scribe v2 Realtime
  • Diarization isn’t included — you’ll layer WhisperX or Pyannote on top, which adds complexity

Pricing

Plan | Price | What’s Included
Free tier | Rate-limited | Full Whisper Turbo access for evaluation and low-volume personal use
Pay-as-you-go | $0.04 per hour ($0.67 per 1K min) | Whisper Large v3 Turbo at full speed
Pay-as-you-go (Whisper v3) | ~$1.85 per 1K min | Non-Turbo Whisper Large v3, supports translation
Enterprise | Custom | Dedicated capacity, SLA

Deployment / Access

Managed API (OpenAI-compatible REST) · Any HTTP client · Open weights also runnable locally via whisper.cpp, faster-whisper, or Transformers · MIT license on weights · Groq API is US-hosted.

Who It’s For (and Who Should Skip It)

Pick this if you’re prototyping, building a side project, running low-to-medium batch volumes on a budget, or need an OpenAI-compatible drop-in that costs almost nothing. Skip it if you need the absolute best accuracy (Scribe v2 or Voxtral Small), if hallucinations on silence are a dealbreaker (GPT-4o-transcribe or Cohere Transcribe handle silence better), or if you need streaming (Deepgram Nova-3 Flux).

Try Groq Whisper

5. Google Gemini family: Best for mixed audio tasks

Gemini isn’t a transcription model in the traditional sense — it’s a multimodal LLM that happens to be very good at transcription, which means the question it answers differently from everything else in this roundup is “what else can I do with the audio at the same time?” Gemini 3 Pro lands at 2.9% word error rate — tied for second place on Artificial Analysis with Voxtral Small — while also being able to summarize the audio, answer questions about it, translate it, extract structured data from it, and follow instructions about what parts to include or exclude, all in one API call. The catch: it’s slow (8x real-time on Gemini 3 Pro, vs 257x for Groq Whisper Turbo) and expensive at the top tier ($18.40 per 1K minutes). The actually interesting entry in this family is Gemini 2.0 Flash Lite: at $0.19 per 1K minutes with 4.0% accuracy, it’s the cheapest model in the entire category by a wide margin, and the quality is competitive with hosted Whisper variants that cost 3-8x more.

Key Features

  • The widest price span of any family in this roundup: $0.19 per 1K min at the bottom (Gemini 2.0 Flash Lite) to $18.40 at the top (Gemini 3 Pro High), all with similar architecture
  • Multimodal — you pass audio and a prompt, the model does transcription, translation, summarization, entity extraction, Q&A, or whatever you ask in one call
  • Long-form audio support: Gemini handles hour-long recordings in a single call without manual chunking
  • Instruction-following: you can tell Gemini to “transcribe only the interviewee’s responses” or “return the transcript in JSON with speaker tags,” and it mostly complies
  • Integrates natively into the Google Cloud ecosystem if you’re already there
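The "audio plus a prompt" pattern looks like a single generateContent request. A sketch of the REST body with inline audio — the shape follows Google's Generative Language API as we understand it (small clips can ride inline as base64; hour-long recordings should go through the Files API instead), so check the current reference before building on it:

```python
import base64

def gemini_body(audio_bytes: bytes, instruction: str) -> dict:
    """generateContent body: one instruction part, one inline-audio part."""
    return {
        "contents": [{
            "parts": [
                {"text": instruction},
                {"inlineData": {
                    "mimeType": "audio/mp3",
                    "data": base64.b64encode(audio_bytes).decode(),
                }},
            ],
        }]
    }

# One call, several jobs -- the "mixed audio tasks" pitch in a single body:
body = gemini_body(b"...", "Transcribe this meeting, then list action items as JSON.")
```

The instruction string is where Gemini differs from dedicated ASR: "transcribe only the interviewee" or "return JSON with speaker tags" goes in the same field, with the compliance caveats noted below.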

Pros

  • Gemini 2.0 Flash Lite at $0.19 per 1K minutes is an absolute price floor — we can’t find any other model at this tier that matches its accuracy
  • If your application genuinely needs “transcribe AND summarize AND extract action items” in one pass, Gemini collapses two or three API calls into one and often costs less than the sum of its parts
  • Long-form audio handling is smoother than the chunked, reassembled approach most dedicated ASR models use
  • The instruction-following lets you do things that are awkward elsewhere, like “transcribe only when the CEO is speaking, skip the Q&A”

Cons

  • Speed is the obvious weakness: Gemini 3 Pro runs at 8x real-time, compared to 257x for Groq Whisper Turbo or 458x for Deepgram Nova-2. For high-volume batch pipelines, that’s a real throughput ceiling
  • Streaming is weak. Gemini has a Live API that supports audio input, but it’s not optimized for low-latency voice-agent use the way Deepgram Flux or AssemblyAI Universal-Streaming are
  • The Pro-tier pricing is genuinely high: at $18.40 per 1K minutes, Gemini 3 Pro is ~28x the price of Groq Whisper Turbo for a 1-point accuracy improvement that most applications won’t care about
  • Word-level timestamps and diarization aren’t native the way they are in dedicated ASR APIs — you’ll ask for them in the prompt and hope the model complies

Pricing

Plan | Price | What’s Included
Free tier | Generous rate-limits | Gemini 2.0 Flash Lite, Flash, and some 2.5 Flash access
Gemini 2.0 Flash Lite | $0.19 per 1K min | Cheapest in the set; 4.0% AA-WER
Gemini 2.5 Flash | ~$6.66 per 1K min | 5.3% WER but much faster than Pro tiers
Gemini 2.5 Pro | $11.39 per 1K min | 3.0% WER, balanced
Gemini 3 Pro (High) | $18.40 per 1K min | Top accuracy (2.9%), slowest, multimodal reasoning

Deployment / Access

Managed API via Google AI Studio and Vertex AI · Python, Node, Go, Java SDKs · Multi-region hosting on Google Cloud · Long-context audio handling via Files API · Generous free tier for evaluation.

Who It’s For (and Who Should Skip It)

Pick a Gemini model if your product benefits from combining transcription with reasoning — “summarize this meeting and extract action items” is a one-call operation for Gemini. Pick Gemini 2.0 Flash Lite specifically if you’re price-sensitive and willing to accept slightly lower speed in exchange for the lowest $/hour of audio in the entire category. Skip Gemini if you need real-time streaming voice-agent latency (Deepgram Nova-3 Flux is the answer) or if throughput is your bottleneck on a high-volume batch pipeline (Parakeet or Whisper Turbo will transcribe 10-30x faster).

Try Gemini

6. Mistral Voxtral Small 24B: Best open-weight accuracy

Voxtral Small is the only model in this roundup that is both a genuine accuracy leader and open-weight under a permissive license. On Artificial Analysis it lands at 2.9% word error rate — tied for second, behind only ElevenLabs Scribe v2. And it’s Apache 2.0, which means you can self-host it, fine-tune it, and ship it inside your product without attribution requirements. There’s one catch that matters: Voxtral Small is 24 billion parameters and wants around 55GB of GPU memory in bfloat16 — that’s an H100 or A100 80GB, or two A100 40GBs in tensor-parallel configuration. For a team with real infrastructure, Voxtral Small is transformative. For a solo developer on a single 4090, the sibling 3B Voxtral Mini (which runs in about 9.5GB of memory) is the realistic local option, and a 4B Voxtral Mini Realtime was released in February 2026 as the first credible open-weight streaming transcription model. The other thing Voxtral does that no other model in this roundup does at this scale: because it’s a speech-augmented LLM, it can answer questions about the audio and summarize it in the same call, without a separate LLM post-processing step.

Key Features

  • AA word error rate of 2.9% — tied with Gemini 3 Pro for second place on the Artificial Analysis leaderboard, and the clear #1 open-weight model on that benchmark
  • Apache 2.0 license on the weights (including the Small, Mini, and Mini Realtime variants) — commercial use with no attribution requirement
  • Speech-augmented LLM architecture: the same model transcribes, summarizes, answers questions about the audio, and translates — all in one forward pass with 32K context (about 30 minutes of audio in a single call)
  • Voxtral Mini 3B runs on a single 24GB consumer GPU — self-hosting is realistic for individual developers who pick the Mini variant
  • Voxtral Mini Realtime (Feb 2026) is the first credible open-weight model for real-time streaming ASR, with latency configurable between 80 milliseconds and 2.4 seconds
  • Day-zero vLLM support means you can run it with a production serving framework out of the box; Red Hat AI, SageMaker, OpenRouter, Groq, Together, and DeepInfra all host Voxtral variants
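The one-forward-pass Q&A pitch looks like an ordinary chat-completions request with an audio content part. A sketch of the payload — the `input_audio` content-part schema here follows the OpenAI-style audio extension that several hosts accept, and the model id is Mistral's hosted alias; both are assumptions to verify against whichever endpoint (La Plateforme, vLLM, a third-party host) you actually target:

```python
import base64

def voxtral_chat_body(audio_bytes: bytes, question: str) -> dict:
    """One request: the model hears the audio and answers the question,
    with no separate transcribe-then-LLM step."""
    return {
        "model": "voxtral-small-latest",  # hosted alias; use the HF id when self-hosting
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": base64.b64encode(audio_bytes).decode(),
                                 "format": "wav"}},
                {"type": "text", "text": question},
            ],
        }],
    }
```

Per the prompt-injection caveat below: the audio part is untrusted input that the model may treat as instructions, so handle the answer the way you'd handle any LLM output over untrusted data.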

Pros

  • The combination of “near-best-in-class accuracy, permissively licensed, runs on commodity infrastructure” is genuinely rare and worth building on if it matches your profile
  • Native audio Q&A and summarization eliminates a whole class of pipeline complexity — you don’t need to transcribe, then feed the transcript through a separate LLM, then reconcile the two outputs
  • Mistral’s own hosted API is aggressively priced: $4 per 1K minutes for full Voxtral, $1 per 1K minutes for Voxtral Mini Transcribe — cheaper than most proprietary APIs at comparable accuracy
  • The community inference-engine ecosystem is moving fast: pure-C and Rust ports have shipped for Voxtral Mini Realtime, suggesting serious commitment to portability

Cons

  • Voxtral Small 24B is genuinely not a solo-developer model. 55GB VRAM forces you to rent an H100 or A100 — if you’re a one-person team without cloud infrastructure, pick Voxtral Mini 3B or Voxtral Mini Realtime instead
  • The “99 languages” claim you’ll see in some early coverage is not supported by Mistral’s own documentation. Their blog lists 8 languages explicitly plus automatic detection; Voxtral Realtime covers 13. For a tail language, test before committing
  • The speech-augmented LLM design inherits a prompt injection risk: the audio itself can contain instructions that the model will follow, even when your system prompt tells it not to. Simon Willison documented this at launch. If you’re processing user-supplied audio, treat Voxtral Q&A as you would any LLM with untrusted input — assume it can be hijacked

Pricing

PlanPriceWhat’s Included
Open weightsFree (Apache 2.0)Self-host Voxtral Small 24B, Mini 3B, Mini 4B Realtime
Mistral API (Voxtral full)$4 per 1K minHosted Voxtral Small for transcription + Q&A
Mistral API (Voxtral Mini Transcribe)$1 per 1K minCheapest Voxtral tier — transcription only
Mistral API (Voxtral Realtime)$6 per 1K minStreaming variant
Third-party hostsVariesOpenRouter, Together, DeepInfra, Cloudflare all host Voxtral variants

Deployment / Access

Open weights on Hugging Face (Apache 2.0) · Mistral La Plateforme hosted API · vLLM Day-0 support · Third-party hosts: OpenRouter, Together, DeepInfra, Cloudflare · SageMaker deployment guide · Red Hat AI Day-1 integration · Community pure-C and Rust inference ports.

Who It’s For (and Who Should Skip It)

Pick Voxtral Small if you have real infrastructure (cloud GPUs, Kubernetes, vLLM deployment) and want the accuracy-leading open-weight model you can fine-tune and host yourself. Pick Voxtral Mini if you’re a solo developer with a consumer GPU who wants a local open-weight option that outperforms Whisper. Pick Voxtral Realtime if you’re building a voice agent and want to self-host the streaming piece. Skip Voxtral entirely if you need managed infrastructure and predictable billing without the self-hosting decision — AssemblyAI Universal-3 Pro or Deepgram Nova-3 are simpler.

Try Voxtral

7. NVIDIA Parakeet TDT 0.6B: Best for batch throughput

Parakeet is the fastest open-weight transcription model by a very large margin. On Hugging Face’s hardware-normalized benchmark it hits 3,386x real-time — Whisper Large v3 gets ~216x under the same conditions, and the nearest competitor in Parakeet’s accuracy tier gets ~1,000x. For teams running large batch pipelines (podcast archives, meeting recordings, video caption generation), the economics flip: Modal published a case study transcribing a full week of audio in a single minute for $1, documenting it as 112x faster and 200x cheaper than a proprietary API at matched error rates. Substack and Zencastr are named production users. The other thing Parakeet does well is community adoption — in the 12 months since launch, a wave of community inference ports has shipped (Parakeet.cpp for Metal GPU, parakeet-mlx for Apple Silicon, parakeet-rs for Rust, among others), consumer apps Handy and Hex built on them, and MacWhisper integrated it alongside Whisper as a selectable model. The trade-off is specific: Parakeet is best-in-class on narrated single-speaker audio and mediocre on spontaneous meetings — its AMI benchmark score (11.16) is notably worse than Cohere Transcribe (8.15) or IBM Granite (8.44) on the same test.

Key Features

  • 3,386x real-time speed (v2) / 3,332x (v3) on Hugging Face’s benchmark — the fastest open model by a wide margin; roughly 15x Whisper Large v3
  • CC-BY-4.0 license (commercial use permitted with attribution) — weights and code are available for self-hosting and fine-tuning
  • v2 is English-only; v3 (August 2025) adds 24 European languages with automatic language detection
  • 24-minute single-pass context on an A100 80GB — longer than most transformer ASR can handle without chunking
  • Runs on any NVIDIA GPU from V100 onward (2GB RAM minimum) — the actual broad hardware compatibility story in the open-weight tier
  • Word, segment, and character-level timestamps built in
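The v2-vs-v3 choice can be encoded directly, with NeMo deferred so the heavy install only matters when you actually transcribe. The checkpoint ids below come from the Hugging Face model cards; the call shape follows NeMo's documented `from_pretrained`/`transcribe` flow, though argument details shift between NeMo versions:

```python
# Checkpoint ids per the Hugging Face model cards.
PARAKEET_V2 = "nvidia/parakeet-tdt-0.6b-v2"  # English-only, best English WER
PARAKEET_V3 = "nvidia/parakeet-tdt-0.6b-v3"  # 24 European languages, slightly worse English

def pick_checkpoint(language: str) -> str:
    """v2 for English-primary pipelines, v3 for anything multilingual."""
    return PARAKEET_V2 if language == "en" else PARAKEET_V3

def transcribe(paths: list[str], language: str = "en"):
    import nemo.collections.asr as nemo_asr  # deferred: NeMo is a heavy, CUDA-picky install
    model = nemo_asr.models.ASRModel.from_pretrained(pick_checkpoint(language))
    return model.transcribe(paths)  # results carry text plus timestamp info
```

If the NeMo install fights you (see Cons below), the same checkpoints are reachable through the community ports — parakeet-mlx on Apple Silicon avoids CUDA entirely.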

Pros

  • When batch throughput is the metric, Parakeet’s speed advantage is not marginal — it’s an order of magnitude faster than Whisper at comparable accuracy
  • The community inference port ecosystem is unmatched: if you need Apple Silicon, Rust, WebGPU, CPU-only, or any other runtime, someone has already built it
  • MacWhisper integration is the consumer-facing validation: the leading paid macOS transcription app uses Parakeet as a user-selectable model alongside Whisper
  • The 24-minute single-pass context eliminates a class of chunk-boundary bugs (lost words at segment joins) that plague most ASR pipelines on long audio

Cons

  • Installation friction is real. The NeMo framework is CUDA-version-picky on Linux, and Windows installs are widely reported as painful. For non-ML-experts, use the community consumer apps (Handy, MacWhisper) or the community inference ports — don’t wrestle with NeMo directly
  • Parakeet is weak on real meetings. Its 11.16 AMI score puts it behind Cohere Transcribe, IBM Granite, and Whisper-family derivatives on spontaneous multi-speaker audio. If your pipeline is meetings-heavy, pick Cohere or test alternatives first
  • No native diarization — you’ll layer NVIDIA Sortformer or a Pyannote diarizer on top. For single-speaker audio (podcasts, narrated video) this doesn’t matter; for meetings it’s another integration burden
  • v2’s English-only limit sends multilingual users to v3 — which gave up a little English accuracy (6.34% vs 6.05% WER) to make room for multilingual training. Pick your variant based on your primary language

Pricing

| Plan | Price | What’s Included |
| --- | --- | --- |
| Open weights | Free (CC-BY-4.0) | Self-host Parakeet TDT 0.6B v2 (English) or v3 (multilingual) |
| NVIDIA NIM | Pay per request | Hosted inference on NVIDIA-managed infrastructure |
| Together AI (v3 hosted) | Metered | Third-party managed endpoint |
| Replicate (Parakeet RNNT) | ~$1.91 per 1K min | Hosted Parakeet variant on Replicate |

Deployment / Access

Open weights on Hugging Face (CC-BY-4.0) · NVIDIA NIM hosted API · Together AI for v3 · NVIDIA NeMo framework for native inference · Community ports: Parakeet.cpp (Metal), parakeet-mlx (Apple Silicon), parakeet-rs (Rust/ONNX), FluidAudio (Swift/Apple Neural Engine) · Consumer apps: Handy, Hex, MacWhisper.

Who It’s For (and Who Should Skip It)

Pick Parakeet if you’re running a large batch pipeline (podcasts, video captions, archive transcription) on narrated or single-speaker audio and speed/cost matters. Skip it if your primary workload is meeting transcription (Cohere Transcribe or Whisper wins on AMI), if you need diarization in the base API (AssemblyAI or Scribe v2 handle it natively), or if you need streaming voice-agent latency (Deepgram Nova-3 Flux or Voxtral Realtime). Try Parakeet

8. OpenAI GPT-4o-transcribe: Best for OpenAI-native stacks

GPT-4o-transcribe is OpenAI’s deliberate positioning of a Whisper successor. It lands at 4.1% word error rate on Artificial Analysis — better than hosted Whisper variants (4.2-4.9%) but meaningfully behind the current accuracy leaders (Scribe v2 at 2.3%, Voxtral Small and Gemini 3 Pro at 2.9%). The interesting things it does that Whisper doesn’t: it supports a prompt parameter that actually works (feed it acronyms, product names, speaker context), it’s the only way to get streaming transcription through the OpenAI Realtime API, and a diarization-enabled variant (gpt-4o-transcribe-diarize) handles multi-speaker audio natively. The wrinkle: OpenAI shipped a silent snapshot update in late 2025 that reverted some users’ setups — a reminder that “stable proprietary API” is not the same as “stable model version” even from OpenAI. For teams that are already routing everything through OpenAI and don’t want to add a second vendor, GPT-4o-transcribe is the obvious pick. For teams comparing it against alternatives, it’s rarely the best value on any single dimension.

Key Features

  • 4.1% AA-WER — better than any hosted Whisper variant; worse than the accuracy leaders
  • Prompt parameter that meaningfully improves transcription on jargon-heavy audio (fed acronyms, product names, speaker context)
  • Native streaming support through the OpenAI Realtime API — the only first-party OpenAI path to streaming transcription
  • gpt-4o-transcribe-diarize variant adds native multi-speaker separation
  • GPT-4o-mini-transcribe at $3 per 1K minutes is a meaningful price drop from full GPT-4o-transcribe with accuracy only slightly worse (4.6% vs 4.1%)
  • Already inside OpenAI’s SDK — if you’re using openai.audio.transcriptions.create() today, swapping from Whisper-1 is a model ID change
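The model-ID swap plus the prompt parameter can be sketched with the OpenAI Python SDK. This is a hedged illustration, not OpenAI's reference code — the vocabulary list and helper names are invented, and the call requires an `OPENAI_API_KEY` in the environment:

```python
# Hedged sketch: whisper-1 → gpt-4o-transcribe is a model ID change,
# plus a prompt assembled from your domain vocabulary.
# build_prompt and the example terms are illustrative, not from OpenAI's docs.

def build_prompt(terms: list[str]) -> str:
    """Pack domain vocabulary into the prompt parameter as plain text."""
    return "Vocabulary that may appear: " + ", ".join(terms)

def transcribe(path: str, prompt: str) -> str:
    from openai import OpenAI  # lazy import; needs OPENAI_API_KEY set
    client = OpenAI()
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",  # was: "whisper-1"
            file=f,
            prompt=prompt,
        )
    return result.text

prompt = build_prompt(["Kubernetes", "gRPC", "Voxtral"])
```

The prompt is free text, so acronyms, product names, and speaker context can all go in the same string.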

Pros

  • If you’re building on OpenAI and want one vendor, GPT-4o-transcribe is feature-complete: batch, streaming, diarization, timestamps, prompting
  • The prompting behavior is genuinely useful — you can ship domain accuracy without fine-tuning just by populating a good prompt with product vocabulary
  • OpenAI’s Realtime API is well-documented and reliable; if you’re already building real-time voice features with GPT-4o audio, adding transcription is a free upgrade
  • GPT-4o-mini-transcribe’s price-to-accuracy ratio is competitive for medium-volume pipelines

Cons

  • A silent snapshot update to gpt-4o-transcribe in late 2025 produced complaints from developers tracking API stability. The lesson: “use the model ID string” doesn’t guarantee bit-for-bit reproducibility over time — if you need truly stable transcription behavior, use open weights or pin a dated snapshot explicitly
  • $6 per 1K minutes is ~9x the price of Groq Whisper Turbo for accuracy that’s only ~0.7 percentage points better — a gap most applications won’t notice
  • No open-weight option. And on hallucination during silence — Whisper’s best-known failure mode — OpenAI claims improvement, but there are no independent benchmarks documenting that it’s fixed here
  • Simon Willison flagged prompt injection via audio as a theoretical concern at launch — if you’re transcribing user-supplied audio and then taking action based on the transcript, assume the audio can contain instructions

Pricing

| Plan | Price | What’s Included |
| --- | --- | --- |
| Free tier | Limited monthly usage | Evaluation via OpenAI Playground |
| GPT-4o-transcribe | $6 per 1K min | Top OpenAI transcription model; prompting support |
| GPT-4o-mini-transcribe | $3 per 1K min | Budget tier; 4.6% WER |
| GPT-4o-transcribe-diarize | Metered | Multi-speaker separation |
| Realtime API | Separate pricing | Streaming transcription via GPT-4o audio |

Deployment / Access

Managed API through OpenAI platform · Python, Node, .NET, Java SDKs · US and EU regions · Realtime API for streaming via WebSocket · Integrates with Assistants API · No self-host option.

Who It’s For (and Who Should Skip It)

Pick GPT-4o-transcribe if you’re already routing everything through OpenAI and want to keep one vendor, or if you specifically need prompting behavior for domain vocabulary without fine-tuning. Skip it if price is a real constraint (Groq Whisper Turbo runs $0.67 per 1K minutes) or if you want accuracy leadership (Scribe v2 or Voxtral Small are a point or more better). Try OpenAI Transcribe

9. NVIDIA Canary-Qwen-2.5B: Best small open-weight model

Canary-Qwen is the small open-weight model that punches above its weight. On Hugging Face’s Open ASR Leaderboard it lands at position 4 with 5.63% word error rate — beating Parakeet, Qwen3-ASR, and Whisper at a fraction of the size. Two things make it attractive: it’s only 2.5 billion parameters (fits comfortably on a 4090), and it’s a speech-augmented LLM, meaning it can both transcribe and run its decoder in LLM mode to summarize or answer questions about the transcript. The “catch” is a big one: Canary-Qwen is hard-capped at approximately 40 seconds of audio per inference call. That’s a defining limitation — for voicemail, short command-and-control, or pre-chunked audio it’s fine, but anything longer (podcasts, meetings) requires external chunking logic and loses the benefit of a single context. It’s also English-only despite the encoder’s multilingual pretraining, and there’s no vLLM or TensorRT support — you’re inside NVIDIA’s NeMo framework for inference.

Key Features

  • #4 on the Hugging Face Open ASR Leaderboard at 5.63% word error rate — the best open-weight result at the 2.5B parameter scale
  • 2.5B parameters — fits in about 5GB of VRAM; runs comfortably on an RTX 4090, A10, or L4
  • Two-mode design: ASR mode transcribes audio; LLM mode runs Qwen3-1.7B on the resulting transcript for summarization or Q&A
  • CC-BY-4.0 license — commercial use permitted with attribution
  • Punctuation and capitalization in the transcript output — no post-processing needed
  • 418x real-time speed on an A100

Pros

  • Best accuracy-per-parameter of any open-weight model we tested — a 2.5B model shouldn’t outperform 24B Voxtral Small on Hugging Face’s leaderboard, but Canary-Qwen does
  • Runs on commodity hardware. If you have a 4090 and want a top-4 open-weight transcription model, this is the only option that fits
  • The two-mode design gives you transcript-level LLM reasoning without bolting on a separate LLM — for short audio clips that need structured extraction, Canary-Qwen is the cleanest pipeline
  • Light weights plus vendor GPU support down to consumer cards (RTX 4090/5090) and A10-class datacenter parts — it’s genuinely accessible

Cons

  • The 40-second audio cap is the defining constraint. This is not a long-form model. Anything longer than about 40 seconds requires external chunking — and the segment boundaries can lose words. For meetings, podcasts, or any long-form work, pick Parakeet, Voxtral, or Whisper instead
  • No streaming. Canary-Qwen is batch-only. Voice-agent builders should skip it entirely
  • English-only, despite the encoder having seen German, French, and Spanish during pretraining. The model card is explicit that non-English use is unreliable
  • Deployment friction is real: no vLLM or TensorRT support as of April 2026, so you’re running NeMo’s own inference stack or paying for Replicate’s hosted endpoint. This disqualifies Canary-Qwen from any serving platform you’d pick for production
  • The two-mode design means LLM mode doesn’t actually see the audio — it only reasons on the transcript. If you need prosody or tone awareness (e.g., “was the speaker upset?”), this model can’t answer

Pricing

| Plan | Price | What’s Included |
| --- | --- | --- |
| Open weights | Free (CC-BY-4.0) | Self-host via NVIDIA NeMo framework |
| Replicate (hosted) | $0.74 per 1K min | Canary-Qwen-2.5B hosted endpoint |

Deployment / Access

Open weights on Hugging Face (CC-BY-4.0) · NVIDIA NeMo framework for native inference · Replicate hosted API · Community RunPod and Cog wrappers · No vLLM or TensorRT support yet.

Who It’s For (and Who Should Skip It)

Pick Canary-Qwen if you’re transcribing short audio clips (under 40 seconds) on a consumer GPU and want the best open-weight accuracy at that scale — voicemail transcription, command-and-control, short voice notes. Skip it for anything longer (Parakeet or Voxtral) or anything requiring streaming or multilingual (pick AssemblyAI, Deepgram, or Voxtral Realtime). Try Canary-Qwen

10. Cohere Transcribe 03-2026: Best benchmark leader today

Cohere Transcribe currently sits at #1 on the Hugging Face Open ASR Leaderboard with a 5.42% average word error rate — beating Zoom Scribe, IBM Granite, Canary-Qwen, ElevenLabs Scribe v2 (on this benchmark), and every Whisper variant. It’s also Apache 2.0 and available through a soft contact-info gate on Hugging Face (not a waitlist — you fill out a short form and immediately get access; the community has pulled down the weights over 170,000 times in the first month). What makes the #1 position especially credible is where Cohere wins — it’s the top model on AMI meetings (8.15 WER) while simultaneously topping LibriSpeech clean (1.25 WER), an unusual combination that suggests real generalization rather than a dataset-specific overfit. The caveat: this is a brand-new model and there are no production case studies yet. The quote in Cohere’s launch blog is from a VC, not a customer. Three other things matter for anyone evaluating it: Cohere Transcribe has no streaming, no native diarization, and no word-level timestamps. It’s best-in-class for raw text-from-audio — not a replacement for a voice-agent stack or a meeting transcription product that needs speaker separation.

Key Features

  • #1 on the Hugging Face Open ASR Leaderboard (April 2026): 5.42% average WER, best AMI meeting score in the top 10
  • Apache 2.0 license with hosted API, open-weight download, and vLLM upstream support all live at launch
  • 14 languages with deliberate coverage: 9 European, 4 Asian (including Japanese, Chinese, Korean, Vietnamese), and Arabic
  • vLLM integration contributed by Cohere as an upstream pull request, including variable-length audio handling — so serving is a one-line vllm serve command
  • Hosted API with a free rate-limited trial tier plus Cohere Model Vault for dedicated deployments
  • Rapid ecosystem: within the first 17 days post-launch, the community shipped an MLX port for Apple Silicon, a Rust port, a Chrome extension, iOS and Android app integrations, and ONNX quantizations
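Since the weights ship with upstream vLLM support, a local client can be sketched against vLLM's OpenAI-compatible server. This is a hedged sketch: vLLM does expose an OpenAI-style transcription route for ASR models, but whether cohere-transcribe-03-2026 is wired to that exact route in your vLLM version is an assumption — verify against the vLLM docs. The server command is the one from this article; the client helper name is invented:

```python
# Hedged sketch: client for a local vLLM server started with
#   vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code
# Assumes vLLM's OpenAI-compatible /v1/audio/transcriptions route serves
# this model — check your vLLM version's docs before relying on it.

BASE_URL = "http://localhost:8000/v1"  # vLLM's default port

def transcribe_local(path: str) -> str:
    from openai import OpenAI  # lazy import; reusing the OpenAI client shape
    client = OpenAI(base_url=BASE_URL, api_key="EMPTY")  # vLLM ignores the key
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="CohereLabs/cohere-transcribe-03-2026",
            file=f,
        )
    return result.text
```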

Pros

  • Benchmark leadership is real and broad — wins on both meeting-style audio (AMI) and clean audiobook audio (LibriSpeech) simultaneously, which most models can’t do
  • Apache 2.0 plus a free hosted API plus one-line vLLM support means Cohere Transcribe is the easiest accuracy-leading model to actually start using today
  • 14 languages with Asian and Arabic coverage is a stronger multilingual story than Parakeet v3 (European-only) for builders working outside the Western market
  • Ecosystem velocity in the first 30 days has been unusual — community inference ports and integrations that took Parakeet months shipped here in under three weeks

Cons

  • No streaming, no native diarization, no word-level timestamps. If your product is a voice agent or a meeting tool, Cohere Transcribe isn’t directly usable — you’d need to layer a separate diarizer (Sortformer or Pyannote) and ship without timestamps
  • No production case studies exist yet — the model is 17 days old as of this writing. The Radical Ventures quote in the launch blog is a VC testimonial, not a deployed customer. Treat it as a “watch closely, verify before committing” pick until someone ships a real production story
  • 14 languages is a deliberate quality-over-quantity choice, but it does mean tail languages supported by Whisper (~99 claimed) aren’t here. For those use cases, this isn’t your model
  • Cohere’s Model Vault pricing for dedicated Transcribe deployments is not published on the public pricing page — you need a sales conversation for the managed enterprise path

Pricing

| Plan | Price | What’s Included |
| --- | --- | --- |
| Free trial | Rate-limited | Hosted API access for prototyping |
| Cohere API | Varies (not published on main page) | Pay-per-request hosted transcription |
| Open weights | Free (Apache 2.0, contact-gated HF download) | Self-host with vLLM or Transformers |
| Model Vault | Custom | Dedicated managed deployment for production |

Deployment / Access

Managed API on api.cohere.com · Open weights on Hugging Face via a one-step contact-info form · vLLM Day-0 upstream support (vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code) · Community MLX, Rust, and Chrome extension ports · Cohere Model Vault for dedicated managed deployments.

Who It’s For (and Who Should Skip It)

Pick Cohere Transcribe if you’re building a batch transcription pipeline where accuracy matters above all else and you can tolerate being an early adopter — or if you specifically need its Japanese, Chinese, Korean, Vietnamese, or Arabic coverage with top-tier accuracy. Skip it if you need streaming (Deepgram Nova-3, Voxtral Realtime, AssemblyAI Universal-Streaming), if you need native diarization (Scribe v2, AssemblyAI Universal-3 Pro), or if you need word-level timestamps (any other model in this roundup). Try Cohere Transcribe

11. Speechmatics Enhanced: Best for non-English and code-switching

Speechmatics is the only model in this roundup built around multilingual as a first-class concern — and the only one that ships in on-premises, container, virtual-appliance, and on-device deployment modes alongside its hosted API. “Enhanced” isn’t a separate model, by the way; it’s an operating_point setting on the same underlying model that Standard uses. Pick Enhanced and you get 4.3% word error rate on Artificial Analysis (competitive with the top proprietary tier) at ~$0.40 per hour in batch; pick Standard and you save 40% in exchange for roughly one extra point of error. Speechmatics supports roughly 50 single-language packs plus 7 bilingual packs (Arabic-English, Mandarin-English, Spanish-English, Tamil-English, Tagalog-English, Malay-English, Mandarin-Malay-Tamil-English) — and in March 2026 launched an Arabic-English model that handles mid-sentence code-switching, the hardest problem in bilingual transcription. For anyone building a non-English product, or for regulated-industry buyers who can’t send audio to a SaaS API at all, Speechmatics is the shortlist entry that actually makes sense.

Key Features

  • ~50 single-language packs plus 7 bilingual packs, with Arabic-English, Mandarin-English, Spanish-English, Tamil-English, Tagalog-English, Malay-English, and Mandarin-Malay-Tamil-English as deliberate multilingual products, not afterthoughts
  • Arabic-English bilingual model (March 2026) specifically handles mid-sentence code-switching — Speechmatics’ own reported numbers are 6.3% WER vs Google’s 9.7% on code-switched audio, a 35% relative improvement (vendor-reported, not independently replicated)
  • operating_point setting — standard is the default (5.3% WER, ~$4 per 1K min); enhanced is the quality-optimized path (4.3% WER, ~$6.70 per 1K min) on the exact same underlying model
  • Real-time and batch share the same model; both return partial transcripts within a few hundred milliseconds in streaming mode
  • Deployment modes unmatched by any other entry in this roundup: SaaS cloud, Private Cloud, Container, Virtual Appliance, and On-Device. For regulated industries where audio cannot leave your network, this is the only shortlist option that isn’t open-weight
  • Custom dictionary with pronunciation hints (sounds_like) — lighter-weight than ElevenLabs’ keyterm prompting or AssemblyAI’s natural-language prompting, but useful for domain vocabulary
  • Native speaker, channel, and combined diarization modes — real-time diarization is supported in both streaming and batch

Pros

  • For any multilingual product, Speechmatics is the realistic option. No other model in this roundup treats non-English as a first-class concern — and Trelis Research’s independent practitioner review explicitly recommends Speechmatics for multilingual work
  • Code-switching handling on the Arabic-English model is a genuine first-of-its-kind capability — mid-utterance language switching is a hard problem, and the same architecture backs the SEA (Malay/Tamil/Tagalog + English) bilingual packs
  • On-premises, container, virtual-appliance, and on-device deployment are first-class product tiers, not enterprise add-ons. For healthcare, finance, government, and defense buyers who cannot send audio to a SaaS API, this is the deciding factor
  • The operating_point design lets you match accuracy spend to your use case without rewriting the integration — a one-line change switches from Standard to Enhanced
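The "one-line change" can be sketched as a job config. The transcription_config / operating_point shape follows Speechmatics' batch API documentation, but treat the exact field names as an assumption to verify against the current API reference; the helper name is invented:

```python
# Hedged sketch of a Speechmatics batch job config.
# Field names follow the Speechmatics batch API docs; verify against
# the current reference before shipping. job_config is an invented helper.

def job_config(language: str = "en", enhanced: bool = False) -> dict:
    return {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            # the one-line accuracy/price switch:
            "operating_point": "enhanced" if enhanced else "standard",
        },
    }

cfg = job_config(enhanced=True)
```

Everything else about the integration — upload, polling, transcript retrieval — stays identical across the two tiers.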

Cons

  • Speechmatics’ English-language community presence is thin — very few independent reviews, limited HN and Reddit discussion, most coverage is vendor-produced. If you’re debugging an edge case, you’re more on your own than you would be with Deepgram or AssemblyAI
  • Speechmatics Standard at 5.3% WER is meaningfully worse than the top tier, and at ~$4 per 1K minutes it costs ~6x Groq Whisper Turbo — so the “budget tier” is neither the cheapest nor the most accurate option for English-only workloads
  • The “55+ languages” marketing number is really about 50 single-language packs plus 7 bilingual packs — still the broadest in this roundup, but don’t read it as “every tail language Whisper claims.” Language identification has documented exceptions (Interlingua, Esperanto, Uyghur, Cantonese, Irish, Maltese, Urdu, Bengali, Swahili) that must be set explicitly rather than auto-detected
  • Pricing is closer to the top of the market than the bottom for real-time use — at $0.56 per hour for real-time Enhanced, it’s noticeably pricier than Deepgram Nova-3 or AssemblyAI Universal-Streaming for voice-agent workloads

Pricing

| Plan | Price | What’s Included |
| --- | --- | --- |
| Free trial | Limited credits | Standard and Enhanced evaluation |
| Standard | ~$4 per 1K min | Speed-optimized, 55+ languages, real-time and batch |
| Enhanced | ~$6.70 per 1K min | Accuracy-optimized, 55+ languages, real-time and batch |
| Enterprise | Custom | On-premises / VPC deployment, SLA, dedicated support |

Deployment / Access

Managed API (REST + WebSocket for real-time) · Python, Node, Java SDKs · Cloud hosted in UK and US · On-premises and containerized deployment for enterprise · SOC 2 and GDPR compliant.

Who It’s For (and Who Should Skip It)

Pick Speechmatics if your product needs to handle non-English audio as a first-class concern, if you’re transcribing audio where speakers code-switch between languages, or if your enterprise needs on-premises deployment for data residency. Skip it if you’re English-only and price-sensitive (Groq Whisper Turbo is ~10x cheaper), or if you need the absolute best English accuracy (Scribe v2 or Voxtral Small beat it there). Try Speechmatics

Selection Guide

If you’re not sure where to start, this is the shortest path from “what are you building?” to “which model”:
  • Building a voice agent, customer-support bot, or any real-time phone or voice interface → Deepgram Nova-3 (with Flux for turn detection)
  • Building a meeting transcription, podcast transcription, or general batch pipeline and you want one vendor that covers the most ground → AssemblyAI Universal-3 Pro
  • You need the absolute lowest word error rate and price is not the primary constraint → ElevenLabs Scribe v2
  • You’re prototyping, building a side project, or running low-to-medium batch volume on a budget → Whisper Large v3 Turbo on Groq ($0.67/1K min)
  • You want the cheapest hosted option that still delivers quality → Gemini 2.0 Flash Lite ($0.19/1K min)
  • You’re already building on OpenAI and want to keep one vendor → GPT-4o-transcribe
  • You need to self-host for privacy, compliance, or cost at high volume → Mistral Voxtral Small (if you have serious GPU infra) or Voxtral Mini 3B (if you’re on a consumer GPU)
  • You’re running a high-throughput batch pipeline and speed per dollar dominates → NVIDIA Parakeet TDT 0.6B v2 (English) or v3 (with European languages)
  • You need a small open-weight model for short audio clips on a 4090 → NVIDIA Canary-Qwen-2.5B
  • You want the current accuracy leader and you can handle “no streaming, no diarization, no timestamps” → Cohere Transcribe 03-2026
  • Your product is non-English-first or needs to handle Arabic-English code-switching → Speechmatics Enhanced

How We Tested

We evaluated 52 transcription models across every major benchmark, provider page, community forum, and independent practitioner test we could find. The 11 models in this roundup are the ones that survived: each owns a defensible asymmetry — something it does better than everything else in the set — and has either top-tier benchmark position or meaningful community adoption signaling that it’s worth the reader’s attention. We do not use affiliate links, accept sponsorships, or take payment from model providers. Our recommendations are based entirely on our testing and research.

Selection Criteria

We scored every model across six dimensions:
  • Accuracy on real-world audio — Word error rate on the Hugging Face Open ASR Leaderboard (eight English datasets including LibriSpeech, TED-LIUM, GigaSpeech, and AMI meetings) and on Artificial Analysis (which weighs voice-agent-style audio more heavily). We treat the two benchmarks as complementary because they disagree in useful ways.
  • Latency and streaming support — Whether the model has a real streaming variant, what time-to-first-token looks like, and how turn detection is handled.
  • Deployment mode — Managed API, open weights, or both, and what hardware the open-weight models actually need.
  • Price per 1,000 minutes of audio — Normalized across different billing models (per-character, per-token, per-minute, per-request) so they’re directly comparable.
  • Feature suite — Diarization, word-level timestamps, custom vocabulary, language coverage, PII redaction, prompting support.
  • Community adoption signal — Whether builders are actually shipping with this in production, and what kinds of problems they run into.
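The price normalization in the fourth criterion is simple arithmetic, sketched here with illustrative rates (not quotes from any vendor):

```python
# Hedged sketch: normalizing billing units to price per 1,000 minutes of audio.
# Example rates are illustrative, not vendor quotes.

UNIT_MINUTES = {"per_second": 1 / 60, "per_minute": 1.0, "per_hour": 60.0}

def per_1k_min(rate: float, unit: str) -> float:
    """Convert a price quoted per `unit` of audio to price per 1,000 minutes."""
    return rate / UNIT_MINUTES[unit] * 1000

print(per_1k_min(0.40, "per_hour"))  # $0.40/hour works out to ~$6.67 per 1K min
```

Per-character and per-token billing need an extra assumption (average characters or tokens per minute of speech), which is where most of the uncertainty in cross-vendor comparisons comes from.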

Research Process

We pulled the current Hugging Face Open ASR Leaderboard and Artificial Analysis STT leaderboard directly (scraped 2026-04-12, most recent leaderboard update 2026-04-10) so the benchmark data we rely on is current rather than quoted from old blog posts. For each shortlisted model, we read vendor pricing and documentation pages, model cards and changelogs, third-party practitioner blogs (Modal, Trelis Research, Gladia), independent evaluations (9to5Mac, Simon Willison), and community discussion on Hacker News and r/LocalLLaMA. Where a model’s feature set is documented (e.g., Whisper’s hallucination behavior, Canary-Qwen’s 40-second limit, Parakeet’s install friction), we flag it in the relevant model’s entry rather than burying it.
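Since both leaderboards score on word error rate, a minimal reference implementation clarifies what the percentages in this article mean — Levenshtein edit distance over words, divided by the reference length. Production scorers also normalize text (case, punctuation) before scoring; this sketch skips that step:

```python
# Minimal word error rate: word-level edit distance / reference length.
# Text normalization (lowercasing, punctuation stripping) is deliberately omitted.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution in six words
```

One substitution in a six-word reference is a 16.7% WER — which is why a move from 4.8% to 4.1% is visible only on long transcripts, while 10% vs 3% is visible in every paragraph.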

Models We Left Out (and Why)

  • IBM Granite Speech 3.3 — Strong benchmark positions (HF top 8), Apache 2.0 license. Cut because community adoption is effectively zero outside IBM’s enterprise customer base, and the model targets a buyer profile (regulated enterprise, on-premises deployment) that’s different from this article’s audience.
  • Zoom Scribe v1 — Shows up at Hugging Face position #2 but it’s not actually purchasable as a general-purpose STT API. Zoom exposes it only through their developer platform to apps inside Zoom, which makes it irrelevant for anyone not building on Zoom.
  • Alibaba Qwen3-ASR — Top 10 on Hugging Face with strong multilingual coverage including 22 Chinese dialects. Cut because community signal in the English-language developer ecosystem is thin. If you’re building in the Chinese market, evaluate Qwen3-ASR directly.
  • Microsoft Phi-4 Multimodal — Unified text, vision, and speech in a single 5.6B model. Interesting as a research object, thin as a production transcription choice — the speech piece is a capability inside a multimodal LLM, not a focused STT product.
  • Amazon Transcribe — Competent but priced at $24 per 1K minutes on Artificial Analysis — the worst price in the entire set for middling quality. If you’re deep in AWS, evaluate it directly; we can’t recommend it broadly.
  • Azure MAI-Transcribe-1 — Microsoft’s current flagship STT (3.0% AA-WER according to Artificial Analysis). Thin community signal and Azure-gated positioning. If you’re already on Azure, evaluate directly.
  • Rev.ai Fusion — Rev’s in-house ASR. Lands at Hugging Face position 28 with 7.12% WER — middling. Rev’s brand value is in their human-transcription pipeline, not model quality.
  • Meta SeamlessM4T v2 — 101-language ASR + translation model. Non-commercial license (CC-BY-NC-4.0) disqualifies it for most production use.
  • Aqua Avalon, Smallest.ai Pulse STT, Z.AI GLM-ASR-Nano, Kyutai STT — All present on a benchmark leaderboard, all with essentially zero community signal or production adoption. Watch for their next release.

Adjacent Categories

If you’re looking for a ready-to-use product rather than a model to build on, see our separate roundup of AI transcription tools (Otter, Descript, MacWhisper, Superwhisper, Rev). The models in this article are the underlying pieces those tools build on — some of them literally use the same weights.

What You Need to Know Before Using Transcription Models

Transcription carries risks that generic AI APIs don’t. Before shipping a product with any of these models in the path, understand three practical concerns.

Hallucinations and High-Stakes Transcription

Whisper has a well-documented failure mode: on audio with long pauses, non-speech segments, or unusual recording conditions, the decoder can invent text that isn’t in the audio. A 2024 academic study found about 1% of Whisper transcriptions contained hallucinated content — 38% of those included what the researchers called “explicit harms” like fabricated violence or false medical statements. For podcast transcription or meeting summaries, this is a minor annoyance you can fix on review. For medical dictation, legal depositions, or any pipeline feeding an automated action, it’s a hard problem. If your use case is high-stakes, either use a model with documented hallucination mitigation (GPT-4o-transcribe, Cohere Transcribe, ElevenLabs Scribe v2) or pair Whisper with a voice-activity-detection preprocessing step to filter silence before decoding.
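The VAD preprocessing step can be sketched with a crude energy gate that drops near-silent frames before audio reaches the decoder. This is an illustration only — production pipelines use a trained VAD (Silero is a common choice), and the threshold here is an arbitrary assumption:

```python
# Hedged sketch: energy-gate "VAD" that removes near-silent frames before
# decoding, so Whisper never sees long silence. Threshold is illustrative;
# use a real VAD model in production.

def drop_silence(samples: list[float], frame: int = 160, threshold: float = 1e-3):
    """Keep only frames whose mean absolute amplitude clears the threshold."""
    kept: list[float] = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        if sum(abs(s) for s in chunk) / len(chunk) >= threshold:
            kept.extend(chunk)
    return kept

# one speech-like frame followed by one silent frame:
audio = [0.1] * 160 + [0.0] * 160
print(len(drop_silence(audio)))  # only the speech-like frame survives
```

The point is structural, not the specific gate: the decoder only hallucinates on segments it is asked to decode, so removing non-speech segments upstream removes the opportunity.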

Data Handling and Privacy

Managed STT APIs receive your audio on their infrastructure, transcribe it, and often retain some portion of that data for quality monitoring or model improvement unless you explicitly opt out. For any audio containing personally identifiable information, medical content, or attorney-client communication, read the data retention and training policy of your chosen provider before committing. AssemblyAI, Deepgram, and Speechmatics all offer HIPAA BAAs on enterprise tiers. Voxtral, Parakeet, and Cohere Transcribe give you an open-weight escape hatch — you can run them on your own infrastructure and the audio never leaves your network. If your product handles regulated data, this is the single most important axis to evaluate.

Consent and Recording Law

Transcribing a conversation is trivial technically but carries legal weight in many jurisdictions. In two-party-consent states (California, Florida, Illinois, and several others in the US) you generally cannot record a call without all parties’ consent. GDPR in the EU treats voice recordings as personal data with specific retention and deletion rights. Most transcription products build “call recording announcement” features for good reason. Before shipping a product that transcribes audio, verify your consent and disclosure approach with legal counsel — this is not a question the STT vendor will solve for you.

Frequently Asked Questions

Should I still use Whisper?

For a brand-new project, no. Whisper Large v3 sits at position 37 on the current Hugging Face Open ASR Leaderboard — solidly out of the top tier. But “Whisper” as an ecosystem is different from “Whisper as a model.” Running Whisper Large v3 Turbo on Groq is still one of the cheapest and fastest credible options in the category ($0.67 per 1K minutes, 257x real-time, 4.8% WER), and the surrounding ecosystem of derivatives (Distil-Whisper, CrisperWhisper, whisper.cpp, WhisperX) remains the deepest in the category. Pick Whisper on Groq if you want the cheapest OpenAI-compatible endpoint with competent quality; pick Voxtral or Scribe v2 if accuracy is your binding constraint.

Why does the same Whisper model score differently on different hosts?

Because the model weights are only one part of what determines accuracy. The inference stack — how the provider batches audio, whether they run voice-activity detection before decoding, whether they enable Whisper’s fallback re-decoding loop, what hardware they run on, and whether they set condition_on_previous_text to true or false — all matter enormously. On Artificial Analysis, identical Whisper Large v3 weights score 4.2% on fal.ai, 4.8% on Groq and Fireworks, 7.4% on Together.ai, and 10.2% on Replicate’s default Whisper endpoint. That’s a 2.4x gap on the same weights. The practical answer: if you care about Whisper quality, test the specific provider you’re planning to use, don’t assume the number from one host applies to another.

Can these models power a real-time voice agent?

Only some of them. Deepgram Nova-3 (with Flux for turn detection) is the voice-agent market incumbent for a reason — streaming latency, WebSocket stability, and turn detection are first-class concerns. AssemblyAI Universal-Streaming is a credible alternative with sharper end-of-turn detection in some setups. ElevenLabs Scribe v2 Realtime is accuracy-leading but had integration friction with voice-agent frameworks at launch — verify the current state before committing. Mistral Voxtral Mini Realtime (Feb 2026) is the first credible open-weight option if you need to self-host the streaming piece. Skip anything that’s batch-only: Cohere Transcribe, Canary-Qwen, Whisper (any host), Parakeet, and Gemini 3 Pro aren’t voice-agent options without external work.

How good are these models on languages other than English?

It varies widely and you should test on your target language before committing. Whisper claims 99 languages with 2-3x higher WER on non-English at the same quality tier. Voxtral explicitly covers 8 languages plus automatic detection (13 on Voxtral Realtime). Parakeet v3 adds 25 European languages; Parakeet v2 is English-only. Cohere Transcribe covers 14 languages with deliberately strong Asian and Arabic coverage. Speechmatics claims 55+ languages and is the only roundup entry built around multilingual as a primary feature. Gemini and GPT-4o-transcribe handle most major languages well. The rough rule: English-first models can handle major European and Asian languages; low-resource languages work poorly across the board. If you’re building for a non-English market, benchmark on real audio from your target market before picking.

What happens to my audio after a managed API transcribes it?

It depends on the provider. Most managed STT APIs retain audio and transcripts for some period — typically 30 to 90 days — for quality monitoring, even after you cancel. Some providers (Deepgram, AssemblyAI, Speechmatics) let you disable retention as an account setting or enterprise option, and will sign a HIPAA BAA on enterprise tiers that guarantees specific retention and deletion rules. For regulated data, read the retention policy and sign the appropriate agreement before sending audio. For maximum control, use an open-weight option (Voxtral, Parakeet, Whisper, Canary-Qwen, Cohere Transcribe) and keep the audio on infrastructure you control.

What’s the difference between batch and streaming transcription?

Batch transcription processes a complete audio file and returns the full transcript at the end. It’s the right choice for podcasts, meeting recordings, video captioning, audiobook transcription — anywhere the audio already exists when you want to transcribe it. Streaming transcription processes audio in real time as it arrives and returns partial transcripts as the speaker talks. It’s what you need for voice agents, live captioning, real-time interview transcription, and any application where a human is waiting for the transcript. Streaming models typically have slightly worse per-word accuracy than their batch counterparts because they can’t use “future” audio context to refine earlier words. If you need both, pick a provider that ships distinct batch and streaming models (AssemblyAI and Deepgram are the cleanest options) rather than forcing one to do both.
We update this guide regularly as new models launch and existing ones evolve. If you’re still unsure, AssemblyAI Universal-3 Pro is the safest starting point for most builders. Questions or suggestions? Let us know.