By Alex • Updated Apr 12, 2026

Most people landing here already know Claude writes good code and ChatGPT is everywhere. The actual question is which model fits your workflow, your budget, and your harness - and whether any of the open-weight models are finally close enough to matter. The short version: yes, they are, and the top of the leaderboard is not where it was six months ago. We evaluated 50+ models and picked 10.

Best LLMs for Coding

| # | Model | Best For | Type | Access | Price (in / out, $/M) |
|---|-------|----------|------|--------|-----------------------|
| 1 | Claude Opus 4.6 | Best overall for complex, ambiguous coding work | Closed | API | 5 / 25 |
| 2 | Claude Sonnet 4.6 | Best daily driver for most coding work | Closed | API | 3 / 15 |
| 3 | GPT-5.4 | Best for reliability and dual-wield code review | Closed | API + Codex CLI | 2.50 / 15 |
| 4 | Gemini 3.1 Pro | Best for multimodal and architectural review | Closed | API | 2 / 12 |
| 5 | Claude Haiku 4.5 | Best cheap Claude for sub-agents and high-throughput | Closed | API | 1 / 5 |
| 6 | Gemini 3 Flash | Best cheap frontier for high-volume coding | Closed | API + Free tier | 0.50 / 3 |
| 7 | GLM-5.1 | Best open-weight for long-horizon agentic coding | Open-weight (MIT) | API + HF weights | ~1 / 3.20 |
| 8 | Kimi K2.5 | Best open-weight for front-end and visual debugging | Open-weight (MIT*) | API + HF weights | 0.60 / 3 |
| 9 | Qwen3-Coder-Next | Best open-weight for local self-hosting | Open-weight (Apache 2.0) | HF weights + market APIs | Market rates |
| 10 | DeepSeek V3.2 | Best open-weight cost floor | Open-weight | API + HF weights | 0.28 / 0.42 |
One thing the table hides and we need to say up front: LMArena’s general-text leaderboard and its coding leaderboard do not agree. Grok 4.20 and Meta’s Muse Spark are both top-five on general text. Neither one cracks the top 18 on coding. This is the single best argument we can make that “best LLM for coding” is a different question from “best LLM,” and it is why the models below are picked against coding evidence specifically, not against general capability.

1. Claude Opus 4.6: Best overall for complex, ambiguous coding work

Claude Opus 4.6 is the model we kept coming back to for the tasks where we actually cared about the answer. It is #1 on LMArena’s Code leaderboard in both its thinking and non-thinking variants, with nearly 4,500 blind practitioner votes on the standard variant and ~3,700 on thinking. It self-reports 80.8% on SWE-Bench Verified using Anthropic’s own minimal bash-and-edit scaffold, which is the score all the other numbers in this category get compared against. And on ambiguous, multi-file, “I don’t fully know what I want” work, practitioners consistently report that Opus asks a clarifying question where Cursor’s own Composer 2 would just pick an interpretation and run. The things you feel in the first week: it is still expensive at $5 in / $25 out. Reliability is worse than GPT-5.4 - multiple r/ClaudeAI dual-wielders use exactly the phrase “reliability is horrible compared to GPT” when describing how Opus holds up under all-day load. And since April 7, 2026, 1M context is the default for Claude Code users on Max, Team, and Enterprise plans with no beta header and no long-context surcharge, which quietly makes Opus much more usable for repo-scale work than it was at launch.

Key Features

  • Arena Code #1 on human-preference blind voting - the strongest single evidence for “best coding LLM” we found
  • 1M context at standard pricing, now default in Claude Code for Max, Team, and Enterprise (flipped April 7, 2026)
  • Extended thinking with adaptive budget - the model decides how much to think rather than a fixed reasoning effort
  • Computer use and up to 600 images per request for screenshot-driven debugging
  • Native pairing with Claude Code’s deliberately minimal bash + edit tool scaffold

Pros

  • Best-in-class on complex, multi-file, and ambiguous work where other models rush to an interpretation - Opus asks the clarifying question
  • Top of both LMArena Text and LMArena Code, with ~4,500+ votes on the Code leaderboard giving the narrowest confidence interval of any pick in this roundup
  • 1M context at flat pricing makes repo-scale work economically viable - no surcharge above 200K (where Gemini 3.1 Pro’s cliff sits) or 272K (GPT-5.4’s)
  • The Claude Code scaffold is deliberately minimal (bash tool + edit tool + prompt loop), which means less harness magic obscuring model quality
  • The extended-thinking budget is adaptive - Opus spends more thinking on hard problems and less on easy ones without you configuring reasoning effort manually

Cons

  • Rate limits and uptime are meaningfully worse than GPT-5.4. Practitioners who dual-wield report falling back to Codex for all-day workloads specifically because Claude can’t sustain them. No workaround - pay for Max tier and expect capacity issues during peak times
  • $25/M output is the highest price tag in this roundup. If the task is routine, Claude Sonnet 4.6 runs it at the same quality for 60% of the cost
  • Silent regressions are a real concern. A cluster of r/ClaudeCode “Opus 4.6 lobotomized” threads in early April documents tasks Opus used to pass failing consistently after updates, including a reported CoT-training accident affecting roughly 8% of recent RL. If reliability across provider updates matters, dual-wield with GPT-5.4 as a fallback
  • Needs Claude Code to actually hit the benchmarked quality. Opus in Cursor, Windsurf, or Copilot is the same weights but a different scaffold, and the practical experience lags the Anthropic scaffold meaningfully. If you don’t want Claude Code, you’re paying for Opus and getting less of it

Pricing

| Plan | Price | What’s Included |
|------|-------|-----------------|
| Standard API | $5 input / $25 output per M tokens | Full 1M context at flat pricing, extended thinking, computer use, 600 images per request |
| Batch API | $2.50 input / $12.50 output per M tokens | Same capabilities, 50% discount, async only |
| Prompt cache (5 min) | $0.50 / M read | Cache-hit pricing for repeated system prompts |
| Claude Pro | $20/mo | Opus 4.6 access in Claude apps, limited usage |
| Claude Max | $100-200/mo | Opus 4.6 with higher usage, Claude Code default to 1M context |
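To make the per-token rates concrete, here is a minimal cost sketch using the standard and batch prices quoted in this section; the per-task token counts are illustrative assumptions, not measurements.

```python
# Rough per-call cost at the Opus 4.6 rates quoted above.
def task_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of one call, given $/M-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Assumed shape of one agentic step: ~60K tokens of repo context in, ~2K out.
standard = task_cost(60_000, 2_000, in_rate=5.00, out_rate=25.00)
batch = task_cost(60_000, 2_000, in_rate=2.50, out_rate=12.50)
print(f"standard ${standard:.3f} vs batch ${batch:.3f}")
```

Under those assumptions a single step costs $0.35 on the standard API and half that via batch - a reminder that on Opus, input context dominates the bill long before the $25/M output rate does.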

Availability

Anthropic API, AWS Bedrock, GCP Vertex AI, Microsoft Foundry. Works with: Claude Code (native, recommended), Cursor, Zed, GitHub Copilot (Pro+/Enterprise), Cline, Roo Code, Kilo Code, Aider, Continue, OpenHands.

Who It’s For (and Who Should Skip It)

Claude Opus 4.6 is the right pick for professional engineers doing genuinely complex, ambiguous, or multi-file work where the cost of a wrong answer is higher than the cost of an expensive API call. Pair it with Claude Code to get the scaffold Anthropic benchmarks against. For a small number of really hard tasks it is worth every cent of the premium. Skip it if the task is routine - Claude Sonnet 4.6 is within a point on SWE-Bench Verified at 60% of the cost and is the honest recommendation for most day-to-day coding. Skip it if your tolerance for reliability hiccups is low - GPT-5.4 is more consistent and has the new $100 Codex Pro tier that makes heavy use affordable. Skip it if you’re committed to an IDE harness that doesn’t support Claude - Gemini 3.1 Pro is the next frontier-tier option that works in Cursor and Windsurf without the Anthropic dependency.

Try Claude Opus 4.6

2. Claude Sonnet 4.6: Best daily driver for most coding work

If Opus 4.6 is the model you use when it matters, Sonnet 4.6 is the model you use the other 95% of the time. It sits at #3 on Arena Code with ~7,100 votes - the highest vote count of any model in the leaderboard’s top 10, which means its ranking has the tightest confidence interval of any pick in this roundup. It self-reports ~79.6% on SWE-Bench Verified, within a point of Opus’s 80.8%, and it does that at $3 input / $15 output. Sixty percent of Opus’s price for one point of SWE-Bench Verified is the best price-per-capability ratio we found among the frontier closed models. The practical implication is that for most readers this is the pick - not Opus. Cursor’s Auto mode already defaults to Sonnet 4.5/4.6 for most non-complex tasks. Cline’s power-user pattern is to plan with Opus and act with Sonnet. And when a cluster of r/ClaudeCode threads started complaining that Opus 4.6 had “lost its reasoning” in early April 2026, several of the same threads noted that Sonnet 4.6 was still passing the tests that Opus was failing. Start here and upgrade to Opus when you hit Sonnet’s ceiling, not the other way around.

Key Features

  • Arena Code #3 at 7,086 votes - the tightest confidence interval in the leaderboard’s top 10
  • 1M context at $3 / $15 flat, same as Opus, with no long-context surcharge
  • Extended thinking (same adaptive budget as Opus)
  • The model Cursor Auto mode routes routine tasks to instead of Opus
  • Default in Claude Code sessions that don’t explicitly pick Opus

Pros

  • Within ~1 point of Opus on SWE-Bench Verified at 60% of the price, which makes the quality-per-dollar ratio the best of any model in this roundup
  • Highest vote count in Arena Code’s top 10 means the ranking holds up under statistical scrutiny in a way that some thinly-voted entries don’t
  • Works as the “act” model in Cline’s Plan/Act pattern - plan with Opus, act with Sonnet, and pay maybe a quarter of the cost of running everything through Opus
  • Anthropic reports users preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time in A/B testing, which matches the practitioner experience we found in r/ClaudeCode threads

Cons

  • Still ~22 ELO behind Opus on Arena Code and about a point behind on SWE-Bench Verified. On novel or ambiguous work, the gap is small but real. Workaround: Cline’s Plan/Act split with Opus planning and Sonnet acting
  • Shares every reliability and rate-limit complaint that applies to Opus 4.6 - Anthropic’s capacity story is uniform across their current frontier models. Dual-wield with GPT-5.4 for all-day workloads if you hit limits
  • The same silent-regression risk applies: the “CoT-training accident” disclosed in early April 2026 covered Opus 4.6, Sonnet 4.6, and Mythos. If you’re tracking week-to-week quality, treat Claude regressions as a category concern, not a Sonnet-specific one

Pricing

| Plan | Price | What’s Included |
|------|-------|-----------------|
| Standard API | $3 input / $15 output per M tokens | Full 1M context, extended thinking, vision input |
| Batch API | $1.50 input / $7.50 output per M tokens | 50% discount, async |
| Prompt cache (5 min) | $0.30 / M read | Cache-hit pricing |
| Claude Pro | $20/mo | Sonnet in Claude apps, default session model |
| Claude Max | $100-200/mo | Sonnet + Opus with higher usage |
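The Plan/Act economics described in this section can be sketched the same way. The turn counts and token sizes below are assumptions for illustration; only the per-token rates come from the pricing tables.

```python
# Cost of the Plan/Act split: plan with Opus 4.6 ($5/$25 per M), act with
# Sonnet 4.6 ($3/$15 per M), versus running every turn through Opus.
RATES = {"opus": (5.00, 25.00), "sonnet": (3.00, 15.00)}  # ($/M in, $/M out)

def cost(model, tokens_in, tokens_out):
    r_in, r_out = RATES[model]
    return (tokens_in * r_in + tokens_out * r_out) / 1_000_000

# Assumed: one planning turn (40K in / 3K out), five acting turns (30K in / 4K out).
plan_act = cost("opus", 40_000, 3_000) + 5 * cost("sonnet", 30_000, 4_000)
all_opus = cost("opus", 40_000, 3_000) + 5 * cost("opus", 30_000, 4_000)
print(f"plan/act ${plan_act:.3f} vs all-Opus ${all_opus:.3f}")
```

Under these assumptions the split runs about $1.03 versus $1.53 all-Opus; the savings grow with the number of acting turns, which is why Cline users route the loop through Sonnet.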

Availability

Anthropic API, AWS Bedrock, GCP Vertex AI, Microsoft Foundry. Works with: Claude Code (default), Cursor (Auto mode), Zed Pro (hosted tier includes it), GitHub Copilot, Cline (Act role), Roo Code, Kilo Code, Aider, Continue.

Who It’s For (and Who Should Skip It)

Sonnet 4.6 is the right pick for the professional engineer doing day-to-day coding work in a real codebase - the meetings-to-PR-to-test-run loop that most working developers actually live in. It is our honest recommendation for the single model most readers should start with. Upgrade to Opus only when you hit Sonnet’s ceiling on specific hard problems. Skip it if you’re doing unusually ambiguous work where clarifying-question behavior pays for itself - Claude Opus 4.6 is the right pick there. Skip it if you need GPT’s reliability under sustained load - GPT-5.4 wins on uptime. Skip it if you’re committed to an open-weight stack - GLM-5.1 or Kimi K2.5 are the real answers there.

Try Claude Sonnet 4.6

3. GPT-5.4: Best for reliability and dual-wield code review

GPT-5.4 is the coding model you use when reliability matters more than the last five points of SWE-Bench Verified. It ranks #6 on LMArena Code at 1457 ELO - notably, the Arena leaderboard tags OpenAI’s entries as (codex-harness), which is the clearest public acknowledgment that GPT-5.4’s benchmark scores are coupled to its scaffold. It posts an Intelligence Index of 57.2 on Artificial Analysis, tied with Gemini 3.1 Pro at #2. And it’s the first frontier model to beat human average on OSWorld-Verified with a 75% score versus the 72.4% human baseline, which is the most striking single capability claim in this roundup. The part that matters more than the benchmarks is that practitioners dual-wielding Claude and GPT-5.4 consistently describe the split the same way: Claude writes the code, GPT-5.4 catches the small correctness bugs Claude overlooks. That’s a real editorial pattern, and on April 10, 2026 OpenAI made it cheaper to run by halving ChatGPT Pro to $100/mo specifically for heavy Codex users - five times the Plus-tier Codex usage at half the old Pro price. If your workflow is “Claude writes, reviewer catches,” the reviewer slot got meaningfully more affordable this week.

Key Features

  • OSWorld-Verified 75% - the first frontier model to beat human average on the standard computer-use benchmark
  • Five configurable reasoning effort levels (none, low, medium, high, xhigh)
  • OpenAI Codex CLI as the reference harness: /review against branch or commit, plan mode, multi-agent v2 workflows, sandboxed-by-default, MCP support
  • Absorbs the retired o-series (o1, o3, o3-pro, o4-mini) into one frontier model - anyone searching for “o3 for coding” should be using GPT-5.4 high instead
  • Computer use, function calling, MCP and tool search, structured output, image input

Pros

  • Reliability and uptime are meaningfully better than Claude’s - GPT-5.4 “wins in terms of seemingly unlimited usage and very reliable uptime” in r/ClaudeAI dual-wield threads. Matters most if you depend on a model for a full workday
  • On April 10, 2026, ChatGPT Pro dropped from $200 to $100/mo with 5x the Codex usage of Plus. That makes it the cheapest high-volume frontier coding subscription we compared. See also the Codex CLI below
  • OSWorld 75% means computer-use benchmarks crossed a real line: GPT-5.4 is the first frontier model to beat the human average on a published agentic desktop-control eval
  • Codex CLI is the best-tooled first-party agent harness in this roundup. The /review command, plan mode, multi-agent v2 workflows, and sandboxed defaults are all tuned for GPT-5.4 specifically
  • In multi-model workflows, GPT-5.4’s role is consistent and citable - it catches the small correctness bugs Claude overlooks. That’s a real and useful job

Cons

  • Code quality on ambiguous work trails Claude. Practitioners report that Claude Opus 4.6 “blows GPT out of the water” when the setup is good and the task is genuinely hard. If you mostly care about peak code quality, Claude Opus 4.6 is the right pick
  • Long-context surcharge kicks in above 272K tokens - 2x input, 1.5x output for the entire session. Claude’s 1M is flat. If your workflow routinely crosses 272K, you pay a meaningful premium with GPT-5.4
  • LMArena’s explicit (codex-harness) tagging tells you bare-API GPT-5.4 is not the benchmarked product. Plan to use Codex CLI, Copilot, or a similarly-capable harness to get the numbers. Raw API + a weak scaffold gives you lower performance
  • Knowledge cutoff August 31, 2025 - behind Claude 4.6 in some cases. For coding this rarely matters (library docs move slowly), but if you’re asking about very recent frameworks, it’s a gap

Pricing

| Plan | Price | What’s Included |
|------|-------|-----------------|
| API Standard | $2.50 input / $15 output per M tokens | Full 1.05M context, reasoning, computer use |
| API Cached input | $0.25 / M | For repeated system prompts |
| API Batch | $1.25 input / $7.50 output per M | 50% discount, async |
| ChatGPT Plus | $20/mo | GPT-5.4 in ChatGPT + Codex with limited usage |
| ChatGPT Pro (new) | $100/mo | 5x Codex usage limits vs Plus, priority access - dropped from $200 on April 10, 2026 |
| GPT-5.4 Pro (API) | $30 input / $180 output per M | Highest reasoning ceiling variant - use only for problems you can’t crack with high reasoning |
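The 272K surcharge flagged in the cons is easy to underestimate because it re-prices the whole session, not just the overage. A sketch, treating the trigger rule as the article describes it (simplified to input tokens) and with token counts as assumptions:

```python
# Session cost under the GPT-5.4 long-context surcharge: above 272K tokens
# the entire session bills at 2x input / 1.5x output.
BASE_IN, BASE_OUT = 2.50, 15.00   # $/M tokens, standard API rates
THRESHOLD = 272_000

def session_cost(tokens_in, tokens_out):
    mult_in, mult_out = (2.0, 1.5) if tokens_in > THRESHOLD else (1.0, 1.0)
    return (tokens_in * BASE_IN * mult_in + tokens_out * BASE_OUT * mult_out) / 1_000_000

small = session_cost(200_000, 8_000)  # below the cliff: $0.62
large = session_cost(300_000, 8_000)  # above it: $1.68, more than 2x for 1.5x the input
```

That step change is why the cons recommend Claude’s flat 1M pricing for workflows that routinely cross 272K.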

Availability

OpenAI API, Azure AI Foundry (28+ regions), OpenRouter. Works with: OpenAI Codex CLI (native, recommended), GitHub Copilot, Cursor, Cline, Roo Code, Kilo Code, Aider, Continue, Windsurf.

Who It’s For (and Who Should Skip It)

GPT-5.4 is the right pick for professional engineers who need a model they can depend on all day, for teams already in the ChatGPT or GitHub/Copilot ecosystem, and for anyone dual-wielding Claude as a writer with GPT as a reviewer. The new $100 Codex Pro tier makes it the cheapest way to get sustained frontier coding use in 2026. Skip it if you need peak code quality on genuinely hard problems - Claude Opus 4.6 is the honest answer. Skip it if you routinely work with prompts above 272K tokens - Claude’s flat 1M is a better fit. Skip it if you’re committed to an open-weight stack - GLM-5.1, Kimi K2.5, or Qwen3-Coder-Next all give you meaningful capability at a fraction of the cost.

Try GPT-5.4

4. Gemini 3.1 Pro: Best for multimodal and architectural review

Gemini 3.1 Pro is the frontier coding model you pick when the code lives next to images, PDFs, or design mocks - or when you’re using it as the architectural reviewer in a three-model workflow. It sits at Arena Code #7 with ~5,500 votes and self-reports 78.8% on SWE-Bench Verified, putting it within striking distance of Claude and GPT at a slightly lower list price. It’s the only model in this roundup with native text + image + audio + video + PDF input, which is the asymmetry that matters: if part of your coding loop involves screenshots of bugs, Figma files, or PDF API documentation, Gemini handles them in the same prompt. The honest caveat is the one practitioners name most often. Gemini 3.1 Pro is polarizing in a specific way: strong on single-pass generation, softer on multi-turn iteration. The r/singularity thread “How is Gemini 3.1 at the top of SWE-bench?” has a top comment (62 upvotes) saying “Gemini is constantly criticized on Reddit, but in my experience, it’s the best model,” followed by another (57 upvotes) saying “Gemini is good as long as you don’t iterate, which is conveniently how most of these benchmarks work.” Both are true. If your workflow is “write the thing once, move on,” Gemini is as good as anything. If it’s “write, run tests, fix, rerun, iterate for an hour,” you’ll feel the gap.

Key Features

  • Native multimodal input: text + image + audio + video + PDFs - up to 3,000 images, ~45 minutes of video, 8.4 hours of audio per session
  • 1M context window
  • Configurable thinking levels (minimal, low, medium, high - default high)
  • Native Google Search and Google Maps grounding
  • Computer use, code execution, function calling, structured output
  • Free tier via Google AI Studio (with content used to improve products - not for compliance-sensitive work)

Pros

  • The only frontier coding model with full multimodal input - screenshots of UI bugs, Figma exports, PDF API docs all go in the same prompt. Matters most if your job involves front-end work or anything with visual debugging
  • Plays the “architectural reviewer” role in dual-wielding workflows. r/ClaudeAI practitioners consistently report that Gemini catches broader architectural issues that Claude and GPT-5.4 both miss
  • $2 input / $12 output is the cheapest frontier-tier price in the shortlist below the 200K context threshold
  • AA Intelligence Index 57, tied with GPT-5.4 at #2 - evidence that Gemini is a peer-tier frontier model, not a close second
  • Free tier via Google AI Studio is the only free path to frontier-adjacent coding in this roundup. Real option for learners and indie developers who don’t want to pay upfront

Cons

  • Softer on iteration than on single-pass generation. If your workflow involves long agent loops with test runs, file reads, and back-and-forth, you’ll feel Gemini weaken where Claude and GPT hold up. The “good as long as you don’t iterate” quote is real. Workaround: use Gemini for the first pass and iterate in a different model, or pair with Cline’s Plan/Act split
  • Long-context surcharge starts at just 200K - 2x input, 1.5x output. That’s the steepest cliff in the shortlist. Claude is flat to 1M; GPT’s cliff is at 272K. If your sessions regularly cross 200K, Gemini becomes expensive fast
  • Still in preview as of this month. Google’s preview labeling implies some API stability and SLA caveats, even though Cursor, Copilot, Cline, and Windsurf treat it as a production model
  • Max output is 64K tokens - half of Claude Opus 4.6 and GPT-5.4’s 128K. Bites on long-form refactor generation
  • No first-party coding harness on the level of Claude Code or OpenAI Codex CLI. Gemini’s best coding experience is always through third-party harnesses like Cursor, Zed, Copilot, or Cline

Pricing

| Plan | Price | What’s Included |
|------|-------|-----------------|
| API Standard (≤200K) | $2 input / $12 output per M tokens | 1M context, full multimodal, thinking, computer use |
| API Long context (>200K) | $4 input / $18 output per M tokens | 2x input, 1.5x output above the 200K cliff |
| API Batch / Flex | 50% of standard | Async, discounted |
| Google AI Studio | Free tier | Content may be used to improve products - don’t use for compliance-sensitive work |
| Gemini Advanced | $19.99/mo | Gemini 3.1 Pro in the Gemini consumer app with Workspace integration |
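The 200K cliff works out as follows, taking the whole-session re-pricing in the table literally; the output token count is an assumption for illustration.

```python
# Gemini 3.1 Pro pricing cliff: sessions at or below 200K input bill at
# $2/$12 per M; above it, the whole session bills at $4/$18.
def gemini_pro_cost(tokens_in, tokens_out):
    if tokens_in <= 200_000:
        r_in, r_out = 2.00, 12.00
    else:
        r_in, r_out = 4.00, 18.00
    return (tokens_in * r_in + tokens_out * r_out) / 1_000_000

# Crossing the cliff by a single token nearly doubles the bill:
below = gemini_pro_cost(200_000, 10_000)  # $0.52
above = gemini_pro_cost(200_001, 10_000)  # ~$0.98
```

This is the steepest step in the shortlist, which is why the cons call it out for anyone whose sessions regularly hover near 200K.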

Availability

Google AI Developer API, Google Cloud Vertex AI, OpenRouter. Works with: Cursor, Zed (BYO key), GitHub Copilot, Windsurf, Cline, Roo Code, Kilo Code, Aider, Continue, OpenHands. No Claude Code (Claude-only), no OpenAI Codex CLI (OpenAI-only).

Who It’s For (and Who Should Skip It)

Gemini 3.1 Pro is the right pick for developers whose workflow mixes code and visual input - design-to-component work, UI debugging with screenshots, PDF-sourced API specs - and for anyone who dual-wields Claude or GPT and needs the architectural reviewer in the rotation. Pair it with Cursor, Zed, or Copilot for the best experience. Skip it if your workflow is iterative agent loops with lots of test runs - Claude Opus 4.6 or GPT-5.4 hold up better. Skip it if long context is central to your job - the 200K cliff is steep. Skip it if you want a first-party vendor-native coding harness - Google doesn’t yet ship one at the level of Claude Code or Codex CLI.

Try Gemini 3.1 Pro

5. Claude Haiku 4.5: Best cheap Claude for sub-agents and high-throughput

Claude Haiku 4.5 is the fast, cheap Claude you use when the task is volume or parallelism, not peak quality. It reports 73.3% on SWE-Bench Verified with extended thinking - roughly 7 points behind Opus 4.6 at one-fifth the output price - and it runs at 99.2 tokens per second per Artificial Analysis, meaningfully faster than Opus or Sonnet. It was the first Haiku to get extended thinking, computer use, and vision input, which is what made it interesting as a sub-agent tier rather than just “the budget option.” The reason to pick Haiku 4.5 over Gemini 3 Flash or DeepSeek V3.2 is that it stays in the Claude ecosystem. Claude Code natively spawns Haiku sub-agents when a task can be parallelized (“read these five files in parallel”). Cursor Auto rotates Haiku as one of the default cheap-tier models. Cline’s Plan/Act split with Haiku as the Act model is the cheapest way to run the Anthropic plan-with-Opus pattern. If your answer to everything is already Claude, Haiku is the save-money path without leaving the family.

Key Features

  • SWE-Bench Verified 73.3% with 128K extended thinking - the best SWE-Bench-per-dollar in the shortlist
  • 99.2 tokens per second output speed, notably faster than Opus and Sonnet
  • First Haiku model with extended thinking, computer use, and vision input
  • Native sub-agent spawning in Claude Code for parallelized tasks
  • 200K context - shorter than the 4.6 family but sufficient for most day-to-day work

Pros

  • Best quality-per-dollar we measured: within 7 points of the SWE-Bench Verified frontier at $1 / $5 per M tokens
  • The native sub-agent model for Claude Code - the harness automatically uses Haiku when tasks can be run in parallel, which compounds the cost savings on large jobs
  • Runs at 99 tokens per second, which makes interactive workflows feel meaningfully quicker than Opus or Sonnet
  • The full Claude feature set (extended thinking, computer use, vision, MCP) in the cheap tier. Nothing else in the shortlist offers the same capability bundle at this price

Cons

  • 200K context is one-fifth of the 4.6 family’s 1M window. For repo-scale work or long-running agent loops, you will feel it. Workaround: use Sonnet or Opus for the loop, delegate parallelized sub-tasks to Haiku
  • On ambiguous or multi-file work, Haiku’s solutions are noticeably less sharp than Opus or Sonnet. The gap is real when the task isn’t well-specified
  • Haiku 4.5 is still from the 4.5 generation, not 4.6. Anthropic has not announced a Haiku 4.6, so Haiku is technically one generation behind the current flagship family
  • If you’re not committed to the Claude ecosystem, Haiku is not the best “cheap model” in absolute terms - Gemini 3 Flash at $0.50 / $3 is cheaper and faster, and DeepSeek V3.2 is cheaper still

Pricing

| Plan | Price | What’s Included |
|------|-------|-----------------|
| Standard API | $1 input / $5 output per M tokens | 200K context, extended thinking, computer use, vision |
| Batch | $0.50 input / $2.50 output per M | 50% async discount |
| Prompt cache (5 min) | $0.10 / M read | Cache-hit pricing |
| Claude Pro | $20/mo | Includes Haiku in Claude apps |
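One way to read “SWE-Bench-per-dollar” is score divided by a blended token rate. The scores and prices below come from this article; the 80/20 input/output mix is our assumption, since agentic coding is usually input-heavy.

```python
# Quality-per-dollar sketch for the three Claude tiers in this roundup.
MODELS = {
    # name: (SWE-Bench Verified %, $/M input, $/M output) -- per the article
    "Opus 4.6":   (80.8, 5.00, 25.00),
    "Sonnet 4.6": (79.6, 3.00, 15.00),
    "Haiku 4.5":  (73.3, 1.00,  5.00),
}

def points_per_dollar(score, r_in, r_out, in_frac=0.8):
    blended = in_frac * r_in + (1 - in_frac) * r_out  # $/M at the assumed mix
    return score / blended

ranked = sorted(MODELS, key=lambda m: points_per_dollar(*MODELS[m]), reverse=True)
print(ranked)  # Haiku ranks first under this blend
```

Under this blend Haiku delivers roughly 4.5x Sonnet’s score-per-dollar, which is the whole argument for delegating parallelizable sub-tasks to it.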

Availability

Anthropic API, AWS Bedrock, GCP Vertex AI, Azure AI Foundry, OpenRouter. Works with: Claude Code (sub-agent role), Cursor (Auto mode cheap tier), GitHub Copilot, Zed, Cline (Act role), Roo Code, Aider, Continue.

Who It’s For (and Who Should Skip It)

Claude Haiku 4.5 is the right pick if you’re committed to the Claude ecosystem and you need to run a lot of work cheaply - parallel sub-agent tasks, high-throughput batch work, long-running agent loops where a cheaper model is acceptable, or cost-sensitive daily driving from users who want Claude specifically. Skip it if you need peak quality - Claude Sonnet 4.6 is the right balance. Skip it if you need the cheapest absolute cost - Gemini 3 Flash or DeepSeek V3.2 are cheaper. Skip it if you need 1M context - Haiku’s 200K is the hard ceiling.

Try Claude Haiku 4.5

6. Gemini 3 Flash: Best cheap frontier for high-volume coding

Gemini 3 Flash is proof that “cheap frontier” is not an oxymoron. At $0.50 input / $3 output per M tokens - flat pricing across the entire 1M context window, no long-context surcharge - it sits at #12 on LMArena Code with more than 13,000 votes. That’s one of the highest-confidence rankings on the leaderboard, and it puts Flash in the same leaderboard neighborhood as frontier models ten times more expensive. Output speed runs at 180 tokens per second per Artificial Analysis, which is almost twice what GPT-5.4 manages. What you trade at this price is the same thing you trade with Gemini 3.1 Pro, just amplified: Flash is strongest on single-pass generation and weakest on multi-turn iteration loops. It’s also still in preview. But the combination of a real free tier (via Google AI Studio, with the usual “content may be used for training” caveat), flat 1M context pricing, and native multimodal input makes this the model that indie developers, learners, and anyone routing cost-per-task above quality-per-task should start with.

Key Features

  • $0.50 input / $3 output per M tokens - cheapest frontier-adjacent price in the shortlist
  • 180 tokens per second output speed - fastest model we tested
  • 1M context with flat pricing (no long-context cliff the way Gemini 3.1 Pro has at 200K or GPT-5.4 has at 272K)
  • Full multimodal input inherited from the family: text + image + audio + video + PDF
  • Free tier via Google AI Studio - the only frontier-family free tier in this roundup
  • Configurable thinking levels (minimal, low, medium, high - default high)

Pros

  • Genuinely cheap frontier-adjacent performance. Arena Code #12 at ten times lower prices than the leaderboard neighbors it sits next to
  • 180 tokens per second feels noticeably faster in interactive sessions than Claude or GPT
  • Flat pricing on 1M context is unique in the “cheap tier” - Claude Haiku tops out at 200K, DeepSeek V3.2 at 128K, Kimi K2.5 at 256K. If you need long context and cheap pricing at the same time, Flash is the only option
  • Free tier via Google AI Studio is the only real free-access frontier path. Valuable for learners, students, and indie developers before committing to paid tiers
  • Full multimodal input at this price is unmatched - no other model under $1 / $5 accepts images, video, and PDFs

Cons

  • Same “good if you don’t iterate” caveat as Gemini 3.1 Pro. Flash is strongest on single-pass generation and weakest on multi-turn agent loops. If your workflow is iterative, pair Flash for the easy steps and a stronger model for the hard ones
  • Still in preview. Google’s preview labeling has stuck around for months - preview does not mean unstable, but it does mean Google reserves more freedom to change behavior and pricing than on GA models
  • Free tier content may be used to improve Google’s products. For proprietary code, use a paid tier
  • AA Coding Index 37.8 versus the 50+ range for Opus, GPT-5.4, and Gemini 3.1 Pro. On quality-critical work the gap is real - Flash is a volume model, not a peak-quality model

Pricing

| Plan | Price | What’s Included |
|------|-------|-----------------|
| API Standard | $0.50 input / $3 output per M tokens | Full 1M context, flat pricing, thinking, computer use, multimodal input |
| Google AI Studio | Free tier | Rate-limited, content may be used for training |
| API Batch / Flex | $0.25 input / $1.50 output per M | 50% async discount |
| Context caching | $0.05 / M read | Cache pricing for repeated system prompts |
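To put the 180 tokens/second figure in dollar terms, here is a rough continuous-generation cost per hour. The speeds and rates come from this article, except the GPT-5.4 speed, which is a rough assumption derived from the “almost twice” comparison above; continuous generation is an idealization.

```python
# Dollar cost of an hour of nonstop output generation at a given speed and rate.
def output_cost_per_hour(tokens_per_sec, out_rate_per_m):
    return tokens_per_sec * 3600 * out_rate_per_m / 1_000_000

flash = output_cost_per_hour(180, 3.00)   # ~$1.94/hr at Flash's $3/M rate
gpt54 = output_cost_per_hour(95, 15.00)   # assumed ~95 tok/s at GPT-5.4's $15/M
```

Even generating flat-out, Flash’s output bill stays around $2 an hour, which is what makes it viable as a volume model rather than a peak-quality one.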

Availability

Google AI Developer API, Google Cloud Vertex AI, OpenRouter. Works with: Cursor (Auto mode cheap tier), Cline, Roo Code, Kilo Code, Aider, Continue, OpenHands, Windsurf.

Who It’s For (and Who Should Skip It)

Gemini 3 Flash is the right pick for indie developers, learners, and anyone running high-volume coding workloads where cost-per-task matters more than the last 10 points of SWE-Bench. The free tier through Google AI Studio is a genuine path to trying frontier-family coding without a credit card. Skip it if you need peak quality on hard tasks - Claude Opus 4.6 or Gemini 3.1 Pro. Skip it if iteration-heavy agent loops are your primary workflow - Flash weakens there. Skip it if you’re in a compliance-sensitive environment and need a no-training guarantee on the free tier - the paid tier is fine, but don’t rely on the free tier for proprietary work.

Try Gemini 3 Flash

7. GLM-5.1: Best open-weight for long-horizon agentic coding

GLM-5.1 is the model that put the open-weight coding conversation back in the center of the room in April 2026. Z.ai (formerly Zhipu AI) shipped it as a 754B-parameter MoE under a permissive MIT license, and self-reported 58.4% on SWE-Bench Pro - the hardest SWE-Bench variant - which if it holds up puts it ahead of GPT-5.4 at 57.7% and Claude Opus 4.6 at 57.3%. Those are all vendor numbers and none of them are independently reproduced yet, but even if you knock a few points off for self-reporting bias GLM-5.1 is still in the frontier conversation at a fraction of the cost and with fully open weights. The distinctive claim is long-horizon work. Z.ai’s leader Lou said on X that “agents could do about 20 steps by the end of last year. GLM-5.1 can do 1,700 now. Autonomous work time may be the most important curve after scaling laws.” Z.ai demonstrated GLM-5.1 running an 8-hour autonomous task on a single problem - 600+ iterations on a vector database optimization scenario - without plateau. If you’re building an agent that runs unattended for hours, GLM-5.1 is the open-weight model actually designed for that shape of work.

Key Features

  • 754B parameter MoE with 40B active (256 experts, 8 active per token)
  • MIT license on open weights - the cleanest license on an open-weight coding model we found
  • SWE-Bench Pro 58.4% self-reported as the top score (cross-check pending independent verification)
  • Sustained 8-hour autonomous sessions with 600+ iterations demonstrated
  • Arena Code #9 as GLM-5 with ~4,500 votes - the highest open-weight ranking in the Arena Code top 10
  • Terminal-Bench 2.0 at 63.5%
  • Trained on 100,000 Huawei Ascend chips - zero NVIDIA in the training stack

Pros

  • The honest “open-weight that is actually competitive with closed frontier models” pick. If the self-reported SWE-Bench Pro numbers hold, GLM-5.1 is within a point of GPT-5.4 and Opus 4.6 at a fraction of the price
  • MIT-licensed open weights. Commercial use, redistribution, modification all allowed with no ambiguity. That’s genuinely rare among frontier-competitive open-weight models
  • Long-horizon agentic capability is a category Z.ai is uniquely optimizing for. If your use case is “run this agent unattended for hours,” GLM-5.1 is the only open-weight model explicitly designed for that shape
  • Broad harness support: works in OpenHands (the best open-weight agentic harness), Goose (added a Zhipu provider in v1.28), Cline, Kilo Code, Roo Code, Aider, Claude Code via compatibility mode
  • Hosted API pricing around $1 input / $3.20 output per M tokens - meaningfully cheaper than frontier closed models at comparable claimed benchmark levels

Cons

  • All the standout benchmark numbers are 1st-party from Z.ai. The 58.4% SWE-Bench Pro, the 8-hour sessions, the 1,700-step agent runs - none of these have independent reproduction yet. Treat as “credible and worth testing yourself,” not “proven”
  • Trails Opus 4.6 on LMArena Code by ~100 ELO. On blind human-preference voting, the gap versus Anthropic’s top models is real - GLM-5.1’s value is cost, openness, and long-horizon capability, not raw quality
  • Z.ai / Zhipu has been on the US Commerce Department Entity List since January 2025. Open-weight self-hosting is legal for most users; signing a commercial contract with Z.ai for US enterprise work may hit procurement friction. Plan accordingly
  • The full 754B/40B-active model needs 128GB+ RAM to run locally. For consumer hardware, you want the GLM-4.5-Air variant or a hosted API. Full GLM-5.1 is workstation or cloud only
  • 200K context window is shorter than Claude, GPT-5.4, or Gemini’s 1M windows. For repo-scale work the gap matters

Pricing

Plan | Price | What’s Included
Z.ai API | ~$1 input / ~$3.20 output per M tokens | 200K context, long-horizon agent support, MIT weights
Self-host | Free (compute only) | Full model on Hugging Face under MIT license
OpenRouter / Together / Fireworks | Market rates | Third-party hosted access

Availability

Z.ai API, Hugging Face (open weights), third-party hosting via OpenRouter, Together, Fireworks, DeepInfra. Works with: OpenHands (recommended for agentic work), Goose (native Zhipu provider), Cline, Kilo Code, Roo Code, Aider, Continue, Claude Code via compatibility mode.

Who It’s For (and Who Should Skip It)

GLM-5.1 is the right pick for self-hosters, privacy-first users, and open-weight advocates who need frontier-competitive coding capability - and especially for anyone building unattended agents that run for hours. The MIT license is the cleanest in the open-weight shortlist, and the long-horizon claim is the most differentiated. Pair with OpenHands for the best agentic harness. Skip it if you need independently verified benchmarks before committing - the numbers are credible but unproven. Skip it if you’re a US enterprise with procurement review sensitive to Entity List vendors - self-hosting is fine, commercial contracts with Z.ai are not. Skip it if you need 1M context - GLM’s 200K is the hard ceiling. Try GLM-5.1

8. Kimi K2.5: Best open-weight for front-end and visual debugging

Kimi K2.5 is the open-weight model that a frontier Western coding tool picked to build its own model on top of. Cursor’s Composer 2, launched in March 2026, beat Claude Opus 4.6 on Cursor-internal benchmarks at roughly ten times lower inference cost, and within 24 hours of launch a leaked API header revealed that Composer 2’s base model was Kimi K2.5 Thinking with continued pretraining and reinforcement learning on top. That’s the strongest commercial endorsement of an open-weight coding model we found - Cursor, a $29.3 billion company, trusted Kimi enough to stake their flagship product on it. The K2.5 family ships in two variants. K2.5 Thinking is the reasoning-capable version at $0.60 input / $3 output per M tokens, sitting at Arena Code #14 with ~6,700 votes. K2.5 Instant is the non-reasoning, faster variant at $0.38 input / $1.72 output per M tokens, sitting at #17. Both are multimodal (text + vision + video input) and both ship under a Modified MIT license. Moonshot also released Kimi Claw in April 2026 as their own hosted agent platform, bundling Kimi K2.5 Thinking with an OpenClaw deployment, 5,000 community skills, and 40GB cloud storage - a one-click alternative to Claude Code for users who want hosted open-weight coding.

Key Features

  • Arena Code #14 as Thinking variant, #17 as Instant variant - both in the leaderboard’s top 20
  • Base model for Cursor Composer 2 (confirmed by API header leak March 2026)
  • Multimodal input: text + vision + video - strong for front-end, UI debugging, design-to-component work
  • BrowseComp 74.9% - leads open-weight models on browser/agent use
  • Modified MIT open-weight license
  • Kimi Code CLI, Kimi Claw hosted agent platform, platform.moonshot.ai API, Hugging Face weights
  • Agent Swarm orchestration: up to 100 sub-agents across 1,500 steps, with 3-4.5x end-to-end speedup (Moonshot claim)

Pros

  • The single strongest commercial validation of any open-weight coding model in this shortlist - Cursor picked Kimi K2.5 as the base model for Composer 2 after evaluating alternatives
  • Kimi K2.5 Instant at $0.38 input / $1.72 output is one of the cheapest frontier-adjacent options in the roundup, approaching DeepSeek V3.2’s cost floor
  • Native multimodal input is rare among cheap open-weight models - most open-weight coding models are text-only. Kimi’s vision and video input matters for front-end and visual debugging workflows
  • BrowseComp 74.9% leads open-weight models on browser/agent use, which correlates with agentic coding quality in long-running workflows
  • Active iteration: Kimi K2 → K2 Thinking → K2.5 (Thinking + Instant) over about six months of shipping. Moonshot’s momentum is real

Cons

  • Practitioners in r/LocalLLaMA report Kimi trails GLM on Rust and Swift work specifically. One direct quote: “Kimi is good but it is not as intelligent as GLM.” If your main languages are Rust or Swift, GLM-5.1 is the better open-weight pick
  • Arena Code #14 puts Kimi K2.5 Thinking meaningfully behind the Claude 4.6 cluster at top 3 and GPT-5.4 at #6. On raw quality the gap is real - Kimi’s value is cost and multimodality, not peak capability
  • Cursor’s RL on top of the Kimi base is doing real work. Raw Kimi K2.5 outside Cursor Composer 2 may not match Composer 2’s observed performance. The commercial validation cuts both ways: Cursor picked Kimi because it was good enough to build on, not because it was already at the top of the leaderboard
  • Published parameter counts and training specs are thin in English-language sources compared to Qwen or DeepSeek. If you’re doing serious due diligence on architecture, you’ll want to chase Chinese-language sources

Pricing

Plan | Price | What’s Included
Moonshot API - K2.5 Thinking | $0.60 input / $3 output per M tokens | 256K context, reasoning, multimodal, tool use
Moonshot API - K2.5 Instant | $0.38 input / $1.72 output per M tokens | Same capabilities, no reasoning mode
Self-host | Free (compute only) | Full model on Hugging Face, Modified MIT license
Kimi Claw | Consumer subscription | Hosted agent platform, 5,000 community skills, 40GB cloud storage

Availability

platform.moonshot.ai API, Hugging Face (weights), third-party hosting via OpenRouter, Together, Fireworks. Works with: Cursor (as Composer 2 base), OpenHands (recommended for self-host), Cline, Aider, Continue, Goose, Roo Code, Kimi Code CLI (Moonshot’s own).

Who It’s For (and Who Should Skip It)

Kimi K2.5 is the right pick for open-weight users whose work involves front-end, visual debugging, or design-to-component workflows - the multimodal input is the differentiator - and for developers who want the cheapest open-weight “instant” tier at $0.38 input / $1.72 output. It’s also the right pick if you’re a Cursor user curious about what’s actually running under Composer 2. Skip it if your main languages are Rust or Swift - GLM-5.1 is the better pick there. Skip it if you want the cleanest Apache 2.0 license on open weights - Qwen3-Coder-Next is cleaner. Skip it if peak code quality matters more than cost and multimodality - frontier closed models are still ahead. Try Kimi K2.5

9. Qwen3-Coder-Next: Best open-weight for local self-hosting

Qwen3-Coder-Next is the open-weight coding model that actually runs on your laptop. The architecture is 80B total parameters with only 3B active at a time (a Mixture-of-Experts pattern), which means you get 80B-class capability at roughly 3B-class inference cost, and consumer-grade workstation hardware can run it reliably. AMD’s 2026 testing of 20+ local models found that only three reliably handle demanding agentic tool use on consumer hardware: Qwen3-Coder 30B, GLM-4.5-Air (which needs 128GB+ RAM), and DeepSeek V3.2 at 4-bit quantization. Qwen is the one with the lowest hardware bar. The license is the other thing to name directly. Qwen3-Coder-Next is Apache 2.0 - fully permissive, commercial use allowed, redistribution allowed, no restrictions, no Entity List friction, no Llama Community License carveouts. That’s the cleanest license in the open-weight shortlist. On the quality side, it reportedly tops SWE-rebench at Pass@5 - the community-preferred contamination-resistant benchmark - with r/LocalLLaMA practitioners describing it as “local private coding is SOTA or almost SOTA now.” Simon Willison’s post “Something is afoot in the land of Qwen” is the one high-quality independent English-language analytical post on this model, and his framing lines up with the r/LocalLLaMA consensus.
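The reason the 80B-total / 3B-active split matters is that weight memory scales with total parameters, while per-token compute scales only with active parameters. A rough back-of-envelope sketch (round numbers and simplifications are ours; KV cache and activations are ignored):

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: int) -> dict:
    """Approximate weight memory (GB) and per-token compute for an MoE model.

    Billions of params x bytes-per-weight gives GB directly, since
    1e9 params x 1 byte = 1 GB. Ignores KV cache and activation memory.
    """
    bytes_per_weight = bits_per_weight / 8
    return {
        # Every expert must sit in memory, even though few fire per token.
        "weights_gb": round(total_params_b * bytes_per_weight, 1),
        # Forward-pass FLOPs per token ~ 2 x active params (GFLOPs here).
        "gflops_per_token": round(2 * active_params_b, 1),
    }

qwen = moe_footprint(80, 3, bits_per_weight=4)    # 4-bit quantized
glm = moe_footprint(754, 40, bits_per_weight=4)
# qwen: ~40 GB of weights but only ~6 GFLOPs/token - workstation territory.
# glm at 4-bit: ~377 GB of weights - hosted or multi-GPU, not a laptop.
```

This is why the same MoE trick that makes Qwen locally viable does nothing for GLM-5.1's hosting bar: active parameters set the speed, but total parameters set the memory bill.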

Key Features

  • 80B total parameters / 3B active MoE (Qwen3-Next base) - runs on workstation hardware
  • Apache 2.0 open-weight license - the cleanest in the shortlist
  • 256K native context, extends to ~1M with YaRN scaling
  • Coding-specialized training from the Qwen3-Coder line (Qwen3-Coder 480B, 30B, and Next variants)
  • Trained using 20,000 parallel reinforcement learning environments per Alibaba’s blog - coding-specific training at scale
  • Works inside Claude Code via compatibility mode, plus native support in Cline, Aider, Continue, OpenHands, Roo Code, Kilo Code, Goose

Pros

  • The only 80B-class open-weight coding model that reliably runs locally on consumer hardware. Matters most if you’re a self-hoster, privacy-first user, or working in an air-gapped environment
  • Apache 2.0 license is clean enough to ship in commercial products without license review headaches - no Llama Community License restrictions, no DeepSeek weights carveouts, no Entity List procurement friction
  • Reported SWE-rebench Pass@5 leader per r/LocalLLaMA practitioner testing. SWE-rebench is the community-preferred replacement for SWE-Bench Verified specifically because it’s contamination-resistant
  • Works inside Claude Code via compatibility mode - unusual cross-vendor pairing where an Alibaba open-weight model runs inside Anthropic’s scaffold. Practitioners have verified this works
  • Alibaba’s open-weight coding release cadence is the most consistent of any vendor in the category - Qwen3-Coder 480B (July 2025), Qwen3-Coder 30B, Qwen3-Coder-Next (February 2026), plus Qwen3.5-397B and Qwen3.6 Plus on the general line

Cons

  • Not in LMArena Code’s top 18. Human-preference blind voting does not rank Qwen3-Coder-Next with frontier closed models. The gap is real if raw quality is your priority - Qwen’s value is runs-locally and Apache-2.0, not top-of-leaderboard
  • 3B active parameters puts a ceiling on peak quality. The MoE pattern that makes it locally viable also caps how much it can “think” at once compared to 40B-active GLM-5.1 or frontier closed models
  • Pricing on hosted API endpoints is not cleanly published in English. You’ll need to check OpenRouter, Together, Fireworks, or Alibaba Model Studio for current rates
  • Qwen Code CLI, Alibaba’s own coding harness, is less mature than Claude Code or OpenAI Codex CLI. For the native experience, use Cline, Aider, or OpenHands instead
  • English-language community signal is thinner than the Chinese-language conversation. Most hands-on reports are on r/LocalLLaMA; wider Western coverage is sparser

Pricing

Plan | Price | What’s Included
Self-host | Free (compute only) | Full model on Hugging Face under Apache 2.0
OpenRouter / Together / Fireworks | Market rates | Third-party hosted access, varies by provider
Alibaba Cloud Model Studio | Varies | Official hosted API

Availability

Hugging Face (weights), Alibaba Cloud Model Studio, OpenRouter, Together, Fireworks, DeepInfra. Works with: Cline (Ollama local path is recommended), Aider, Continue, OpenHands, Roo Code, Kilo Code, Goose, Claude Code (via compatibility mode).

Who It’s For (and Who Should Skip It)

Qwen3-Coder-Next is the right pick for self-hosters, privacy-first users, and anyone whose main constraint is “runs on my workstation without hitting the cloud.” The Apache 2.0 license makes it the safest open-weight choice for shipping in commercial products, and the local-hardware viability makes it the practical everyday self-host model that r/LocalLLaMA actually deploys. Skip it if raw quality is your top priority - frontier closed models are still ahead, and GLM-5.1 is a stronger open-weight pick when hardware isn’t the constraint. Skip it if you need the cheapest hosted API - DeepSeek V3.2 at $0.28 input / $0.42 output per M tokens is the cost floor. Skip it if your main languages are outside Qwen’s coverage sweet spot and you prefer a commercially validated alternative - Kimi K2.5 is worth a second look. Try Qwen3-Coder-Next

10. DeepSeek V3.2: Best open-weight cost floor

DeepSeek V3.2 is the cheap, available-everywhere, just-works baseline. The deepseek-chat API endpoint has served V3.2 since September 2025 (even though the official weights on Hugging Face still say V3), and the pricing is $0.28 input / $0.42 output per M tokens, with cache-hit pricing dropping to $0.028 on input - essentially free on repeated prefixes. It’s the de facto baseline every cost-sensitive open-weight deployment is benchmarked against, and it shows up on Aider’s Polyglot leaderboard at 70.2% for the V3.2-Exp Chat variant, which is the strongest cost-adjusted score on that leaderboard. The caveat to name immediately is that DeepSeek V3.2 is absent from LMArena Code’s top 18. Blind human-preference voting does not rank it with frontier models. That is not a contradiction with the Aider number - it means DeepSeek is a “cheap, correct on most isolated tasks” model that does not win on the harder multi-step agentic work where LMArena’s voters live. Pick it for cost and ubiquity, not for peak quality. Also worth knowing: DeepSeek V4 is expected within weeks per Reuters reporting from April 2026, and when it ships, this recommendation will likely shift.
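To see what cache-hit pricing does in practice, it helps to run the per-task arithmetic for an agent loop that resends a large shared prefix on every call. The rates below come from this section; the workload numbers (call count, prefix size) are hypothetical:

```python
def task_cost(calls: int, prefix_tokens: int, fresh_in_tokens: int,
              out_tokens: int, in_rate: float, cached_rate: float,
              out_rate: float) -> float:
    """Dollar cost of one multi-call task; rates are $ per 1M tokens.

    The shared prefix is billed at the full input rate on the first call
    and at the cache-hit rate on every subsequent call.
    """
    first = (prefix_tokens + fresh_in_tokens) * in_rate
    rest = (calls - 1) * (prefix_tokens * cached_rate + fresh_in_tokens * in_rate)
    out = calls * out_tokens * out_rate
    return (first + rest + out) / 1e6

# 30-call agent loop, 20K-token shared system prompt, 2K fresh input
# and 1K output per call, at DeepSeek's posted rates.
with_cache = task_cost(30, 20_000, 2_000, 1_000, 0.28, 0.028, 0.42)
no_cache = task_cost(30, 20_000, 2_000, 1_000, 0.28, 0.28, 0.42)
```

On this made-up workload the cached run comes out to roughly a quarter of the uncached price (about $0.05 vs $0.20 per task); the larger the shared prefix relative to fresh input, the more the $0.028 rate dominates.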

Key Features

  • $0.28 input / $0.42 output per M tokens - the cheapest frontier-adjacent model we tested
  • Cache-hit pricing at $0.028/M input - effectively free on repeated prefixes and large shared contexts
  • Aider Polyglot 70.2% on V3.2-Exp - best cost-adjusted open-weight score on the Aider leaderboard
  • Available on every major third-party provider: Together, Fireworks, AWS Bedrock, OpenRouter, DeepInfra - supply redundancy is real
  • Native Aider and Cline support with diff edit format well-tested for DeepSeek response patterns
  • 128K context, 671B MoE with 37B active parameters (V3 architecture), text-only

Pros

  • The cheapest frontier-adjacent model in the shortlist on input, and the cheapest on output at this context tier. Cache-hit pricing at $0.028/M input is genuinely distinctive if your workflow has large shared system prompts
  • Ubiquitous third-party hosting means supply and uptime redundancy. If one provider degrades, five others are running the same weights
  • Aider Polyglot 70.2% on a consistent, model-isolated harness is the strongest benchmark we have for what DeepSeek can do without scaffold help
  • Open weights (MIT on code, DeepSeek License on weights) enable self-hosting and commercial redistribution - with some license conditions to read before shipping
  • De facto baseline status means every r/LocalLLaMA comparison, every Aider benchmark run, and every “cheap open-weight” article starts with DeepSeek. You’re picking something you can compare everything else against

Cons

  • Absent from LMArena Code’s top 18. On the human-preference agentic-coding leaderboard, DeepSeek V3.2 is not in the conversation with frontier models. Use for cost, not for peak quality
  • Max output is 8,192 tokens - a hard ceiling that bites on long-form code generation and sustained agent loops
  • Text-only. No vision, no multimodal input. For modern workflows that include screenshots of bugs or PDF documentation, this is a hard limitation
  • The weights license has use-based restrictions (DeepSeek License v1.0) that are not as permissive as Apache 2.0 or MIT. Read the license carefully before shipping products that include the weights
  • DeepSeek V4 is expected within weeks. V3.2 risks looking stale by the time you finish reading this review. Commit to rechecking when V4 ships
  • Naming is confusing: the deepseek-chat API serves V3.2 even though the published weights still say V3, and deepseek-reasoner serves V3.2-thinking, not the original R1. If you’re confused, you’re not alone

Pricing

Plan | Price | What’s Included
DeepSeek API - cache miss | $0.28 input / $0.42 output per M tokens | 128K context, text, served as V3.2 on the deepseek-chat endpoint
DeepSeek API - cache hit | $0.028 input / $0.42 output per M tokens | Effectively free on repeated prefixes
Self-host | Free (compute only) | Weights on Hugging Face under DeepSeek License
Third-party hosting | Varies | Together, Fireworks, Bedrock, OpenRouter, DeepInfra - shop by price

Availability

DeepSeek API, Hugging Face (open weights), third-party providers (Together, Fireworks, AWS Bedrock, OpenRouter, DeepInfra). Works with: Aider (native recommended model), Cline (direct DeepSeek provider), Continue, OpenHands, Goose, Plandex, Roo Code, Kilo Code.

Who It’s For (and Who Should Skip It)

DeepSeek V3.2 is the right pick for cost-sensitive deployments, batch coding workloads, and anyone using Aider whose pricing math is dominated by token cost. It’s also the right baseline to benchmark any “cheap open-weight alternative” claim against - if something is meant to replace DeepSeek V3.2, it needs to either be cheaper or measurably better. Skip it if peak code quality matters - frontier closed models are meaningfully ahead. Skip it if you need vision or multimodal input - DeepSeek is text-only. Skip it if your context regularly exceeds 128K or output needs exceed 8K - the hard ceilings bite. Skip it if you need a permissively-licensed open-weight alternative for commercial redistribution - Qwen3-Coder-Next is cleaner under Apache 2.0. And keep an eye on DeepSeek V4 - when it ships, revisit this pick. Try DeepSeek V3.2

Selection Guide

Match the situation to the model:
  • If you want the best overall coding quality and can afford the premium → Claude Opus 4.6
  • If you’re a professional engineer picking one daily driver → Claude Sonnet 4.6
  • If you need reliability and sustained all-day use → GPT-5.4 with the new $100 Codex Pro tier
  • If your work is multimodal (screenshots, PDFs, design mocks) → Gemini 3.1 Pro
  • If you’re on a tight budget and want cheap frontier-adjacent performance → Gemini 3 Flash
  • If you’re running high-throughput parallel workloads inside the Claude ecosystem → Claude Haiku 4.5
  • If you’re building an unattended agent that runs for hours and you want open weights → GLM-5.1
  • If your workflow is front-end or visual debugging and you want open weights → Kimi K2.5
  • If you’re self-hosting locally on workstation hardware → Qwen3-Coder-Next
  • If cost-per-token is the entire story → DeepSeek V3.2

How We Tested

We evaluated 50+ coding-capable LLMs and selected 10 for this guide. We don’t use affiliate links, accept sponsorships, or take any form of payment from model vendors. Our recommendations are based on published benchmark data, practitioner community signal, our own hands-on testing, and the leaderboards that ranked models blind against each other.

Selection Criteria

  • Coding capability in a real harness - a benchmark score is only useful if it reflects what you actually experience in Cursor, Claude Code, Codex, or Cline. We weighted LMArena Code’s human-preference voting most heavily because it explicitly tests agentic coding tasks with multi-step reasoning and tool use.
  • Scaffold-controlled comparison - we tracked which scores came from vendor-specific harnesses and which came from standardized setups like Aider Polyglot. Single-scaffold apples-to-apples comparisons are rarer than they should be, and we flag that throughout.
  • Practitioner signal - Reddit threads on r/ClaudeCode, r/cursor, r/LocalLLaMA, and r/ChatGPTCoding, Hacker News launch threads, and independent analyst coverage from Simon Willison, Nathan Lambert, and Latent Space all fed into the picks. We weighted hands-on practitioner reports more heavily than launch-day marketing.
  • Cost per task - we tracked pricing not just on a per-token basis but on what it costs to complete representative workflows. Cheap models with high retry rates are not actually cheap.

How We Compared

We used LMArena Code (224,709 blind votes across 59 models as of April 1, 2026) as the primary “model vs model” comparison because it is the largest human-preference agentic coding benchmark, because its methodology is public, and because it explicitly labels vendor-harness couplings. We cross-referenced with SWE-Bench Verified and SWE-Bench Pro scores with the scaffold explicitly noted (see the “What You Need to Know” section below for why the scaffold matters), Aider’s Polyglot leaderboard for model-isolated testing where 2026 models have been added, Artificial Analysis’s Intelligence Index for aggregate comparison, and OSWorld-Verified for computer-use capability where relevant. We read the major Hacker News launch threads for each model (Claude Opus 4.6 at 2346 points, GPT-5.4 at 1019, Gemini 3.1 Pro at 963, Qwen3-Coder-Next at 735, GLM-5.1 at 617, Kimi K2.5 at 502) and the highest-karma Reddit comparison threads for each shortlist model. For each shortlist model we also verified vendor pricing, context, capabilities, and harness support against first-party documentation.

Models We Left Out (and Why)

These came up in the research and didn’t make the cut. We name them because search intent brings readers here looking for them.
  • Grok 4.20 and Grok Code Fast 1 - Grok 4.20 is strong on LMArena Text at #5 but absent from LMArena Code’s top 18 entirely. “Best LLM for coding” is not the same question as “best LLM,” and Grok 4.20 is an argument for that distinction, not a pick for this category. Grok Code Fast 1 is a coding-specialty model from August 2025 that was credible at launch but has been eclipsed by the February and March 2026 wave.
  • Meta Muse Spark - Meta’s first Superintelligence Labs model is #4 on LMArena Text but absent from Code’s top 18. Artificial Analysis’s own coverage said Muse Spark “continues to struggle with coding and agentic functionality” at launch. If Meta ships a coding-targeted variant later in 2026, we’ll revisit.
  • Llama 4 Maverick and Llama 4 Scout - minimal 2026 coding signal. Maverick scored 16% on Aider Polyglot in April 2025 and has not resurfaced in practitioner conversations. Meta appears to have stopped investing in Llama as a flagship coding story.
  • Cursor Composer 2 - technically a model (Kimi K2.5 with continued pretraining and RL on top), but Cursor-IDE-only. Can’t recommend a model nobody can use outside a specific IDE. We discuss Composer 2 in the Kimi K2.5 entry because Kimi is the base model Cursor picked.
  • Claude Mythos Preview - Anthropic’s Project Glasswing frontier model, reportedly scoring 93.9% on SWE-Bench Verified and autonomously finding zero-days in OpenBSD, FFmpeg, and the Linux kernel. Anthropic has explicitly stated they do not plan to release Mythos Preview for general use. If you can’t use it, it can’t be on the list. The practitioner reaction has also been mixed - several high-visibility X posts characterized the launch-day demos as overblown.
  • MiniMax M2.5 / M2.7 - third-party benchmarks claim SWE-Bench Verified 80.2% for M2.5, which would be competitive with Opus 4.6. The methodology is not independently verified and English-language practitioner signal is thin. If independent benchmarks emerge, we’ll revisit.
  • DeepSeek R1, R2, V4 - R1 is superseded for coding by V3.2. R2 is 3rd-party-reported and lacks a public technical report. V4 is expected “within weeks” per Reuters reporting from April 2026 but had not shipped when we finished this guide. We’ll update when it does.
  • Qwen3-Coder 480B and Qwen 3.5 / 3.6 Plus Preview - Qwen3-Coder 480B is too heavy to self-host for most readers. Qwen 3.5-397B is general-purpose (not coding-specialized), and Qwen 3.6 Plus Preview is a proprietary preview. Qwen3-Coder-Next at 80B/3B-active covers the Alibaba slot more practically.
  • Mistral Large 3, Codestral, Devstral 2 - zero coding-LLM community signal in our sweep. Codestral is a completion/FIM specialist rather than an agentic coder. Devstral 2 Medium and Small 2 are credible but the English-language practitioner adoption is thin.
  • Amazon Nova Premier / Nova Pro, Cohere Command A, Microsoft Phi-4, IBM Granite 4, NVIDIA Nemotron 3 - all real models, all behind the 2026 wave on coding specifically. Better enterprise, agent, or small-model stories than coding stories.
  • OpenAI o1 / o3 / o3-pro / o4-mini - retired from ChatGPT on February 13, 2026. OpenAI folded reasoning into GPT-5.4. If you were planning to use the o-series for coding, use GPT-5.4 with high reasoning effort instead, or GPT-5.4 Pro for the highest ceiling.

Adjacent Categories

This is a guide to coding LLMs - the model that generates the code. The agent harnesses that run those models (Claude Code, OpenAI Codex CLI, Cursor, Cline, Aider, Windsurf, Zed, OpenHands, Goose, Kilo Code) are covered in our separate agentic coding tools roundup. Web-based “vibe coding” tools like Bolt, v0, Lovable, and Replit Agent route to proprietary model stacks and are covered in our AI prototyping tools guide. Autocomplete-only products like Tabnine, Supermaven, and legacy Codeium are a different technical category than agentic coding LLMs.

What You Need to Know Before Using Coding LLMs

Three practical things we wish we’d known before we started testing. All are category-wide, not specific to any one model.

Benchmark numbers are scaffold-dependent

When a vendor quotes “Model X scored Y% on SWE-Bench Verified,” the first question to ask is “whose scaffold?” Meta and Harvard’s Confucius Code Agent running Claude Sonnet 4.5 scored 52.7% on SWE-Bench Pro, beating Claude Opus 4.5 on Anthropic’s own scaffold at 52.0% - a smaller Sonnet beat a bigger Opus entirely because of the harness. Scaffold swaps alone move SWE-Bench Verified scores 11 to 15 points on the same model. LMArena explicitly labels OpenAI’s entries as (codex-harness) to acknowledge this coupling. The practical takeaway: when you see a benchmark score from a vendor, assume it’s their own scaffold, and expect 5 to 10 points lower on a neutral harness. If you want apples-to-apples model comparison, read Aider Polyglot for model-isolated testing where 2026 models have been added, or SWE-rebench for contamination-resistant comparison.

Silent model regressions are real

Models with the same name are not always the same model week to week. A cluster of r/ClaudeCode threads in early April 2026 documented Claude Opus 4.6 starting to fail tests it had been passing, including a disclosed “CoT training accident” affecting roughly 8% of recent reinforcement learning on Opus 4.6, Sonnet 4.6, and Mythos. This is not unique to Anthropic - all frontier vendors push updates without version bumps. The practical mitigation: keep a short suite of representative tasks you can rerun against any new model version or provider update. If the tasks that worked last week don’t work this week, you’ll notice before it costs you a bad merge.
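Such a suite can be a handful of prompts with cheap, checkable properties. A sketch, assuming nothing about any particular API - `generate` is whatever function wraps your model call, and the canary prompts and checks here are illustrative:

```python
from typing import Callable

# Each canary pairs a fixed prompt with a cheap property check on the output.
CANARIES = {
    "fizzbuzz": {
        "prompt": "Write a Python function fizzbuzz(n) returning the classic string.",
        "check": lambda out: "def fizzbuzz" in out and "%" in out,
    },
    "sql_escape": {
        "prompt": "Show a parameterized sqlite3 INSERT in Python.",
        "check": lambda out: "?" in out and "execute" in out,
    },
}

def run_canaries(generate: Callable[[str], str]) -> dict:
    """Run every canary prompt; return {name: output snippet} for failed checks.

    Rerun after any model or provider update - a newly failing canary is the
    early-warning signal for a silent regression.
    """
    failures = {}
    for name, case in CANARIES.items():
        out = generate(case["prompt"])
        if not case["check"](out):
            failures[name] = out[:200]  # keep a snippet for debugging
    return failures

# Demo with a fake model that only handles the first prompt well.
def fake(prompt: str) -> str:
    if "fizzbuzz" in prompt:
        return "def fizzbuzz(n):\n    return 'FizzBuzz' if n % 15 == 0 else str(n)"
    return "no idea"

failures = run_canaries(fake)
```

Real checks are better when they execute the generated code against unit tests rather than string-matching it, but even this level of harness catches the "passed last week, fails this week" case.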

Rate limits and capacity are part of the model

Claude has meaningfully worse rate limits and uptime than GPT-5.4. That’s not a benchmark, but it shows up in every sustained-use practitioner report we found. If your workflow depends on a model for 6+ hours of continuous use, run a 15-minute burst test on your heaviest prompts before committing - you’ll find the capacity ceiling before it blocks a deadline. For teams needing guaranteed availability, consider dual-wielding Claude for code quality and GPT-5.4 for reliability, or paying the enterprise-tier SLAs that some vendors offer.
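A 15-minute burst test is just a timed loop that records latency and error rate. A minimal sketch (the `call` argument stands in for your heaviest real prompt against the real endpoint; the demo uses a trivial fake so the snippet runs as-is, and all names are ours):

```python
import statistics
import time
from typing import Callable

def burst_test(call: Callable[[], str], duration_s: float = 900.0,
               max_calls: int = 1000) -> dict:
    """Hammer one heavy prompt for `duration_s` seconds; report latency and errors.

    Rising latency or a nonzero error rate here is the capacity ceiling
    showing up before it blocks a deadline.
    """
    latencies, errors, calls = [], 0, 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline and calls < max_calls:
        calls += 1
        t0 = time.monotonic()
        try:
            call()  # raise on HTTP errors / rate limits in your wrapper
            latencies.append(time.monotonic() - t0)
        except Exception:
            errors += 1
    return {
        "calls": calls,
        "error_rate": errors / calls if calls else 0.0,
        "p50_s": statistics.median(latencies) if latencies else None,
        "max_s": max(latencies) if latencies else None,
    }

# Demo with a fast fake call; point `call` at a real client for the actual test.
report = burst_test(lambda: "ok", duration_s=0.05, max_calls=20)
```

Run it once per candidate model at the same time of day you plan to work; comparing the two reports is the dual-wield decision in numbers rather than vibes.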

Frequently Asked Questions

Which LLM should I use for coding right now?

For most professional engineers, start with Claude Sonnet 4.6 inside Cursor Auto mode or Claude Code. Sonnet is within one point of SWE-Bench Verified frontier quality at 60% of the Opus price, and Cursor’s Auto mode routes to it automatically without requiring you to pick a model for every task. If your workflow is iterative agent loops with lots of test running, GPT-5.4 through OpenAI Codex CLI is the alternative with better reliability under sustained load. If you’re exploring on a strict zero-cost budget, Gemini 3 Flash through Google AI Studio’s free tier is the only real frontier-family free path.
Is Claude Opus 4.6 worth the premium over Sonnet 4.6?

For most workflows, no - Sonnet 4.6 is within one SWE-Bench Verified point of Opus at 60% of the cost, and the Cline power-user pattern is to plan with Opus and execute with Sonnet rather than running everything through Opus. Where Opus earns the premium is on genuinely ambiguous or multi-file work where asking a clarifying question saves more time than it costs - practitioners consistently report that Opus asks the right question where other models just pick an interpretation and run. Start with Sonnet and upgrade to Opus when you hit Sonnet’s ceiling on specific hard problems.
Can I use open-weight models in Cursor or Claude Code?

Partially. Cursor supports a narrow set of closed and hosted models through its first-class integration - Kimi K2.5 is supported because Cursor Composer 2 was built on it, but GLM-5.1 and the full Qwen3-Coder line are not first-class Cursor models. Claude Code is Claude-only by design. For open-weight models, the native harnesses are Cline, Aider, OpenHands, Roo Code, Kilo Code, Continue, and Goose - all support bring-your-own-key or local Ollama routing, and OpenHands in particular is where open-weight models come closest to frontier closed model performance. Qwen3-Coder-Next specifically works inside Claude Code via compatibility mode (point Claude Code at an OpenAI-compatible endpoint) if you’re willing to tolerate a slightly hacky setup.
What’s the cheapest way to get frontier-level coding?

For paid subscriptions, the $100/mo ChatGPT Pro tier OpenAI introduced on April 10, 2026 is the cheapest “heavy use” frontier coding subscription - 5x the Codex usage of the Plus tier for half the old Pro price. For API usage billed per token, Gemini 3 Flash at $0.50 input / $3 output is the cheapest frontier-adjacent model with full 1M context and multimodal input. For open-weight and self-hosting, DeepSeek V3.2 at $0.28/$0.42 per M tokens is the cost floor, and Qwen3-Coder-Next is the cheapest workstation-runnable option. If you need frontier-level quality specifically, there is no free lunch - Claude Opus 4.6 at $5/$25 or GPT-5.4 at $2.50/$15 are the actual prices.
Which benchmark or leaderboard should I trust?

The LMArena Code leaderboard is the single best “apples to apples” coding comparison we found in April 2026. It’s human-preference blind voting on agentic coding tasks, it tags which harness each vendor’s entry uses, and it has hundreds of thousands of votes across ~60 models - the sample size is large enough that the rankings are stable. For model-isolated comparison (the same harness running every model), Aider Polyglot is the gold standard, though as of April 2026 Paul Gauthier has not yet added Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, GLM-5.1, or Qwen3-Coder-Next to the leaderboard. For contamination-resistant testing, SWE-rebench is the community-preferred replacement for SWE-Bench Verified. Vendor-reported SWE-Bench Verified numbers are useful as an upper bound: assume they’re the vendor’s own scaffold and mentally knock 5 to 10 points off for a neutral harness.
How quickly will these picks go out of date?

This category moves fast enough that any specific pick carries an “as of April 12, 2026” qualifier. DeepSeek V4 is expected within weeks per Reuters reporting. Anthropic is almost certainly working on a Haiku 4.6 and a Claude 4.7 cycle. OpenAI’s naming suggests more GPT-5.x releases. We update this guide as the category moves, and the picks that are most likely to shift are DeepSeek (as V4 lands), Qwen (as Alibaba’s coding line keeps iterating), and the open-weight long-horizon category where GLM is moving fastest. If you’re picking a model today for a three-month project, Claude Sonnet 4.6 and GPT-5.4 are the most stable bets - they’ve been production-grade for roughly two months and any replacements will ship alongside them, not in place of them.
We update this guide regularly as new models launch and the leaderboards move. If you’re still unsure, Claude Sonnet 4.6 through Cursor Auto mode or Claude Code is the safest starting point for most professional engineers, with GPT-5.4 in OpenAI Codex CLI as the reliability-first alternative. Questions or suggestions? Let us know.