LLM Leaderboard 2026
The best large language models, ranked across five signals: LMArena ELO, Aider polyglot coding, SWE-bench Verified, OpenRouter real-world usage, and cost-per-quality. Tracking 367 models from 60 providers.
Last updated: 2026-05-08 · Refreshed daily at 07:00 UTC
Top 10 LLMs by Composite Quality Score
| # | Model | Quality | Arena | Aider | SWE-bench | OpenRouter 7d | $/M in |
|---|---|---|---|---|---|---|---|
| 1 | Google: Gemini 3.1 Flash Lite | — | — | — | — | — | $0.25 |
| 2 | Baidu Qianfan: CoBuddy (free) | — | — | — | — | — | $0.00 |
| 3 | OpenAI: GPT Chat Latest | — | — | — | — | — | $5.00 |
| 4 | xAI: Grok 4.3 | — | — | — | — | — | $1.25 |
| 5 | IBM: Granite 4.1 8B | — | — | — | — | — | $0.05 |
| 6 | Mistral: Mistral Medium 3.5 | — | — | — | — | — | $1.50 |
| 7 | Owl Alpha | — | — | — | — | — | $0.00 |
| 8 | NVIDIA: Nemotron 3 Nano Omni (free) | — | — | — | — | — | $0.00 |
| 9 | Poolside: Laguna XS.2 (free) | — | — | — | — | — | $0.00 |
| 10 | Poolside: Laguna M.1 (free) | — | — | — | — | — | $0.00 |
All Models
| # | Model | Quality | Arena | Aider | SWE-bench | OpenRouter 7d | Ctx (tokens) | $/M in | $/M out | $/Q |
|---|---|---|---|---|---|---|---|---|---|---|
What We Measure
LMArena ELO
Crowd-sourced head-to-head ratings from lmarena.ai. Real users compare two anonymous model responses and vote — the ELO score reflects which model wins more often across millions of matchups. The closest thing to a "general intelligence" ranking that exists.
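For intuition, here is the classic Elo update that this style of rating is built on: each vote nudges the winner up and the loser down in proportion to how surprising the result was. This is a minimal sketch; LMArena's actual fitting procedure is more involved, and the K-factor of 32 is an illustrative assumption.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one head-to-head vote and return the updated ratings."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A 1300-rated model upsetting a 1350-rated one gains ~18 points.
print(elo_update(1300, 1350, a_won=True))
```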
Aider Polyglot
Aider's polyglot benchmark tests whether a model can edit code in 6 languages (Python, Go, Rust, JS, C++, Java) without breaking the build. Score is the percent of edits that pass tests on the first try. The most realistic proxy for "will this model actually ship working code."
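To make the scoring concrete, here is a toy aggregation over hypothetical per-language results (invented numbers, not Aider's actual harness, which runs each exercise's real test suite):

```python
# (passed_first_try, attempted) per language -- numbers are made up.
results = {
    "python": (28, 34),
    "go":     (21, 34),
    "rust":   (17, 34),
    "js":     (24, 34),
    "cpp":    (15, 34),
    "java":   (19, 34),
}
passed = sum(p for p, _ in results.values())
total = sum(n for _, n in results.values())
print(f"polyglot score: {100 * passed / total:.1f}%")  # 60.8%
```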
SWE-bench Verified
Real GitHub issues from open-source repos. The model has to produce a patch that closes the issue and passes the project's own tests. The Verified subset is the human-validated ~500 issues where a competent engineer agrees the issue is well-specified. Hardest of the major coding benchmarks.
OpenRouter Usage
7-day rolling token throughput across the OpenRouter network — the largest neutral LLM gateway. This is a popularity signal, not a quality signal: it reflects which models developers are actually shipping in production.
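Mechanically, a 7-day rolling figure is just a trailing-window sum over daily token counts. Here's a sketch with invented numbers; OpenRouter's internal aggregation isn't published in this form:

```python
from collections import deque

def rolling_7d(daily_tokens: list[float]) -> list[float]:
    """Trailing 7-day token sum for each day in the series."""
    window: deque[float] = deque(maxlen=7)
    out = []
    for tokens in daily_tokens:
        window.append(tokens)  # oldest day falls out automatically
        out.append(sum(window))
    return out

# Hypothetical daily token counts (billions) for one model:
daily = [1.2, 1.4, 1.1, 1.6, 1.9, 2.0, 1.8, 2.3]
print(rolling_7d(daily)[-1])  # most recent 7-day sum (~12.1)
```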
Composite Quality Score
Each metric is normalized to 0–100, then combined in a weighted average: 35% Arena ELO, 25% Aider, 20% SWE-bench, 10% HumanEval, 10% OpenRouter usage. A model only gets a Quality Score if it has data on at least three of the five benchmarks. Cost-per-quality ($/Q) divides blended price (input + 3× output, since output drives real-world bills) by the Quality Score, so lower is better.
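In code, the scheme looks roughly like this. It's a sketch under the weights above; the handling of missing benchmarks (renormalizing weights over whatever is present) and the field names are assumptions, since the page doesn't spell them out:

```python
WEIGHTS = {"arena": 0.35, "aider": 0.25, "swe": 0.20,
           "humaneval": 0.10, "usage": 0.10}

def quality_score(norm: dict[str, float]) -> float | None:
    """Weighted average of normalized 0-100 metrics.

    `norm` maps benchmark name -> normalized score; metrics without
    data are simply absent. Returns None below the 3-of-5 threshold.
    """
    present = {k: v for k, v in norm.items() if k in WEIGHTS}
    if len(present) < 3:
        return None
    w_total = sum(WEIGHTS[k] for k in present)  # renormalize (assumption)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / w_total

def cost_per_quality(quality: float, usd_in: float, usd_out: float) -> float:
    """Blended price (input + 3x output) per quality point; lower is better."""
    return (usd_in + 3 * usd_out) / quality

# Example: normalized Arena 82, Aider 67, SWE-bench 55; no HumanEval
# or usage data. Priced at $1.25/M input, $10/M output.
q = quality_score({"arena": 82, "aider": 67, "swe": 55})
print(q, cost_per_quality(q, usd_in=1.25, usd_out=10.0))  # ~70.56, ~0.44
```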
Frequently Asked Questions
How often is the leaderboard updated?
Daily at 07:00 UTC. OpenRouter usage updates every 24 hours; LMArena ELO usually refreshes 1–2× per week; Aider and SWE-bench update whenever new model results are submitted.
Why is the "best" model different on each benchmark?
Because each benchmark measures a different thing. Arena measures what humans prefer in chat. Aider and SWE-bench measure whether code actually compiles and tests pass. OpenRouter measures what developers ship. A model can dominate one and lag another — this is why a single "best LLM" ranking is misleading.
Where does the data come from?
All sources are public. We pull from openrouter.ai for pricing and usage, lmarena.ai for ELO, the Aider docs for polyglot scores, and swebench.com for SWE-bench Verified.
Which model should I actually use?
For coding agents: sort by Aider or SWE-bench. For chat or writing: sort by Arena. For high-volume production: sort by cost-per-quality. The top of each column is a reasonable default for that workload.
How do you handle new model releases?
New models appear in OpenRouter pricing within hours of launch and show up here on the next daily refresh. Benchmark scores follow once the upstream sites publish results — usually within a week for Arena and 2–4 weeks for Aider and SWE-bench.