AI model rankings
Ranked lists of the best AI models across the dimensions that actually matter — coding ability, reasoning, math, price, and context window. Each list is backed by published benchmarks or live pricing data.
Smartest overall
Ranked by Intelligence Index
The Artificial Analysis Intelligence Index combines many independent evaluations (reasoning, coding, math, science, agentic tool use) into a single capability score. It is the best at-a-glance signal for raw model intelligence, but always validate on your own workload, since the best model for a specific task is often not the one with the highest overall score.
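As a rough illustration of how a composite index can hide task-level differences, here is a minimal sketch that averages per-benchmark scores with equal weights. The model names, scores, and equal weighting are all assumptions made up for this example; the actual Intelligence Index methodology is not described here and may weight or normalize evaluations differently.

```python
from statistics import mean


def composite_index(scores: dict[str, float]) -> float:
    """Equal-weighted mean of per-benchmark scores (assumed to share a 0-100 scale)."""
    if not scores:
        raise ValueError("need at least one benchmark score")
    return mean(list(scores.values()))


# Hypothetical models and scores, not real benchmark results.
models = {
    "model-a": {"reasoning": 80.0, "coding": 60.0, "math": 70.0},
    "model-b": {"reasoning": 70.0, "coding": 85.0, "math": 64.0},
}

# Overall ranking by the composite score, highest first.
ranked = sorted(models, key=lambda name: composite_index(models[name]), reverse=True)

# Ranking on a single dimension can disagree with the overall ranking.
best_reasoner = max(models, key=lambda name: models[name]["reasoning"])
```

In this made-up data, `model-b` ranks first overall while `model-a` is the stronger reasoner, which is the situation the advice about validating on your own workload is meant to catch.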
Best for coding
Ranked by Coding Index
The Coding Index blends multiple coding evaluations — contamination-free code generation (LiveCodeBench), research-level scientific coding (SciCode), and agentic terminal tasks (Terminal-Bench). It is a broader, harder-to-game signal than any single coding benchmark.
Best coding agent
Ranked by Terminal-Bench Hard
Terminal-Bench Hard measures how well a model operates as a coding agent in a real terminal — running commands, editing files, and fixing repositories end-to-end. It is the closest proxy to how models perform inside tools like Claude Code, Cursor and Codex.
Best for reasoning
Ranked by GPQA Diamond
GPQA Diamond is a set of graduate-level science questions written by domain experts and filtered so that skilled non-experts with unrestricted web access still struggle. It's among the most reliable signals we have for "does this model actually reason" vs "is it pattern-matching training data".
Best at math
Ranked by Math Index
The Math Index aggregates competition and advanced math evaluations (including AIME). These problems require genuine multi-step symbolic reasoning, so memorization alone does not get a model far.
Best for tool use
Ranked by τ²-Bench
τ²-Bench measures multi-turn agentic tool use: calling functions, following policies, and completing realistic tasks over many turns. If you are building agents or tool-calling workflows, this predicts real-world reliability better than single-shot benchmarks.
Best for knowledge
Ranked by MMLU Pro
MMLU Pro tests broad knowledge across academic and professional subjects with harder, reasoning-heavy questions than the original MMLU. A strong score indicates wide, reliable factual coverage.
Cheapest
Lowest input + output price per 1M tokens
Ranked by combined input + output price per million tokens (excluding free-tier models). These are production-ready models that punch well above their price point — great defaults when cost matters and you can test model quality on your own workload.
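The ranking rule described above can be sketched in a few lines. This is a minimal illustration, not the site's implementation: the `ModelPrice` type, the model names, and the prices are hypothetical, and treating "free-tier" as a combined price of zero is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    name: str
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens

    @property
    def combined(self) -> float:
        """Input price plus output price, per 1M tokens each."""
        return self.input_per_1m + self.output_per_1m


def rank_by_combined_price(models: list[ModelPrice]) -> list[ModelPrice]:
    """Cheapest first, skipping models whose combined price is zero (assumed free tier)."""
    paid = [m for m in models if m.combined > 0]
    return sorted(paid, key=lambda m: m.combined)


# Hypothetical catalog, not live pricing data.
catalog = [
    ModelPrice("large-model", 3.0, 15.0),
    ModelPrice("free-model", 0.0, 0.0),
    ModelPrice("small-model", 0.25, 1.25),
    ModelPrice("mid-model", 0.5, 1.5),
]

cheapest_first = [m.name for m in rank_by_combined_price(catalog)]
```

A simple sum weights input and output equally. If your workload is heavily skewed toward input tokens (long prompts, short answers) or the reverse, a weighted sum that reflects your own token mix may rank models differently.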
Longest context
Max tokens in a single prompt
A larger context window means more tokens you can fit in a single prompt, which is useful for whole-codebase analysis, long document Q&A, and agentic workflows. Note: effective quality often degrades past 128K tokens. For repeated long context, prompt caching (supported on many models) is usually cheaper and faster than resending the full prompt at full price on every call.
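To make the caching trade-off concrete, here is a back-of-the-envelope sketch of what a shared long prefix costs across repeated calls. The function, the prices, and the discount are hypothetical: it assumes the first call pays the full input price and later calls pay a discounted cached-read price, and it ignores any cache-write surcharge or cache expiry, both of which vary by provider.

```python
def repeated_prefix_cost(
    prefix_tokens: int,
    calls: int,
    input_price_per_1m: float,
    cached_read_discount: float = 0.0,
) -> float:
    """USD spent on a shared prompt prefix across `calls` requests.

    cached_read_discount is the assumed fraction taken off the input price
    for cached reads (0.0 means no caching, 0.9 means 90% cheaper).
    """
    if calls <= 0:
        return 0.0
    full_price = prefix_tokens / 1_000_000 * input_price_per_1m
    cached_price = full_price * (1.0 - cached_read_discount)
    # First call populates the cache at full price; the rest read from it.
    return full_price + (calls - 1) * cached_price


# Hypothetical numbers: a 200K-token prefix reused across 10 calls at $2 per 1M input tokens.
without_cache = repeated_prefix_cost(200_000, 10, 2.0)
with_cache = repeated_prefix_cost(200_000, 10, 2.0, cached_read_discount=0.9)
```

Under these assumed numbers the cached run costs well under a quarter of the uncached one. Check your provider's actual cached-read and cache-write pricing before relying on an estimate like this.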
