Requesty

AI model rankings

Ranked lists of the best AI models across the dimensions that actually matter: overall intelligence, coding ability, reasoning, math, tool use, knowledge, price, and context window. Each list is backed by published benchmarks or live pricing data.

Smartest overall

Ranked by Intelligence Index

The Artificial Analysis Intelligence Index combines many independent evaluations (reasoning, coding, math, science, agentic tool use) into a single capability score. It is a useful at-a-glance signal for overall model capability, but always validate on your own workload: the best model for a specific task is often not the one with the highest overall score.
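To make the "highest overall is not always best for your task" point concrete, here is a minimal sketch. It assumes an equal-weight mean over 0-100 scores, which is an illustrative choice and not necessarily how the Intelligence Index is actually weighted. The model names and scores are hypothetical.

```python
from statistics import mean

# Hypothetical per-evaluation scores on a 0-100 scale; these are not real results.
scores = {
    "model-x": {"reasoning": 80, "coding": 60, "math": 85, "science": 75, "tool_use": 70},
    "model-y": {"reasoning": 70, "coding": 90, "math": 65, "science": 70, "tool_use": 72},
}


def composite_index(evals):
    """Equal-weight mean across evaluations (an assumed weighting, for illustration)."""
    return mean(evals.values())


# Model with the highest composite score.
overall_best = max(scores, key=lambda name: composite_index(scores[name]))

# Model with the highest score on the one evaluation we care about.
coding_best = max(scores, key=lambda name: scores[name]["coding"])

print("best overall:", overall_best)
print("best at coding:", coding_best)
```

With these made-up numbers the two answers differ, which is why a composite ranking is a starting point for a shortlist, not a final choice.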

Best for coding

Ranked by Coding Index

The Coding Index blends multiple coding evaluations: code generation designed to avoid training-data contamination (LiveCodeBench), research-level scientific coding (SciCode), and agentic terminal tasks (Terminal-Bench). Because it spans several task types, it is a broader, harder-to-game signal than any single coding benchmark.

Best coding agent

Ranked by Terminal-Bench Hard

Terminal-Bench Hard measures how well a model operates as a coding agent in a real terminal: running commands, editing files, and fixing repositories end-to-end. Of the benchmarks listed here, it is the closest proxy for how models perform inside tools like Claude Code, Cursor, and Codex.

Best for reasoning

Ranked by GPQA Diamond

GPQA Diamond is a set of graduate-level science questions written by domain experts and filtered so that skilled non-experts (people with PhD-level training in other fields) still struggle, even with internet access. It's the most reliable signal we have for "does this model actually reason" vs. "is it pattern-matching training data".

Best at math

Ranked by Math Index

The Math Index aggregates competition and advanced math evaluations (including AIME). These problems require symbolic reasoning across multiple novel steps, so memorization alone gets a model very little.

Best for tool use

Ranked by τ²-Bench

τ²-Bench measures multi-turn agentic tool use: calling functions, following policies, and completing realistic tasks over many turns. If you are building agents or tool-calling workflows, this predicts real-world reliability better than single-shot benchmarks.

Best for knowledge

Ranked by MMLU Pro

MMLU Pro tests broad knowledge across academic and professional subjects, with harder and more reasoning-heavy questions than the original MMLU. A strong score indicates wide, reliable factual coverage.

Cheapest

Lowest input + output price per 1M tokens

Ranked by combined input + output price per million tokens (excluding free-tier models). These are production-ready models that punch well above their price point. They make great defaults when cost matters, provided you test model quality on your own workload.
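The ranking rule described here (sum the input and output price per 1M tokens, drop free-tier models, sort ascending) can be sketched in a few lines. The `ModelPrice` type, the model names, and the prices are hypothetical, and treating "free tier" as a combined price of zero is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

    @property
    def combined(self) -> float:
        """Combined input + output price per 1M tokens."""
        return self.input_per_million + self.output_per_million


def rank_cheapest(models):
    """Drop free-tier models (assumed here to mean combined price 0), cheapest first."""
    paid = [m for m in models if m.combined > 0]
    return sorted(paid, key=lambda m: m.combined)


# Hypothetical catalogue; names and prices are made up for illustration.
catalogue = [
    ModelPrice("model-a", 0.50, 1.50),
    ModelPrice("model-b", 0.10, 0.40),
    ModelPrice("model-free", 0.0, 0.0),
    ModelPrice("model-c", 3.00, 15.00),
]

ranked = rank_cheapest(catalogue)
ranked_names = [m.name for m in ranked]
print(ranked_names)
```

A simple sum weights input and output tokens equally. If your workload is heavily skewed toward one side (for example, long prompts with short answers), a weighted sum based on your own token mix may rank models differently.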

Longest context

Max tokens in a single prompt

A larger context window means you can fit more tokens in a single prompt, which is useful for whole-codebase analysis, long-document Q&A, and agentic workflows. Note: effective quality often degrades past 128K tokens. When the same long context is reused across calls, prompt caching (supported on many models) is usually a better approach than paying full price to resend it every time.
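As a rough illustration of why caching helps with repeated long context, the sketch below compares input cost with and without a cache. The pricing model is an assumption: the first call writes the shared prefix at `cache_write_multiplier` times the normal input price, and later calls read it at `cache_read_multiplier` times that price. Both multipliers and all the numbers are hypothetical, and real providers price caching differently, so check your provider's pricing before relying on it.

```python
def input_cost_usd(prefix_tokens, fresh_tokens, calls, price_per_million,
                   cache_read_multiplier=None, cache_write_multiplier=1.0):
    """Estimate total input cost for `calls` requests sharing one long prefix.

    With cache_read_multiplier=None, every call pays full price for the prefix.
    Otherwise the first call writes the prefix to the cache and the remaining
    calls read it at a discounted rate (hypothetical pricing model).
    """
    if calls <= 0:
        return 0.0
    fresh_total = fresh_tokens * calls
    if cache_read_multiplier is None:
        billed_prefix = prefix_tokens * calls
    else:
        billed_prefix = prefix_tokens * (
            cache_write_multiplier + (calls - 1) * cache_read_multiplier
        )
    return (billed_prefix + fresh_total) * price_per_million / 1_000_000


# Hypothetical workload: a 100K-token shared prefix, 1K new tokens per call, 20 calls.
uncached = input_cost_usd(100_000, 1_000, 20, price_per_million=1.0)
cached = input_cost_usd(100_000, 1_000, 20, price_per_million=1.0,
                        cache_read_multiplier=0.1, cache_write_multiplier=1.25)
print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
```

Under these made-up numbers the cached run costs roughly a sixth of the uncached one. The saving grows with the number of calls and the size of the shared prefix, and shrinks if the prefix changes often enough to invalidate the cache.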