Best AI models for agentic coding
Terminal-Bench Hard measures how well a model operates as a coding agent in a real terminal — running commands, editing files, and fixing repositories end-to-end. It is the closest proxy to how models perform inside tools like Claude Code, Cursor and Codex.
- 🥇
gpt-5.5OpenAI Inc.·$5.00 / $30.00 per 1M60.6%60.6% - 🥈claude-opus-4-8Anthropic PBC·$5.00 / $25.00 per 1M58.3%58.3%
- 🥉
gpt-5.4OpenAI Inc.·$2.50 / $15.00 per 1M57.6%57.6% - 4
gemini-3.1-pro-previewGoogle LLC (Gemini API)·$2.00 / $12.00 per 1M53.8%53.8% - 5claude-sonnet-4-6Anthropic PBC·$3.00 / $15.00 per 1M53.0%53.0%
- 6
gpt-5.3-codexOpenAI Responses·$1.75 / $14.00 per 1M53.0%53.0% - 7
gpt-5.4-miniOpenAI Inc.·$0.75 / $4.50 per 1M52.3%52.3% - 8claude-opus-4-7Anthropic PBC·$5.00 / $25.00 per 1M51.5%51.5%
- 9
qwen3.7-maxAlibaba Cloud·$2.50 / $7.50 per 1M50.8%50.8% - 10claude-opus-4-5Anthropic PBC·$5.00 / $25.00 per 1M47.0%47.0%
- 11
gpt-5.2-chatOpenAI Inc.·$1.75 / $14.00 per 1M47.0%47.0% - 12claude-opus-4-6Anthropic PBC·$5.00 / $25.00 per 1M46.2%46.2%
- 13
deepseek-v4-proDeepSeek·$0.43 / $0.87 per 1M46.2%46.2% - 14
gpt-5.1OpenAI Inc.·$1.25 / $10.00 per 1M45.5%45.5% - 15
kimi-k2.6Moonshot AI·$0.95 / $4.00 per 1M43.9%43.9% - 16
qwen3.6-plusAlibaba Cloud·$0.50 / $3.00 per 1M43.9%43.9% - 17
GLM-5.1Z AI·$1.40 / $4.40 per 1M43.2%43.2% - 18
GLM-5Z AI·$1.00 / $3.20 per 1M43.2%43.2% - 19
XiaomiMiMo/MiMo-V2.5-ProDeepInfra Inc.·$1.00 / $3.00 per 1M43.2%43.2% - 20
gpt-5.4-nanoOpenAI Inc.·$0.20 / $1.25 per 1M42.4%42.4% - 21minimax-m3MiniMax·$0.30 / $1.20 per 1M42.4%42.4%
- 22
gemini-3-pro-previewGoogle LLC (Gemini API)·$2.00 / $12.00 per 1M41.7%41.7% - 23
gemini-3.5-flashGoogle LLC (Vertex AI)·$1.50 / $9.00 per 1M40.9%40.9% - 24
qwen/qwen3.5-397b-a17bNovita AI·$0.60 / $3.60 per 1M40.9%40.9% - 25
xiaomimimo/mimo-v2-proNovita AI·$2.00 / $6.00 per 1M40.9%40.9% - 26MiniMax-M2.7MiniMax·$0.30 / $1.20 per 1M39.4%39.4%
- 27
gemini-3-flash-previewGoogle LLC (Gemini API)·$0.50 / $3.00 per 1M38.6%38.6% - 28
gpt-5-codexOpenAI Responses·$1.25 / $10.00 per 1M37.9%37.9% - 29grok-4xAI Corp.·$3.00 / $15.00 per 1M37.9%37.9%
- 30grok-4.3xAI Corp.·$1.25 / $2.50 per 1M37.9%37.9%
Explore other rankings
How we rank
Scores for Terminal-Bench Hard come from Artificial Analysis, an independent AI benchmarking service. When a model is available through multiple providers (e.g. Anthropic direct, AWS Bedrock, Google Vertex), we show one canonical entry per model family so the ranking isn't polluted by duplicates. Benchmarks measure specific skills — always validate on your own workload before committing.
One API for every model on this list
Requesty is OpenAI-compatible and routes to 400+ models. Switch between any of the models above by changing one parameter in your code.
Get started free