What the routing layer measured.
Empirical observations from Requesty's production gateway: latency distributions, caching behaviour, agentic tool-use patterns, and failure taxonomies across every provider we route to. Each dataset ships with methodology, machine-readable exports, and a citation block.
Agentic workloads
11 datasets
finish_reason mix by provider
Which AI providers serve the most agentic traffic? In April 2026 Anthropic-direct returned `finish_reason = tool_calls` on 52% of successful completions on the Requesty gateway, about 2× the next provider and 17× higher than OpenAI direct. OpenAI Responses (26%), Vertex (Claude) (23%) and Azure (23%) formed a clear second tier. Splitting Vertex into Gemini and Claude cohorts shows the gap inside that route: Vertex (Claude) 23% vs Vertex (Gemini) 13%.
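The per-provider share above is a ratio over successful completions. A minimal sketch of that calculation, assuming hypothetical record fields `provider`, `finish_reason` and `success` (the actual export schema may differ):

```python
from collections import Counter


def tool_call_share(records):
    """Return {provider: share of successful completions ending in tool_calls}."""
    totals = Counter()
    tool_calls = Counter()
    for rec in records:
        if not rec["success"]:
            continue  # the share is defined over successful completions only
        totals[rec["provider"]] += 1
        if rec["finish_reason"] == "tool_calls":
            tool_calls[rec["provider"]] += 1
    return {provider: tool_calls[provider] / totals[provider] for provider in totals}


# Hypothetical sample rows, not real gateway data.
sample = [
    {"provider": "anthropic", "finish_reason": "tool_calls", "success": True},
    {"provider": "anthropic", "finish_reason": "stop", "success": True},
    {"provider": "anthropic", "finish_reason": "tool_calls", "success": False},
    {"provider": "openai", "finish_reason": "stop", "success": True},
]
shares = tool_call_share(sample)
print(shares)
```

The failed Anthropic row is excluded from both numerator and denominator, which is why the sample yields one tool call out of two successful completions.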
finish_reason by model
Which AI models are used most for tool calling? In April 2026 Claude Opus 4.6 returned `finish_reason = tool_calls` 59% of the time on the Requesty gateway, making it the most agentic model on the platform. Gemini 2.5 Flash came second at 37%. Same-family Claude Sonnet 4.5 reached only 9%, and the entire OpenAI lineup (GPT-4o, GPT-4.1-mini, GPT-4.1-nano, GPT-5-mini) sat under 4%.
Token-weighted tool_calls
What share of LLM output tokens is spent on tool calls vs chat? In April 2026 on the Requesty gateway, Anthropic emitted 38.8% of its output tokens on `tool_calls` completions, which made up 54.2% of its requests, so a tool-call completion is roughly 30% smaller than the average completion. OpenAI Responses showed the opposite: 34.2% of tokens vs 26.4% of requests. Vertex (Claude) had the biggest negative gap (6.1% of tokens vs 27.6% of requests).
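The gap between token share and request share can be converted into a relative completion size. A rough sketch of the arithmetic, assuming every completion that does not end in `tool_calls` counts as chat (the function name is illustrative):

```python
def relative_size(token_share, request_share):
    """Mean output size of tool-call completions relative to
    (a) the average completion and (b) non-tool-call completions."""
    vs_average = token_share / request_share
    other_vs_average = (1 - token_share) / (1 - request_share)
    return vs_average, vs_average / other_vs_average


# Anthropic figures quoted above: 38.8% of tokens, 54.2% of requests.
vs_avg, vs_other = relative_size(0.388, 0.542)
print(round(vs_avg, 2))    # 0.72: about 28% smaller than the average completion
print(round(vs_other, 2))  # 0.54: about 46% smaller than a non-tool-call completion
```

The two baselines give noticeably different answers, so it is worth stating which one a "smaller than" claim refers to.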
OSS family share
Which open-weight AI model is most popular in 2026? On the Requesty gateway, OSS-routed traffic went from Qwen-dominated in late 2025 (34-38% share in Nov-Dec) to DeepSeek-dominated in January 2026 (77% after the R1 launch), and back to a genuinely diversified state by April (DeepSeek 47%, Kimi 17%, MiniMax 15%). Qwen collapsed from 38% to under 4% almost overnight when DeepSeek R1 shipped.
Reasoning-token share
How much of LLM output is reasoning/thinking tokens? In April 2026 on the Requesty gateway, Groq led at 82%, followed by Coding (79%), xAI (60%) and z.ai (51%). These routes are dominated by thinking models. Frontier routes ran around a third: Vertex (Gemini) 40%, OpenAI 36%, OpenAI Responses 33%. Anthropic and Bedrock report 0% because Anthropic does not surface reasoning tokens separately; extended thinking is delivered inline.
Cost per user by agent
How much does a typical coding agent user spend per month? Across nine agents observed over twelve months through the Requesty gateway, the weighted average rose from $14/month to $54/month ($91 for active users with 2+ active days). Claude Code active users average $108/month (median $23, P95 $296) in April 2026. Roo Code active users spend $79/month, OpenCode $104/month, and Cline $49/month.
Agent cache hit rate
Which coding agents use prompt caching most effectively? In April 2026, Claude Code led at 92% cache hit rate (cached_tokens / input_tokens), followed by OpenCode at 89%. Kilo Code sits at 46% with 62K avg input tokens. The gap is architectural: agents that maintain consistent context prefixes across sequential calls achieve dramatically higher cache reuse.
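The prefix effect can be illustrated with the ratio itself. A minimal sketch, assuming the rate is aggregated as summed cached tokens over summed input tokens (a mean of per-call ratios would be a different metric) and using hypothetical field names and numbers:

```python
def cache_hit_rate(calls):
    """Token-weighted cache hit rate: sum(cached_tokens) / sum(input_tokens)."""
    cached = sum(call["cached_tokens"] for call in calls)
    total = sum(call["input_tokens"] for call in calls)
    return cached / total if total else 0.0


# Hypothetical session: each call re-sends the previous context as a stable
# prefix, so everything except the newly appended tokens can be served from cache.
session = [
    {"input_tokens": 1000, "cached_tokens": 0},     # first call, cold cache
    {"input_tokens": 1200, "cached_tokens": 1000},  # previous prefix reused
    {"input_tokens": 1400, "cached_tokens": 1200},
]
print(round(cache_hit_rate(session), 2))  # 0.61
```

With a stable prefix the rate climbs toward 1 as the session gets longer, which is consistent with deep-session agents showing the highest figures.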
Agent model share
How much of coding agent spend goes to Claude? In April 2026, Claude models power 79% to 100% of spend across all nine coding agents observed through the Requesty gateway. Claude Code is nearly 100% locked to Claude (expected, as Anthropic's own product). Zed is the most model-diverse at 59% Claude / 41% OpenAI. OpenCode has the highest non-Claude adoption among open-source agents at 13% OpenAI.
Agent finish reasons
How do coding agent API calls end? In April 2026, Roo Code leads with 91% of calls finishing via tool_calls, the primary agentic pattern. Claude Code follows at 73%. Cline (81% stop) and Aider (87% stop) favor single-turn completions. Kilo Code shows 63% tool_calls and 28% stop, a balanced mix of agentic and single-turn patterns.
Agent session depth
How many API calls does a single coding session make? In April 2026, Claude Code has the deepest sessions at 16 median calls per trace and reaches 209 calls at P95, reflecting complex multi-step coding workflows. Roo Code sessions are shallower at 11 median calls but more numerous (6,247 traces vs 594 for Claude Code).
Agent streaming adoption
Do coding agents stream their API responses? In April 2026, most agents stream nearly 100% of calls. Aider is the major outlier at 22% streaming, preferring batch completions. Claude Code streams 93% of calls. Aider also has the highest reasoning token intensity at 82%, suggesting it relies on reasoning models in non-streaming mode.
Latency and performance
6 datasets
Latency leaderboard
Which AI provider has the lowest latency in April 2026? On the Requesty gateway xAI led p50 at 0.6 s, with Novita (0.8 s), Azure (1.0 s) and Mistral (1.4 s) close behind. Vertex (Claude) was the slowest at 13.7 s, 23× the fastest and 2.8× slower than Vertex (Gemini) at 4.9 s on the same Vertex route. Anthropic-direct sat mid-pack at 5.8 s with a 52.6 s p95 long tail.
Throughput density
How many tokens per second can each LLM provider sustain? In April 2026 on the Requesty gateway Groq led at 320 output tok/sec, 2.5× the next-fastest provider, attributable to its custom inference silicon. Vertex (Gemini) was second at 130 tok/sec, Mistral 120 tok/sec; OSS aggregator routes (Nebius, Minimaxi, DeepInfra) clustered at 23-26 tok/sec; Bedrock was slowest at 15 tok/sec, 21× behind Groq.
Streaming TTFT
Which AI provider has the fastest time-to-first-token? In April 2026 on streaming-and-successful Requesty requests, Azure led TTFT at 593 ms with a 960 ms p50 total, the streaming-UX winner on both axes. xAI was among the fastest on total latency (5.68 s) but slowest to first token (3.27 s), which suggests buffered upstream behaviour rather than true streaming. Vertex (Gemini) and Vertex (Claude) sit at very different points: Gemini totals 3.05 s, Claude totals 8.03 s on the same Vertex route.
p50 latency YoY
Has LLM latency improved over the past year? On the Requesty gateway, open-source aggregator routes compressed dramatically between April 2025 and April 2026. xAI fell 93% (9.1 s to 0.6 s), DeepInfra 91% (15.8 s to 1.4 s), DeepSeek 62% (24.3 s to 9.2 s). Frontier providers barely moved (OpenAI -5%, Anthropic 0%). Vertex (Claude) is the only major route that got slower, +131%, as heavy agentic Claude Code workloads landed on it.
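The year-over-year percentages follow directly from the quoted p50 values. A small check, using only the figures stated above (the dictionary layout is illustrative):

```python
def pct_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100


# p50 latency in seconds: (April 2025, April 2026), as quoted in the text.
p50 = {
    "xAI": (9.1, 0.6),
    "DeepInfra": (15.8, 1.4),
    "DeepSeek": (24.3, 9.2),
}
changes = {name: round(pct_change(before, after)) for name, (before, after) in p50.items()}
print(changes)  # {'xAI': -93, 'DeepInfra': -91, 'DeepSeek': -62}
```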
Cache hit rate
Which AI providers have the highest prompt-cache hit rate? In April 2026 Anthropic-direct led the Requesty gateway at 77% (cached_tokens / input_tokens), Bedrock Claude was healthy at 57%, and Vertex (Claude) trailed at 24%: the same Claude model family, but a roughly 3× lower hit rate than Anthropic-direct. Vertex (Gemini) sat at 10% and Mistral at 4%, the floor among major routes.
Claude Code latency by provider
How does Claude Code latency vary by cloud provider? In April 2026, Anthropic Haiku is the fastest at 1.8 s median provider latency. Opus latency is remarkably consistent across providers (4.5-4.9 s). Vertex Sonnet is the slowest at 6.2 s, roughly 40% slower than the same model on Anthropic direct.
Reliability and ops
4 datasets
Operational metrics
How reliable is each LLM provider in production? In April 2026 the top providers on the Requesty gateway (OpenAI, Anthropic, Vertex (Gemini), Bedrock, DeepSeek, Novita, xAI) sat at 95-99% success rate. Azure trailed at 78%, Vertex (Claude) at 84%, Mistral at 86%, and Moonshot at 6%, a real reliability outlier. Streaming adoption is bimodal too: Azure 68%, Anthropic 57%, everyone else under 30%.
Provider errors
Why do LLM provider requests fail? Among April 2026 requests on the Requesty gateway where the upstream provider returned a non-success response, 65.8% were 429 (rate limit), 19.4% were 400 (bad request: schema mismatches, oversized payloads), and 9.4% were 403 (forbidden). 5xx availability incidents (503, 502, 529, 500, 504, 520) summed to ~4.8%. Router- and gateway-level rejections are filtered out so the chart shows only what providers themselves emit when they fail.
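A breakdown like this amounts to counting upstream status codes and grouping the 5xx class. A minimal sketch, assuming gateway-level rejections have already been filtered out and using made-up sample codes:

```python
from collections import Counter


def status_mix(status_codes):
    """Return {status_code: share of failed upstream responses}, most common first."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.most_common()}


def five_xx_share(mix):
    """Combined share of server-side (5xx) failures."""
    return sum(share for code, share in mix.items() if 500 <= code <= 599)


# Hypothetical failed-response codes, not real gateway data.
failures = [429] * 6 + [400] * 2 + [403, 503]
mix = status_mix(failures)
print(mix[429])            # 0.6
print(five_xx_share(mix))  # 0.1
```

Note that 529 and 520 are not standard HTTP status codes; the range check treats them as 5xx, matching how the text groups them.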
Policy vs direct reliability
How much does using a routing policy improve LLM reliability? In April 2026 the Requesty managed-fallback policy cohort hit 99.25% eventual success rate, vs 85.01% for users calling a single provider directly. That is a 14.2 pp lift, up from a +3.0 pp gap in January. Policy reliability held a tight 97.5-99.3% band across all four months while the direct cohort swung 12 pp; the widening is driven by direct-cohort regressions, not policy degradation.
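"Eventual success" here is read as: a request counts as successful if any attempt in its fallback chain succeeds. A sketch under that assumption, with hypothetical attempt logs (this is an illustration of the metric, not Requesty's implementation):

```python
def eventual_success_rate(attempts_per_request):
    """Share of requests where at least one attempt succeeded.

    `attempts_per_request` is a list of per-request lists of booleans,
    one boolean per attempt in order.
    """
    succeeded = sum(any(attempts) for attempts in attempts_per_request)
    return succeeded / len(attempts_per_request)


# Hypothetical logs: direct callers make one attempt, policy users may fall back.
direct = [[True], [False], [True], [False]]
policy = [[True], [False, True], [True], [False, False, True]]
print(eventual_success_rate(direct))  # 0.5
print(eventual_success_rate(policy))  # 1.0

# Percentage-point lift from the rates quoted above.
lift_pp = 99.25 - 85.01
print(round(lift_pp, 1))  # 14.2
```

The lift is a difference of percentages (percentage points), not a relative improvement; the relative change would be computed against the 85.01% baseline.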
Agent error rate
How reliable are AI coding agents? In April 2026, Roo Code leads with a 2.5% error rate across 147K calls. Claude Code sits at 7.0% across 494K calls, Kilo Code at 10.0% across 23K calls, and Forge trails at 11.2% across 1.1K calls.
Crawl /data/llms.txt for an indexed list of every dataset with abstracts and machine-readable links.
Each dataset ships APA + BibTeX, a permanent slug, and revision history. Time-windowed slugs never break.
Every dataset exports machine-readable JSON, CSV, and Markdown. Schemas are stable, units are explicit.