# Requesty Data — Full

> Open data notes from the Requesty LLM gateway. Each note has a permanent URL, an interactive chart, key findings, caveats, and machine-readable downloads. Free to cite under CC BY 4.0.

This file is the extended `llms-full.txt` variant: it inlines the full content of every note in this catalog so an AI agent can ingest the whole hub in a single fetch. The compact link-only index is at https://requesty.ai/data/llms.txt; the human-friendly catalog homepage is at https://requesty.ai/data. Each note also exposes a JSON endpoint at `/data.json`, a CSV at `/data.csv`, and an individual Markdown export at `/data.md`.

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). Attribution: "Requesty Data, https://requesty.ai/data". Source: Requesty production gateway. Server timezone is UTC.
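For agents that want a single note rather than this whole file, the per-note endpoints above are enough. A minimal sketch in Python, assuming only the URL scheme documented here; the JSON response schema itself is not specified in this file, so inspect it before relying on field names:

```python
# Fetch one note's machine-readable exports from its permanent URL.
# Only the /data.json and /data.csv URL scheme is documented in this
# file; the JSON field layout is not, so treat it as unknown until
# inspected.
import requests

BASE = "https://requesty.ai/data/finish-reason-mix-by-provider-april-2026"

resp = requests.get(f"{BASE}/data.json", timeout=30)
resp.raise_for_status()
print(resp.json())  # inspect the actual schema before depending on it

# The CSV export loads directly into pandas:
# import pandas as pd
# df = pd.read_csv(f"{BASE}/data.csv")
```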
---

## Topic: Agentic workloads

---

# finish_reason mix per provider, April 2026

> Which AI providers serve the most agentic traffic? In April 2026 Anthropic-direct returned `finish_reason = tool_calls` on 52% of successful completions on the Requesty gateway, about 2× the next provider and roughly 16× higher than OpenAI direct. OpenAI Responses (26%), Vertex (Claude) (23%) and Azure (23%) formed a clear second tier. Splitting Vertex into Gemini and Claude cohorts shows the gap inside that route: Vertex (Claude) 23% vs Vertex (Gemini) 13%.

*Topic: Agentic workloads. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/finish-reason-mix-by-provider-april-2026.*

## Why it matters

`finish_reason = tool_calls` is the cleanest signal that a model was driving an agent loop rather than answering a chat prompt. Providers cluster into clear agentic and non-agentic tiers, which has direct implications for routing. Sending agent traffic to a non-agentic provider often produces shorter context windows and worse tool-following without users realising why their agent feels "dumber". A sketch of what this signal looks like at the API level follows the key findings below.

## Questions this answers

- Which LLM provider is best for agentic workloads?
- What share of LLM traffic uses tool calls in 2026?
- Which AI providers are best for AI agents?
- Why does Anthropic dominate agent traffic vs OpenAI?

## Key findings

1. Anthropic-direct: 52% tool_calls, the highest agentic share on the platform.
2. OpenAI Responses (26%), Vertex (Claude) (23%) and Azure (23%) form a clear second tier.
3. Vertex (Claude) at 23% versus Vertex (Gemini) at 13%: the same provider routing carries a nearly 2× different workload mix.
4. OpenAI direct is at 3% tool_calls, roughly 16× lower than Anthropic-direct.
5. Bedrock Claude (7%) versus Anthropic-direct Claude (52%): same model, very different workload mix.
6. NULL finish_reason correlates with successful=false. Moonshot's 94% blank share is a reliability outlier on that route.
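The sketch below shows the signal this note counts, as it appears in an OpenAI-compatible agent loop. The Requesty base URL, model slug, and tool definition are illustrative assumptions for the sketch, not part of this note's data:

```python
# Hedged sketch: finish_reason == "tool_calls" marks an agentic turn,
# while "stop" marks an ordinary chat completion. Base URL, API key,
# and model slug are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="https://router.requesty.ai/v1", api_key="...")

resp = client.chat.completions.create(
    model="anthropic/claude-opus-4-6",  # assumed model slug
    messages=[{"role": "user", "content": "What is 17 * 24? Use the calculator."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate an arithmetic expression",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
)

choice = resp.choices[0]
if choice.finish_reason == "tool_calls":
    # Agentic turn: the model wants a tool executed before answering.
    for call in choice.message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    # Plain chat turn ("stop"), or truncation ("length").
    print(choice.finish_reason, choice.message.content)
```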
## Data

| Provider | tool_calls (percent) | stop (percent) | length (percent) | blank/error (percent) |
| --- | --- | --- | --- | --- |
| Anthropic | 52.20% | 42.60% | 1.40% | 3.80% |
| OpenAI Responses | 25.90% | 71.00% | 1.00% | 2.10% |
| Vertex (Claude) | 23.40% | 56.20% | 5.30% | 15.10% |
| Azure | 22.60% | 57.50% | 0.40% | 19.50% |
| Vertex (Gemini) | 13.50% | 79.00% | 3.30% | 4.20% |
| Bedrock | 6.70% | 88.50% | 0.50% | 4.30% |
| Moonshot | 4.60% | 1.40% | 0.10% | 93.90% |
| OpenAI | 3.30% | 94.20% | 0.60% | 1.90% |
| xAI | 2.90% | 96.20% | 0.20% | 0.70% |
| DeepSeek | 1.50% | 94.50% | 2.20% | 1.80% |

## Caveats

- Apr 2026 only. finish_reason was not populated for any 2025 row.
- Moonshot's 94% blank/error share is a reliability problem, not a labeling artefact (success rate 6.2%).

## Cite as

**APA.** Requesty (2026). finish_reason mix per provider, April 2026. Requesty Data. https://requesty.ai/data/finish-reason-mix-by-provider-april-2026

```bibtex
@misc{requesty_finish_reason_mix_by_provider_april_2026,
  author = {{Requesty}},
  title = {finish\_reason mix per provider, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/finish-reason-mix-by-provider-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/finish-reason-mix-by-provider-april-2026/data.json) · [CSV](https://requesty.ai/data/finish-reason-mix-by-provider-april-2026/data.csv) · [Markdown](https://requesty.ai/data/finish-reason-mix-by-provider-april-2026/data.md)

---

# finish_reason mix per model, April 2026

> Which AI models are used most for tool calling? In April 2026 Claude Opus 4.6 returned `finish_reason = tool_calls` 59% of the time on the Requesty gateway, the most agentic model on the platform. Gemini 2.5 Flash came second at 37%. Claude Sonnet 4.5, from the same family as Opus, sat at only 9%, and the entire OpenAI lineup (GPT-4o, GPT-4.1-mini, GPT-4.1-nano, GPT-5-mini) sat under 4%.

*Topic: Agentic workloads. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/finish-reason-mix-by-model-april-2026.*

## Why it matters

Two models from the same provider can have completely different agentic profiles, which means choosing a frontier model for an agent based on brand alone is a coin flip. The headline "Anthropic is agentic" framing on the per-provider chart is really an Opus 4.6 effect: Sonnet 4.5 behaves more like a chat model in production traffic, despite both being marketed as agentic-capable.

## Questions this answers

- Which AI models are used most for tool calling?
- Is Claude Opus more agentic than Claude Sonnet in production?
- Which OpenAI models do AI agents use?
- How agentic is Gemini 2.5 Flash compared to Claude?

## Key findings

1. claude-opus-4-6: 59% tool_calls. The single most agentic model on the platform.
2. gemini-2.5-flash: 37% tool_calls. The mid-tier general-purpose model that is doing real agentic work.
3. claude-sonnet-4-5: 9% tool_calls. The same provider, the same family, dramatically less agentic.
4. OpenAI lineup (gpt-4o, gpt-4.1-mini, gpt-4.1-nano, gpt-5-mini): all under 4% tool_calls.
5. Practical implication: the "agentic provider" framing on the per-provider chart is really an "Opus 4.6 effect". Anthropic-direct looks agentic because Opus is.

## Data

| Model | tool_calls (percent) | stop (percent) | length (percent) |
| --- | --- | --- | --- |
| claude-opus-4-6 | 59.40% | 39.50% | 1.10% |
| gemini-2.5-flash | 36.60% | 61.20% | 2.10% |
| claude-sonnet-4-5 | 9.10% | 90.70% | 0.20% |
| gpt-5-mini | 3.50% | 94.00% | 2.40% |
| deepseek-chat | 0.50% | 97.20% | 2.30% |
| gpt-4o | 0.20% | 99.80% | 0.00% |
| gpt-4.1-mini | 0.20% | 99.80% | 0.00% |
| grok-4-1-fast | 0.10% | 99.80% | 0.10% |
| gpt-4.1-nano | 0.00% | 99.90% | 0.00% |
| gemini-2.5-flash-lite | 0.00% | 99.80% | 0.20% |

## Caveats

- finish_reason was not populated before 2026, so this is April 2026 only.
- Aggregating finish_reason at the model level smooths over how the model is invoked. A model used inside an agent loop will show more tool_calls than the same model used in a one-shot chatbot.

## Cite as

**APA.** Requesty (2026). finish_reason mix per model, April 2026. Requesty Data. https://requesty.ai/data/finish-reason-mix-by-model-april-2026

```bibtex
@misc{requesty_finish_reason_mix_by_model_april_2026,
  author = {{Requesty}},
  title = {finish\_reason mix per model, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/finish-reason-mix-by-model-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/finish-reason-mix-by-model-april-2026/data.json) · [CSV](https://requesty.ai/data/finish-reason-mix-by-model-april-2026/data.csv) · [Markdown](https://requesty.ai/data/finish-reason-mix-by-model-april-2026/data.md)

---

# Token-weighted tool_calls share per provider, April 2026

> What share of LLM output tokens is spent on tool calls vs chat? In April 2026 on the Requesty gateway, Anthropic emitted 38.8% of its output tokens on `tool_calls` vs 54.2% of requests, so agentic completions are roughly 30% smaller than chat ones. OpenAI Responses showed the opposite: 34.2% of tokens vs 26.4% of requests. Vertex (Claude) had the biggest negative gap (6.1% of tokens vs 27.6% of requests).

*Topic: Agentic workloads. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/tool-call-token-share-april-2026.*

## Why it matters

Counting requests overweights short tool-call payloads; counting tokens overweights long chat replies. Two providers with the same request-level agentic share can have wildly different agentic token shares, which matters for capacity planning, billing reconciliation, and any benchmark that aggregates over tokens rather than calls. Pick the wrong axis and the same provider can look 5× more or less agentic; the toy recomputation after the key findings makes the mechanics concrete.

## Questions this answers

- What share of AI output tokens is spent on tool calls?
- Are tool-call payloads bigger or smaller than chat replies?
- Why do request-counts and token-counts disagree on agentic share?
- Which providers have the most token-heavy tool calls?

## Key findings

1. Anthropic: 38.8% of output tokens vs 54.2% of requests. Agentic completions are ~30% smaller than chat ones; tool_calls payloads are compact.
2. OpenAI Responses: 34.2% of output tokens vs 26.4% of requests. The opposite shape: agentic completions emit more tokens than chat ones.
3. Vertex (Claude): 6.1% of tokens vs 27.6% of requests. The biggest negative gap on the chart. Claude on Vertex is dominated by lots of small tool-call payloads, while chat completions on the same route are heavy.
4. Vertex (Gemini): 1.5% of tokens vs 14.1% of requests. Same shape as Vertex (Claude) but more extreme. Gemini chat replies are huge, so agentic completions barely register on the token-weighted view.
5. xAI: 17.2% of tokens vs 2.9% of requests. Few agentic calls, but each one is verbose.
6. OpenAI direct: 2.7% of tokens vs 3.4% of requests. The two views agree: there is barely any agentic load on this route in either framing.
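A toy recomputation of both shares from per-request rows shows why the two views diverge. The field names are illustrative rather than the gateway's actual schema; the note's real columns live in its data.csv:

```python
# One long chat reply is enough to pull the token-weighted share far
# below the request-weighted share, even when half the requests are
# agentic. Field names are illustrative assumptions.
rows = [
    # (finish_reason, output_tokens)
    ("tool_calls", 120),
    ("tool_calls", 90),
    ("stop", 800),  # one long chat reply dominates the token total
    ("stop", 400),
]

agentic = [(reason, tok) for reason, tok in rows if reason == "tool_calls"]

request_share = len(agentic) / len(rows)
token_share = sum(tok for _, tok in agentic) / sum(tok for _, tok in rows)

print(f"request-weighted: {request_share:.0%}")  # 50%
print(f"token-weighted:   {token_share:.0%}")    # 15%
```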
## Data

| Provider | Tool-call output-token share (percent) | Tool-call request share (percent) | Gap (token − request) (percentage points) |
| --- | --- | --- | --- |
| Moonshot | 54.70% | 75.00% | -20.30% |
| Minimaxi | 52.50% | 50.80% | 1.70% |
| Anthropic | 38.80% | 54.20% | -15.40% |
| OpenAI Responses | 34.20% | 26.40% | 7.80% |
| Azure | 18.00% | 27.90% | -9.90% |
| xAI | 17.20% | 2.90% | 14.30% |
| Bedrock | 14.40% | 7.00% | 7.40% |
| Alibaba | 12.20% | 1.70% | 10.50% |
| Vertex (Claude) | 6.10% | 27.60% | -21.50% |
| Novita | 3.00% | 1.90% | 1.10% |
| OpenAI | 2.70% | 3.40% | -0.70% |
| Vertex (Gemini) | 1.50% | 14.10% | -12.60% |
| DeepSeek | 1.20% | 1.50% | -0.30% |
| Mistral | 1.00% | 1.90% | -0.90% |
| Nebius | 0.90% | 3.50% | -2.60% |
| Groq | 0.80% | 1.00% | -0.20% |
| DeepInfra | 0.30% | 0.10% | 0.20% |

## Cite as

**APA.** Requesty (2026). Token-weighted tool_calls share per provider, April 2026. Requesty Data. https://requesty.ai/data/tool-call-token-share-april-2026

```bibtex
@misc{requesty_tool_call_token_share_april_2026,
  author = {{Requesty}},
  title = {Token-weighted tool\_calls share per provider, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/tool-call-token-share-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/tool-call-token-share-april-2026/data.json) · [CSV](https://requesty.ai/data/tool-call-token-share-april-2026/data.csv) · [Markdown](https://requesty.ai/data/tool-call-token-share-april-2026/data.md)

---

# Family share within OSS-routed traffic, Nov 2025 - Apr 2026

> Which open-weight AI model is most popular in 2026? On the Requesty gateway, OSS-routed traffic went from Qwen-dominated in late 2025 (34-38% share in Nov-Dec) to DeepSeek-dominated in January 2026 (77% after the R1 launch), and back to a genuinely diversified state by April (DeepSeek 47%, Kimi 17%, MiniMax 15%). Qwen collapsed from 38% to under 4% almost overnight when DeepSeek R1 shipped.

*Topic: Agentic workloads. Period: Nov 2025 - Apr 2026. Last updated 2026-05-10. Permanent URL: https://requesty.ai/data/oss-family-share-jan-apr-2026.*

## Why it matters

Open-source LLM leadership rotates on a months-not-years timescale: the "best open model" changes with each new release, and the long tail diversifies fast once any single model loses its lead. For teams hard-coding OSS choices into prompts or routing rules, that means yesterday's default is often already wrong. Kimi K2 quintupling in three months is the clearest current example.

## Questions this answers

- Which open-source LLM is most popular in 2026?
- Has DeepSeek overtaken Qwen for open-weight traffic?
- How fast does open-source AI model leadership change?
- Is Kimi K2 gaining real production traction?

## Key findings

1. Qwen (Alibaba): 34% in Nov, 38% in Dec, then collapsed to under 4% from January onward. The DeepSeek R1 launch killed Qwen share overnight.
2. DeepSeek: 10% in Nov, exploded to 77% in Jan (R1 launch), declining since to 47% in Apr.
3. Kimi (Moonshot): volatile. 10% Nov, 16% Dec, collapsed to 2% Jan, back to 17% Apr.
4. MiniMax: 14% Nov, near-zero Dec, recovered to 15% by Apr.
5. The OSS tier went from concentrated (one family >33%) to diversified (no family >47%) in six months.
## Data

| Month | DeepSeek (percent) | MiniMax (percent) | Kimi (Moonshot) (percent) | Mistral (percent) | GLM (Zhipu) (percent) | Qwen (Alibaba) (percent) | Llama (Meta) (percent) | GPT-OSS (OpenAI) (percent) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nov 2025 | 10.29% | 13.59% | 9.64% | 8.66% | 14.01% | 33.64% | 2.51% | 7.64% |
| Dec 2025 | 17.43% | 0.48% | 15.82% | 13.59% | 7.49% | 37.55% | 2.11% | 5.44% |
| Jan 2026 | 76.70% | 6.39% | 1.73% | 6.47% | 3.17% | 3.85% | 0.68% | 1.02% |
| Feb 2026 | 63.75% | 5.39% | 7.03% | 11.89% | 6.39% | 3.38% | 1.00% | 1.19% |
| Mar 2026 | 65.17% | 13.96% | 4.39% | 4.99% | 7.52% | 1.61% | 1.57% | 0.80% |
| Apr 2026 | 46.68% | 14.58% | 16.75% | 9.65% | 6.71% | 1.72% | 3.01% | 0.90% |

## Caveats

- OSS is defined as traffic routed through open-source aggregator providers (not frontier APIs). The boundary is imperfect.

## Cite as

**APA.** Requesty (2026). Family share within OSS-routed traffic, Nov 2025 - Apr 2026. Requesty Data. https://requesty.ai/data/oss-family-share-jan-apr-2026

```bibtex
@misc{requesty_oss_family_share_jan_apr_2026,
  author = {{Requesty}},
  title = {Family share within OSS-routed traffic, Nov 2025 - Apr 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/oss-family-share-jan-apr-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/oss-family-share-jan-apr-2026/data.json) · [CSV](https://requesty.ai/data/oss-family-share-jan-apr-2026/data.csv) · [Markdown](https://requesty.ai/data/oss-family-share-jan-apr-2026/data.md)

---

# Reasoning-token share of provider output, April 2026

> How much of LLM output is reasoning/thinking tokens? In April 2026 on the Requesty gateway, Groq led at 82%, followed by Coding (79%), xAI (60%) and z.ai (51%). These routes are dominated by thinking models. Frontier routes ran around a third: Vertex (Gemini) 40%, OpenAI 36%, OpenAI Responses 33%. Anthropic and Bedrock report 0% because Anthropic does not surface reasoning tokens separately; extended thinking is delivered inline.

*Topic: Agentic workloads. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/reasoning-token-share-by-provider-april-2026.*

## Why it matters

The industry narrative is "everything is reasoning now", but the data says reasoning is concentrated in a specific subset of routes, and even there, absolute volume is dwarfed by regular completion output. The Anthropic and Bedrock 0% is a measurement artefact, not a usage signal, which matters for any cost or quality comparison that relies on the reasoning-tokens column. A sketch of where that column comes from follows the key findings.

## Questions this answers

- How much LLM output is reasoning tokens?
- Which providers use the most reasoning models in 2026?
- Why does Anthropic show 0% reasoning tokens?
- Are AI agents mostly thinking or mostly responding?

## Key findings

1. High-reasoning routes: Groq 82%, Coding 79%, xAI 60%, z.ai 51%.
2. Frontier routes around a third: Vertex (Gemini) 40%, OpenAI 36%, OpenAI Responses 33%.
3. Vertex (Claude) does not appear here: Anthropic does not report reasoning tokens separately, so Claude thinking output is not counted.
4. Azure at 18% leans on GPT-4.1-class models more than the latest reasoning checkpoints.
5. Anthropic, Bedrock, Mistral, Moonshot: 0%. Anthropic does not report reasoning tokens separately (thinking is inline). Mistral and Moonshot have no reasoning models routed.
6. The industry narrative is "everything is reasoning now"; the data says reasoning is concentrated in a specific subset of providers, and even there the absolute volume is dwarfed by regular completion output.
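On OpenAI-style routes, the numerator behind this share is the separate reasoning-token counter in the usage block; Anthropic delivers thinking inline as content blocks with no separate counter to sum, which produces the 0% readings above. A hedged sketch; the base URL and model slug are illustrative assumptions:

```python
# Read the reasoning-token counter from an OpenAI-compatible response.
# Reasoning tokens are billed as output and included in completion_tokens,
# so the ratio below is the per-request version of this note's metric.
# Base URL and model slug are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="https://router.requesty.ai/v1", api_key="...")

resp = client.chat.completions.create(
    model="openai/gpt-5-mini",  # assumed slug for a reasoning model
    messages=[{"role": "user", "content": "Plan a 3-step refactor."}],
)

usage = resp.usage
details = getattr(usage, "completion_tokens_details", None)
reasoning = getattr(details, "reasoning_tokens", 0) or 0  # absent on some routes
share = reasoning / usage.completion_tokens if usage.completion_tokens else 0.0
print(f"reasoning tokens: {reasoning} ({share:.0%} of output)")
```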
## Data

| Provider | Reasoning share (percent) |
| --- | --- |
| Groq | 82.30% |
| Coding | 79.00% |
| xAI | 59.70% |
| z.ai | 51.30% |
| Vertex (Gemini) | 39.90% |
| Minimaxi | 37.20% |
| OpenAI | 35.90% |
| OpenAI Responses | 32.50% |
| Azure | 18.10% |
| Novita | 3.00% |
| DeepSeek | 2.70% |

## Caveats

- Reasoning tokens were not tracked before 2026, so this is April 2026 only. Year-over-year comparison is not possible.
- A 0% reading does not necessarily mean a provider has no reasoning models - only that reasoning output is not reported separately on that route (e.g. Anthropic delivers thinking inline).

## Cite as

**APA.** Requesty (2026). Reasoning-token share of provider output, April 2026. Requesty Data. https://requesty.ai/data/reasoning-token-share-by-provider-april-2026

```bibtex
@misc{requesty_reasoning_token_share_by_provider_april_2026,
  author = {{Requesty}},
  title = {Reasoning-token share of provider output, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/reasoning-token-share-by-provider-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/reasoning-token-share-by-provider-april-2026/data.json) · [CSV](https://requesty.ai/data/reasoning-token-share-by-provider-april-2026/data.csv) · [Markdown](https://requesty.ai/data/reasoning-token-share-by-provider-april-2026/data.md)

---

## Topic: Latency and performance

---

# Latency leaderboard per provider, April 2026

> Which AI provider has the lowest latency in April 2026? On the Requesty gateway xAI led p50 at 0.6 s, with Novita (0.8 s), Azure (1.0 s) and Mistral (1.4 s) close behind. Vertex (Claude) was the slowest at 13.7 s, 23× the fastest and 2.8× slower than Vertex (Gemini) at 4.9 s on the same Vertex route. Anthropic-direct sat mid-pack at 5.8 s with a 52.6 s p95 long tail.

*Topic: Latency and performance. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/provider-latency-leaderboard-april-2026.*

## Why it matters

Total p50 latency is dominated by workload type, not pure provider speed. The 23× spread is partly silicon, partly streaming behaviour, but mostly the size and tool-call complexity of requests being sent. The Vertex-Claude tail is heavy agentic Claude Code traffic, not slow inference. Reading the leaderboard literally without that context will mislead any provider-selection decision.

## Questions this answers

- Which LLM provider has the lowest latency in 2026?
- What is the fastest LLM provider for chat completions?
- Why is Vertex Claude so slow compared to Anthropic direct?
- What is the p95 latency of OpenAI vs Anthropic?

## Key findings

1. p50 spans 23× from fastest to slowest: xAI 0.6 s to Vertex (Claude) 13.7 s.
2. Fast tier: xAI (0.6 s), Novita (0.8 s), Azure (1.0 s), Mistral (1.4 s).
3. Vertex split is striking: Vertex (Gemini) 4.9 s, Vertex (Claude) 13.7 s. Same provider routing, very different workload weight.
4. Frontier-Claude tier: Anthropic 5.8 s p50 with a long tail (p95 52.6 s); DeepSeek's tail is even longer (p95 74.0 s).
5. TTFT is decoupled from total latency: Azure is fastest to first token (0.6 s) with a 1.0 s total p50.
6. xAI is fast on total latency but slow to first token (3.27 s TTFT), suggesting buffered or non-streaming upstream behaviour.
## Data

| Provider | p50 latency | p95 latency | p50 TTFT |
| --- | --- | --- | --- |
| xAI | 600 ms | 10.9 s | 3.27 s |
| Novita | 800 ms | 18.5 s | 3.10 s |
| Azure | 1.00 s | 8.80 s | 600 ms |
| Mistral | 1.40 s | 9.80 s | 1.01 s |
| OpenAI | 2.50 s | 17.9 s | 1.84 s |
| Bedrock | 2.80 s | 23.8 s | 1.86 s |
| Vertex (Gemini) | 4.90 s | 27.2 s | 1.28 s |
| Anthropic | 5.80 s | 52.6 s | 2.14 s |
| Moonshot | 5.90 s | 64.1 s | 2.62 s |
| DeepSeek | 9.00 s | 74.0 s | 1.17 s |
| Vertex (Claude) | 13.7 s | 115.2 s | 1.44 s |

## Caveats

- TTFT (first_token_latency_ns) was not populated before 2026, so any TTFT YoY is impossible.
- p95 is highly sensitive to the tail of long completions; treat it as an upper bound for "what the worst 5% of users feel" rather than a steady-state operating point.

## Cite as

**APA.** Requesty (2026). Latency leaderboard per provider, April 2026. Requesty Data. https://requesty.ai/data/provider-latency-leaderboard-april-2026

```bibtex
@misc{requesty_provider_latency_leaderboard_april_2026,
  author = {{Requesty}},
  title = {Latency leaderboard per provider, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/provider-latency-leaderboard-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/provider-latency-leaderboard-april-2026/data.json) · [CSV](https://requesty.ai/data/provider-latency-leaderboard-april-2026/data.csv) · [Markdown](https://requesty.ai/data/provider-latency-leaderboard-april-2026/data.md)

---

# Provider throughput density, April 2026

> How many tokens per second can each LLM provider sustain? In April 2026 on the Requesty gateway Groq led at 320 output tok/sec, 2.5× the next-fastest provider, attributable to its custom inference silicon. Vertex (Gemini) was second at 130 tok/sec, Mistral 120 tok/sec; OSS aggregator routes (Nebius, Minimaxi, DeepInfra) clustered at 23-26 tok/sec; Bedrock was slowest at 15 tok/sec, 21× behind Groq.

*Topic: Latency and performance. Period: Apr 2026. Last updated 2026-05-10. Permanent URL: https://requesty.ai/data/provider-throughput-density-april-2026.*

## Why it matters

Throughput density (output tokens per second of total wall-clock latency) is the right number to optimise streaming UX, not raw p50 latency. Two providers with identical p50 totals can deliver wildly different perceived speed depending on token rate. Vertex (Claude) is actually faster per-token than Anthropic-direct, despite higher total latency, because Vertex Claude requests emit roughly 3× more output tokens on average. A sketch of the computation, as the caveats define it, follows the key findings.

## Questions this answers

- What is the fastest LLM provider in tokens per second?
- How fast does Groq stream compared to Anthropic?
- Which LLM has the best streaming throughput?
- Is Vertex Claude faster than Anthropic direct in practice?

## Key findings

1. Groq leads at 320 tok/sec, 2.5× the next-fastest provider, attributable to its custom inference silicon.
2. Vertex (Gemini) is second at 130 tok/sec, followed by Mistral at 120 tok/sec.
3. Vertex (Claude) at 56 tok/sec is faster per-token than Anthropic-direct at 46 tok/sec, even though Vertex (Claude)'s total request latency is 2.4× higher (Vertex (Claude) requests emit ~3× more output tokens on average).
4. OSS-aggregator routes (Nebius, Minimaxi, DeepInfra) cluster in the 23-26 tok/sec band.
5. Bedrock is the slowest at 15 tok/sec, 21× behind Groq.
## Data

| Provider | p50 tokens / sec | p50 ms per token |
| --- | --- | --- |
| Groq | 320 | 3 ms |
| Vertex (Gemini) | 130 | 8 ms |
| Mistral | 120 | 8 ms |
| xAI | 65 | 16 ms |
| OpenAI | 57 | 18 ms |
| Novita | 56 | 18 ms |
| Vertex (Claude) | 56 | 18 ms |
| Anthropic | 46 | 22 ms |
| OpenAI Responses | 44 | 23 ms |
| Azure | 39 | 26 ms |
| DeepSeek | 31 | 32 ms |
| Alibaba | 28 | 36 ms |
| Moonshot | 27 | 37 ms |
| Nebius | 26 | 39 ms |
| Minimaxi | 24 | 41 ms |
| DeepInfra | 24 | 42 ms |
| Bedrock | 15 | 66 ms |

## Caveats

- p50 of a per-request rate, not a global rate. Two providers with the same throughput density can have very different total latencies if their typical output sizes differ (Vertex Claude vs Anthropic-direct is the clearest example).
- Computed on successful completions with output_tokens > 0 and total_latency_ns > 0.
- Apr 2026 only; this is a snapshot, not a trend.

## Cite as

**APA.** Requesty (2026). Provider throughput density, April 2026. Requesty Data. https://requesty.ai/data/provider-throughput-density-april-2026

```bibtex
@misc{requesty_provider_throughput_density_april_2026,
  author = {{Requesty}},
  title = {Provider throughput density, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/provider-throughput-density-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/provider-throughput-density-april-2026/data.json) · [CSV](https://requesty.ai/data/provider-throughput-density-april-2026/data.csv) · [Markdown](https://requesty.ai/data/provider-throughput-density-april-2026/data.md)

---

# Streaming TTFT vs total latency, April 2026

> Which AI provider has the fastest time-to-first-token? In April 2026 on streaming-and-successful Requesty requests, Azure led TTFT at 593 ms with a 960 ms p50 total, the streaming-UX winner on both axes. xAI, the overall p50 leader on the all-requests leaderboard, totalled 5.67 s on streaming requests and was slowest to first token (3.27 s), which suggests buffered upstream behaviour rather than true streaming. Vertex (Gemini) and Vertex (Claude) sit at very different points: Gemini totals 3.05 s, Claude totals 8.03 s on the same Vertex route.

*Topic: Latency and performance. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/streaming-ttft-vs-total-april-2026.*

## Why it matters

Time-to-first-token is what users actually feel as latency in chat UIs. A 600 ms TTFT feels instantaneous; a 3 s TTFT feels broken even if total latency is the same. Buffered streaming masquerading as real streaming is a common antipattern in this dataset, and any latency benchmark that only quotes total p50 will miss it entirely. Both numbers are easy to measure client-side, as the sketch after the key findings shows.

## Questions this answers

- What is the fastest streaming LLM provider?
- Which LLM has the lowest time to first token in 2026?
- Does xAI actually stream or is it buffered?
- How does streaming affect perceived AI latency?

## Key findings

1. Azure: 593 ms p50 TTFT, 960 ms p50 total. The streaming-UX winner on both axes.
2. Nebius (659 ms TTFT) and OpenAI Responses (731 ms) are also strong on first-token speed.
3. Vertex (Gemini) 1.29 s TTFT vs Vertex (Claude) 1.44 s TTFT. Gemini totals 3.05 s, Claude totals 8.03 s. The Claude variant carries the heavy agentic completions on this route.
4. xAI: 5.67 s p50 total with 3.27 s TTFT, which suggests the upstream buffers responses before flushing rather than truly streaming.
5. Anthropic: 2.14 s TTFT, 5.87 s total. The slowest first byte among the very large providers, but a consistent shape.
## Data

| Provider | p50 TTFT | p50 total | p95 TTFT | p95 total |
| --- | --- | --- | --- | --- |
| Alibaba | 235 ms | 1.03 s | 4.82 s | 13.4 s |
| Azure | 593 ms | 960 ms | 1.32 s | 3.35 s |
| Nebius | 659 ms | 4.14 s | 4.21 s | 41.1 s |
| OpenAI Responses | 731 ms | 6.69 s | 2.59 s | 41.5 s |
| DeepInfra | 769 ms | 2.19 s | 1.26 s | 3.63 s |
| Mistral | 1.01 s | 1.25 s | 5.35 s | 18.0 s |
| DeepSeek | 1.17 s | 5.29 s | 3.04 s | 31.7 s |
| Vertex (Gemini) | 1.29 s | 3.05 s | 19.6 s | 29.0 s |
| Vertex (Claude) | 1.44 s | 8.03 s | 4.89 s | 100.3 s |
| Bedrock | 1.85 s | 5.86 s | 7.72 s | 38.4 s |
| OpenAI | 2.00 s | 6.36 s | 15.2 s | 26.0 s |
| Anthropic | 2.14 s | 5.87 s | 4.46 s | 31.9 s |
| Moonshot | 2.62 s | 7.49 s | 12.6 s | 52.9 s |
| Minimaxi | 2.77 s | 6.14 s | 7.27 s | 24.7 s |
| Novita | 3.13 s | 7.42 s | 9.67 s | 27.9 s |
| xAI | 3.27 s | 5.67 s | 14.8 s | 20.9 s |

## Caveats

- TTFT (first_token_latency_ns) was not populated before 2026, so YoY is impossible.
- Vertex is split into Vertex (Gemini) and Vertex (Claude) by model_used; direct Google traffic is excluded as long-tail.
- A non-streaming response that the gateway reports as is_stream=true (because the SDK was set to stream but the upstream did not) will measure TTFT close to total_latency, biasing the read upward.

## Cite as

**APA.** Requesty (2026). Streaming TTFT vs total latency, April 2026. Requesty Data. https://requesty.ai/data/streaming-ttft-vs-total-april-2026

```bibtex
@misc{requesty_streaming_ttft_vs_total_april_2026,
  author = {{Requesty}},
  title = {Streaming TTFT vs total latency, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/streaming-ttft-vs-total-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/streaming-ttft-vs-total-april-2026/data.json) · [CSV](https://requesty.ai/data/streaming-ttft-vs-total-april-2026/data.csv) · [Markdown](https://requesty.ai/data/streaming-ttft-vs-total-april-2026/data.md)

---

# p50 latency YoY: April 2025 vs April 2026

> Has LLM latency improved over the past year? On the Requesty gateway, open-source aggregator routes compressed dramatically between April 2025 and April 2026. xAI fell 93% (9.1 s to 0.6 s), DeepInfra 91% (15.8 s to 1.4 s), DeepSeek 62% (24.3 s to 9.2 s). Frontier providers barely moved (OpenAI -5%, Anthropic 0%). Vertex (Claude) is the only major route that got slower, +131%, as heavy agentic Claude Code workloads landed on it.

*Topic: Latency and performance. Period: Apr 2025 to Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/provider-latency-yoy-april-2026.*

## Why it matters

The OSS-aggregator tier closed most of the latency gap to frontier providers in 12 months: routing easy work onto a cheap OSS path used to cost 5-25 seconds and now costs sub-second. Workload composition is the dominant force on aggregate latency. Vertex (Claude) getting 2.3× slower while the underlying inference stack barely changed shows that "is provider X fast?" is the wrong question to ask in isolation.

## Questions this answers

- How has LLM latency changed from 2025 to 2026?
- Are open-source LLMs as fast as OpenAI now?
- Which AI providers got faster in 2026?
- Why are some LLM routes getting slower year-over-year?

## Key findings

1. OSS aggregator routes (xAI, DeepInfra, Alibaba, Novita, Nebius) compressed 89-93% YoY.
2. xAI: 9.1 s to 0.6 s (-93%). DeepInfra: 15.8 s to 1.4 s (-91%).
3. DeepSeek: 24.3 s to 9.2 s (-62%). Still slow, but dramatically faster.
4. Frontier providers barely moved: OpenAI -5%, Anthropic 0%.
5. Vertex (Claude) is the lone exception: 6.0 s to 13.8 s (+131%). The underlying route barely changed while heavy agentic Claude Code workloads moved onto it, so the work itself got bigger.
6. Practical implication: routing easy work to a cheap OSS path used to cost 5-25 seconds of latency, and now costs sub-second.

## Data

| Provider | Apr 2025 p50 | Apr 2026 p50 | YoY delta (percent) |
| --- | --- | --- | --- |
| xAI | 9.10 s | 600 ms | -93.00% |
| DeepInfra | 15.8 s | 1.40 s | -91.00% |
| Alibaba | 5.80 s | 500 ms | -91.00% |
| Novita | 8.80 s | 800 ms | -91.00% |
| Nebius | 22.1 s | 2.30 s | -89.00% |
| DeepSeek | 24.3 s | 9.20 s | -62.00% |
| Coding | 7.90 s | 6.10 s | -23.00% |
| OpenAI | 2.60 s | 2.50 s | -5.00% |
| Anthropic | 5.90 s | 5.90 s | 0.00% |
| Vertex (Claude) | 6.00 s | 13.8 s | +131.00% |

## Caveats

- Vertex (Gemini) had no meaningful 2025 traffic so it is not in this chart. Only Vertex (Claude) is YoY-comparable.
- Vertex (Claude) Apr 2025 sample is small and the workload that lived on it has changed substantially, so the +131% delta is more about workload mix than a true latency regression.
- Customer-base composition changed YoY, so the workload mix hitting these providers is different. The quantiles themselves are well-defined wall-clock durations either way, but interpret the deltas as "providers behave differently AND the work has shifted", not as a controlled experiment.
- The `successful` flag semantics may have changed between 2025 and 2026, but quantiles over wall-clock duration are not affected.

## Cite as

**APA.** Requesty (2026). p50 latency YoY: April 2025 vs April 2026. Requesty Data. https://requesty.ai/data/provider-latency-yoy-april-2026

```bibtex
@misc{requesty_provider_latency_yoy_april_2026,
  author = {{Requesty}},
  title = {p50 latency YoY: April 2025 vs April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/provider-latency-yoy-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/provider-latency-yoy-april-2026/data.json) · [CSV](https://requesty.ai/data/provider-latency-yoy-april-2026/data.csv) · [Markdown](https://requesty.ai/data/provider-latency-yoy-april-2026/data.md)

---

# Prompt-cache hit rate per provider, April 2026

> Which AI providers have the highest prompt-cache hit rate? In April 2026 Anthropic-direct led the Requesty gateway at 77% (cached_tokens / input_tokens), Bedrock Claude was healthy at 57%, and Vertex (Claude) trailed at 24%. Same Claude model family, 3× lower hit rate. Vertex (Gemini) sat at 10% and Mistral at 4%, the floor among major routes.

*Topic: Latency and performance. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/cache-hit-rate-by-provider-april-2026.*

## Why it matters

Prompt caching directly cuts the per-request cost of long, repeated context. The difference between a 77% hit rate and a 24% hit rate on the same model family is roughly a 3× reduction in input tokens billed at full price. The Vertex-Claude gap looks like a configuration issue rather than a platform limitation, which means Claude users on Vertex are leaving substantial savings on the table without a code change. The sketch after the key findings shows where the cache counters come from on the Anthropic-direct route.

## Questions this answers

- Which AI providers have the best prompt caching hit rate?
- Why is prompt caching so much worse on Vertex Claude than on Anthropic direct?
- How much does prompt caching reduce LLM inference cost in production?
- Which providers should I avoid if I rely on prompt caching?

## Key findings

1. Anthropic-direct: 77% cache hit, the leader by a wide margin.
2. Bedrock Claude (57%), DeepSeek (48%) and OpenAI (36%): healthy mid-tier.
3. Vertex (Claude): 24%. Same model as Anthropic-direct (77%) and Bedrock (57%), 3× lower hit rate. Configuration gap.
4. Vertex (Gemini): 10%, near the floor among major routes.
5. Mistral: 4%, the floor; prompt caching is not a meaningful lever on that route today.
6. Moonshot reports 88% but it is a measurement artefact at 6% success rate; do not quote it.
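The Anthropic-direct hit rate is driven by explicit cache breakpoints in the request. A hedged sketch with the anthropic SDK: mark a long, repeated prefix as cacheable and read the cache counters back from usage. The model slug is an assumption, and, per the caveats, which tokens count as "cached" differs by provider:

```python
# Sketch: explicit prompt caching on the Anthropic-direct route.
# The model slug and prompt are illustrative assumptions; caching
# generally requires the cached prefix to exceed a minimum length.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

LONG_SYSTEM_PROMPT = "You are an agent with these standing rules... " * 200

resp = client.messages.create(
    model="claude-sonnet-4-5",  # assumed slug
    max_tokens=256,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    }],
    messages=[{"role": "user", "content": "Summarise your instructions."}],
)

u = resp.usage
# cache_read_input_tokens counts prefix tokens served from cache;
# cache_creation_input_tokens counts tokens written on a miss. This
# note's hit rate is roughly cached input over total input.
print(u.cache_read_input_tokens, u.cache_creation_input_tokens, u.input_tokens)
```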
## Data

| Provider | Cache hit rate (percent) |
| --- | --- |
| Anthropic | 77.50% |
| Bedrock | 56.90% |
| DeepSeek | 48.30% |
| Azure | 41.00% |
| OpenAI | 36.40% |
| xAI | 35.70% |
| Novita | 31.90% |
| Vertex (Claude) | 23.50% |
| Vertex (Gemini) | 9.60% |
| Mistral | 4.10% |

## Caveats

- Moonshot's 88% cache-hit reading is a measurement artefact at 6% success rate. Excluded from the leader panel.
- cached_tokens semantics differ slightly by provider (which tokens count as "cached"). The ratio is meaningful but not strictly apples-to-apples across providers.

## Cite as

**APA.** Requesty (2026). Prompt-cache hit rate per provider, April 2026. Requesty Data. https://requesty.ai/data/cache-hit-rate-by-provider-april-2026

```bibtex
@misc{requesty_cache_hit_rate_by_provider_april_2026,
  author = {{Requesty}},
  title = {Prompt-cache hit rate per provider, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/cache-hit-rate-by-provider-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/cache-hit-rate-by-provider-april-2026/data.json) · [CSV](https://requesty.ai/data/cache-hit-rate-by-provider-april-2026/data.csv) · [Markdown](https://requesty.ai/data/cache-hit-rate-by-provider-april-2026/data.md)

---

## Topic: Reliability and ops

---

# Operational metrics per provider, April 2026

> How reliable is each LLM provider in production? In April 2026 the top seven providers on the Requesty gateway (xAI, DeepSeek, OpenAI, Novita, Anthropic, Vertex (Gemini), Bedrock) sat at 95-99% success rate. Azure trailed at 78%, Vertex (Claude) at 84%, Mistral at 86%, and Moonshot at 6%, a real reliability outlier. Streaming adoption is bimodal too: Azure 68%, Anthropic 57%, everyone else under 30%.

*Topic: Reliability and ops. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/operational-metrics-by-provider-april-2026.*

## Why it matters

Provider success rate translates directly into user-visible failures unless an application has a managed fallback chain. The 95-99% top tier is comfortably reliable; Azure visibly failing roughly 1 in 5 calls and Vertex (Claude) roughly 1 in 6 demands either a routing policy or active provider switching at the application layer to avoid sustained user pain.

## Questions this answers

- Which LLM provider is most reliable in production?
- What is the success rate of OpenAI vs Anthropic vs Vertex?
- Why do some LLM providers fail more often than others?
- How widely is streaming adopted across LLM providers?

## Key findings

1. Success is bimodal: top tier at 95 to 99%, then Mistral 86%, Vertex (Claude) 84%, Azure 78%, and Moonshot at 6%.
2. Streaming adoption is bimodal: Azure 68% and Anthropic 57% at the top, Vertex (Claude) at 28%, everyone else under 10%.
3. Cache hit rate ranges from Anthropic-direct 77% to Vertex (Claude) 24% (same model family, 3× spread).
## Data

| Provider | Success rate (percent) | Streaming (percent) | Cache hit (percent) |
| --- | --- | --- | --- |
| xAI | 99.30% | 1.30% | 35.70% |
| DeepSeek | 98.30% | 2.80% | 48.30% |
| OpenAI | 98.00% | 7.20% | 36.40% |
| Novita | 97.20% | 2.30% | 31.90% |
| Anthropic | 96.00% | 56.90% | 77.50% |
| Vertex (Gemini) | 95.90% | 3.70% | 9.60% |
| Bedrock | 95.60% | 9.70% | 56.90% |
| Mistral | 86.30% | 8.00% | 4.10% |
| Vertex (Claude) | 84.40% | 27.60% | 23.50% |
| Azure | 78.00% | 68.30% | 41.00% |
| Moonshot | 6.20% | 4.80% | 88.20% |

## Caveats

- Apr 2025 success rates are anomalously low (OpenAI 54%, Anthropic 72%) and are likely under-reported because status_code wasn't being captured then. Mar to Apr 2026 success-rate comparisons are reliable; YoY success-rate deltas should be treated softly.

## Cite as

**APA.** Requesty (2026). Operational metrics per provider, April 2026. Requesty Data. https://requesty.ai/data/operational-metrics-by-provider-april-2026

```bibtex
@misc{requesty_operational_metrics_by_provider_april_2026,
  author = {{Requesty}},
  title = {Operational metrics per provider, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/operational-metrics-by-provider-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/operational-metrics-by-provider-april-2026/data.json) · [CSV](https://requesty.ai/data/operational-metrics-by-provider-april-2026/data.csv) · [Markdown](https://requesty.ai/data/operational-metrics-by-provider-april-2026/data.md)

---

# Provider error code distribution, April 2026

> Why do LLM provider requests fail? Among April 2026 requests on the Requesty gateway where the upstream provider returned a non-success response, 65.8% were 429 (rate limit), 19.4% were 400 (bad request: schema mismatches, oversized payloads), and 9.4% were 403 (forbidden). 5xx availability incidents (503, 502, 529, 500, 504, 520) summed to ~4.8%. Router- and gateway-level rejections are filtered out so the chart shows only what providers themselves emit when they fail.

*Topic: Reliability and ops. Period: Apr 2026. Last updated 2026-05-09. Permanent URL: https://requesty.ai/data/status-code-distribution-april-2026.*

## Why it matters

Provider failures are dominated by rate-limiting under agentic load, not by genuine availability incidents. That changes the right mitigation: backoff plus a managed fallback chain absorbs the ~85% of failures that are 429 + 400 without provider changes; only the ~5% 5xx tail is irreducible. Designing retries on the assumption that "providers go down" misallocates engineering effort. A mitigation sketch matched to this distribution follows the key findings.

## Questions this answers

- Why do LLM API requests fail?
- What is the most common LLM provider error code?
- How often do AI providers rate-limit requests?
- What HTTP errors return from OpenAI and Anthropic?

## Key findings

1. 429 (rate limit) is the dominant provider failure mode at 65.8%. Providers throttle agentic workloads aggressively.
2. 400 (bad request) is second at 19.4%. Schema mismatches, unsupported parameters, oversized payloads.
3. 403 (forbidden) at 9.4%. Provider-side authorization, region, or model-access denials.
4. 5xx total (503, 502, 529, 500, 504, 520) sums to ~4.8%. Real provider availability incidents are uncommon but not zero.
5. Codes that disappear under this filter (404 collapses from 29.8% to 0.2%, 402 from 17.8% to 0.07%) confirm those rejections are router-level model-not-found and billing checks, not provider failures.
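A sketch of that mitigation shape: jittered exponential backoff absorbs the dominant 429s, while a fallback route absorbs provider-specific 400 rejections and the small 5xx tail. The `primary` and `fallback` callables abstract over any client; nothing here is a documented Requesty API:

```python
# Retry logic matched to the distribution above. primary and fallback
# are assumed callables returning (status_code, body) for one attempt.
import random
import time

RETRYABLE_5XX = {500, 502, 503, 504, 520, 529}

def call_with_mitigation(primary, fallback=None, max_retries=5):
    """Backoff on 429; divert 400 and 5xx to a fallback route."""
    for attempt in range(max_retries):
        status, body = primary()
        if status < 400:
            return body
        if status == 429:
            # Dominant failure mode: jittered exponential backoff,
            # capped so a long throttle does not stall the caller.
            time.sleep(min(2 ** attempt + random.random(), 30.0))
            continue
        if fallback and (status == 400 or status in RETRYABLE_5XX):
            # Schema quirks and availability blips: another provider
            # may accept what this one rejected.
            fb_status, fb_body = fallback()
            if fb_status < 400:
                return fb_body
        raise RuntimeError(f"provider error {status}")
    raise RuntimeError("rate-limited after all retries")
```

This is the shape of logic a managed policy runs gateway-side; the sketch only shows why the 429-heavy distribution makes backoff, not failover, the first line of defence.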
## Data

| Status code | Description | Bucket | % of rejections (percent) |
| --- | --- | --- | --- |
| 429 | Too Many Requests | auth_quota | 65.83% |
| 400 | Bad Request | client_error | 19.40% |
| 403 | Forbidden | auth_quota | 9.41% |
| 503 | Service Unavailable | server_error | 2.19% |
| 502 | Bad Gateway | gateway | 1.81% |
| 529 | Site Overloaded | server_error | 0.52% |
| 422 | Unprocessable | client_error | 0.24% |
| 500 | Internal Server | server_error | 0.21% |
| 404 | Not Found | not_found | 0.21% |
| 402 | Payment Required | auth_quota | 0.07% |
| 504 | Gateway Timeout | gateway | 0.06% |
| 401 | Unauthorized | auth_quota | 0.02% |
| 520 | Cloudflare Unknown | server_error | 0.02% |
| 499 | Client Closed | client_error | 0.01% |

## Caveats

- Restricted to status_code_origin = 'provider' AND successful = false, so router- and gateway-level rejections are excluded by design.
- A failed request can have multiple retries with different status codes; each retry is counted separately.

## Cite as

**APA.** Requesty (2026). Provider error code distribution, April 2026. Requesty Data. https://requesty.ai/data/status-code-distribution-april-2026

```bibtex
@misc{requesty_status_code_distribution_april_2026,
  author = {{Requesty}},
  title = {Provider error code distribution, April 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/status-code-distribution-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/status-code-distribution-april-2026/data.json) · [CSV](https://requesty.ai/data/status-code-distribution-april-2026/data.csv) · [Markdown](https://requesty.ai/data/status-code-distribution-april-2026/data.md)

---

# Policy vs direct eventual success rate, Jan-Apr 2026

> How much does using a routing policy improve LLM reliability? In April 2026 the Requesty managed-fallback policy cohort hit 99.25% eventual success rate, vs 85.01% for users calling a single provider directly. That is a 14.2 pp lift, up from a +3.0 pp gap in January. Policy reliability held a tight 97.5-99.3% band across all four months while the direct cohort swung 12 pp; the widening is driven by direct-cohort regressions, not policy degradation.

*Topic: Reliability and ops. Period: Jan 2026 - Apr 2026. Last updated 2026-05-10. Permanent URL: https://requesty.ai/data/policy-eventual-success-trend-jan-april-2026.*

## Why it matters

A managed fallback chain absorbs upstream provider incidents that direct callers experience as user-visible failures. Over the four months of 2026 plotted here, direct callers gave up 12 percentage points of reliability while the policy cohort barely moved. Same upstream events, opposite outcomes. That is the clearest measurable case for using an LLM gateway over calling provider APIs directly. The sketch after the key findings shows how eventual success is computed from per-attempt rows.

## Questions this answers

- How reliable are LLM routing policies vs calling providers directly?
- Does using an LLM gateway actually improve reliability?
- What success rate do AI gateways deliver in 2026?
- How much do managed fallback chains improve LLM uptime?

## Key findings

1. Policy reliability widened its lead over direct from +3.0 pp in January to +14.2 pp in April.
2. April 2026: policy 99.25%, direct 85.01%. Policies eliminated 14 percentage points of failures that direct customers absorbed.
3. Policy rate has held a tight 97.5-99.3% band for four months. Direct rate swings 12 pp (97.5% in Feb to 85.0% in Apr) because direct calls have no fallback to absorb provider-side incidents.
4. The Mar-Apr widening is driven by direct-cohort regressions, not policy degradation. Policies absorbed the same upstream issues through their fallback chain.
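Eventual success, as the caveats below define it, is the request_id-level max(successful) across all attempts in the fallback chain. It reduces to a small fold over per-attempt rows; the field names are illustrative:

```python
# A request "eventually succeeds" if any attempt in its fallback chain
# succeeded. Field names are illustrative assumptions; the real rows
# live in the note's data.csv.
from collections import defaultdict

attempts = [
    {"request_id": "a", "successful": False},  # primary provider 429s
    {"request_id": "a", "successful": True},   # fallback absorbs it
    {"request_id": "b", "successful": True},
    {"request_id": "c", "successful": False},  # no fallback: user-visible failure
]

by_request = defaultdict(bool)  # defaults to False
for row in attempts:
    by_request[row["request_id"]] |= row["successful"]  # max(successful)

rate = sum(by_request.values()) / len(by_request)
print(f"eventual success: {rate:.2%}")  # 66.67% (2 of 3 requests)
```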
## Data

| Month | Policy - eventual success (percent) | Direct - eventual success (percent) |
| --- | --- | --- |
| January | 97.50% | 94.50% |
| February | 98.72% | 97.47% |
| March | 98.55% | 86.72% |
| April | 99.25% | 85.01% |

## Caveats

- Eventual success is computed at the request_id level: max(successful) across all attempts in the fallback chain.
- The direct cohort includes every non-policy provider_requested value. Volume is large enough that the headline rates are stable.

## Cite as

**APA.** Requesty (2026). Policy vs direct eventual success rate, Jan-Apr 2026. Requesty Data. https://requesty.ai/data/policy-eventual-success-trend-jan-april-2026

```bibtex
@misc{requesty_policy_eventual_success_trend_jan_april_2026,
  author = {{Requesty}},
  title = {Policy vs direct eventual success rate, Jan-Apr 2026},
  year = {2026},
  howpublished = {\url{https://requesty.ai/data/policy-eventual-success-trend-jan-april-2026}},
  note = {Requesty Data}
}
```

Downloads: [JSON](https://requesty.ai/data/policy-eventual-success-trend-jan-april-2026/data.json) · [CSV](https://requesty.ai/data/policy-eventual-success-trend-jan-april-2026/data.csv) · [Markdown](https://requesty.ai/data/policy-eventual-success-trend-jan-april-2026/data.md)