Requesty
MAY '26 · OBSERVABILITY / ROUTING · 7 MIN READ

What the gateway saw in April 2026: agents live on Anthropic, open-source models got fast, and the latency gap is 14×

Thibault Jaigu
CEO & Co-Founder

A gateway sees what application code never does: the same prompt fanning out to a dozen different upstreams, the same model behind three different endpoints, the same workload pattern recurring across thousands of teams. Once a quarter we sit down with the production telemetry and ask "what is the data actually showing about how teams ship in 2026?". This post is the April 2026 read.

We are sticking to per-provider operational metrics (latency, success rate, finish_reason mix, streaming, prompt caching, reasoning tokens) and avoiding any cuts that imply absolute volume or share-of-platform. Every chart below answers a question of the form "what does this provider look like to a caller?", not "how much of the platform does it represent".

1. Anthropic-direct is the agentic provider on Requesty

The single sharpest finding in the April data is that Anthropic-direct does not look like any other provider. Roughly 52% of all successful Anthropic-direct completions in April 2026 finished with finish_reason = tool_calls, meaning the model called a tool rather than returning text. That is about 2× the next-highest provider (OpenAI's new responses endpoint at 26%, Azure at 23%) and roughly 17× the share on OpenAI's legacy chat-completions route (3%).

finish_reason mix per provider, April 2026

Each bar is normalized to 100%.

Anthropic-direct sits at 52% tool_calls, the highest agentic share on the platform and 2× the next provider. finish_reason wasn't populated before 2026, so this covers April 2026 only. Source: Requesty production gateway.

A few things worth pulling out of that chart:

  • Anthropic-direct ≠ Bedrock Claude. Same model family, completely different workload. Bedrock Claude finishes 88% stop and 7% tool_calls. The two routes are serving two different cohorts: agentic / IDE-integrated traffic on Anthropic-direct, and more conversational / batched traffic on Bedrock.
  • OpenAI's responses endpoint is meaningfully more agentic than chat-completions. 26% tool_calls vs 3% on the legacy route. If you've migrated to responses and not seen this in your own logs yet, this is the canary.
  • The "blank/error" segment is real signal. NULL finish_reason correlates very tightly with successful = false. Moonshot at 94% blank and Google direct at 58% blank are both reliability outliers. It is not a labeling artefact; those routes are genuinely failing on a high fraction of calls. We discuss this further in the success-rate panel below.
  • Vertex Claude lands between Anthropic-direct and Bedrock at 14% tool_calls. Customers seem to use Vertex Claude as a "production hardened Claude with EU residency" route rather than as their agent harness.

If you are picking a route to point an autonomous agent harness at, this chart is the answer: Anthropic-direct, with Vertex Claude (for residency reasons) and openai-responses as the second-tier choices.
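As a sanity check against your own logs, the mix above takes a few lines of pandas to reproduce. This is a sketch against a hypothetical log schema; the column names provider, finish_reason and successful are illustrative, not the gateway's actual export format:

```python
import pandas as pd

# Illustrative gateway log: one row per completion. Column names are
# assumptions, not Requesty's real export schema.
logs = pd.DataFrame({
    "provider": ["anthropic", "anthropic", "openai", "openai-responses"],
    "finish_reason": ["tool_calls", "stop", "stop", "tool_calls"],
    "successful": [True, True, True, True],
})

# Normalize within each provider over *successful* completions only,
# mirroring the chart's "each bar is 100%" treatment.
ok = logs[logs["successful"]]
mix = (
    ok.groupby("provider")["finish_reason"]
      .value_counts(normalize=True)
      .rename("share")
      .reset_index()
)
agentic = mix[mix["finish_reason"] == "tool_calls"]
```

On real traffic, a route whose tool_calls share trends toward the 52% Anthropic-direct figure is carrying agentic workloads, whatever its label says.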

2. The latency leaderboard, April 2026

p50 total latency among the top providers spans almost 15× from fastest to slowest.

Latency leaderboard, April 2026 (top 10 providers by volume)

Median time from request to last token, successful requests only. Latency is only counted on successful requests; TTFT only on streamed-and-successful requests.

Two clusters stand out:

  • The "fast tier": xAI (0.6 s p50), Novita (0.8 s), Azure (1.0 s), Mistral (1.4 s). Mostly lighter / non-reasoning workloads, mostly small or distilled models, mostly answers in under a second.
  • The "frontier-Claude tier": Vertex (5.1 s), Anthropic (5.8 s), Bedrock (2.8 s), DeepSeek (9.1 s). Heavier models, longer outputs, much more variance: Anthropic's p95 is 53.9 s and DeepSeek's is 77.6 s. If you are running these as part of an interactive UX, you cannot afford to not stream.

The third panel, p50 time-to-first-token, streamed only, tells a different story:

  • Azure is the streaming-UX winner at 0.6 s TTFT, despite a 1 s total p50. First-token-fast, total-fast.
  • xAI is fast on completions but slow to first token (3.25 s TTFT, 0.6 s total). That is consistent with non-streaming behaviour or a buffered upstream: the model produces the whole answer quickly but doesn't start emitting tokens until late.
  • Vertex's TTFT (1.37 s) is genuinely competitive with the fast tier, even though its total p50 is 5.1 s. If you are picking a Claude route for an interactive product, Vertex starts faster than Anthropic-direct (2.13 s TTFT). Useful if the UX is "first token visible" rather than "answer complete".

Note: TTFT (first_token_latency_ns) only started being recorded on the gateway in 2026. We do not have it for 2025 traffic.
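The filtering rules in that note matter when reproducing these numbers. A minimal sketch of the quantile computation, again over an assumed log schema (total_s, ttft_s, streamed and successful are illustrative names):

```python
import pandas as pd

# Assumed per-request log; durations in seconds. ttft_s is NaN when
# the request was not streamed.
df = pd.DataFrame({
    "provider":   ["azure", "azure", "xai", "xai"],
    "total_s":    [1.0, 0.9, 0.6, 12.0],
    "ttft_s":     [0.6, 0.5, None, None],
    "streamed":   [True, True, False, False],
    "successful": [True, True, True, False],
})

# Total-latency quantiles: successful requests only.
ok = df[df["successful"]]
p50 = ok.groupby("provider")["total_s"].quantile(0.50)
p95 = ok.groupby("provider")["total_s"].quantile(0.95)

# TTFT quantiles: streamed-and-successful requests only.
streamed_ok = df[df["streamed"] & df["successful"]]
ttft_p50 = streamed_ok.groupby("provider")["ttft_s"].quantile(0.50)
```

Skipping the successful filter drags failed, fast-erroring requests into the quantiles and makes an unreliable route look faster than it is.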

3. Open-source aggregator routes got dramatically faster YoY

Comparing Apr 2025 and Apr 2026 for the providers we have a year of data for, the cleanest pattern in the dataset is that the cheap-inference / OSS-aggregator tier is no longer slow:

p50 latency YoY: April 2025 to April 2026

Same provider tag, ≥50k requests in both months. Lower bars = faster requests.

Open-source aggregator routes (xAI, DeepInfra, Alibaba, Novita, Nebius) compressed by 89-93%. Frontier providers were already fast and barely moved. The `successful` flag semantics likely changed between 2025 and 2026; the latency YoY is robust because it is quantiles over wall-clock duration and not affected by that.
Provider        Apr 2025 p50    Apr 2026 p50    YoY
xAI             9.1 s           0.6 s           -93%
DeepInfra       15.8 s          1.4 s           -91%
Alibaba         5.8 s           0.5 s           -91%
Novita          8.8 s           0.8 s           -91%
Nebius          22.1 s          2.3 s           -89%
DeepSeek        24.3 s          9.1 s           -63%
Google direct   5.2 s           3.3 s           -37%
Vertex          5.9 s           5.1 s           -14%
OpenAI          2.6 s           2.4 s           -9%
Anthropic       6.0 s           5.8 s           -3%

The OSS aggregator routes (xAI, DeepInfra, Alibaba, Novita, Nebius) used to be the slowest tier and are now the fastest. Most of them compressed by 89-93% in a single year. The frontier-provider tier (OpenAI, Anthropic, Vertex) was already fast and barely moved, which is consistent with frontier latency being throughput-bound (model size, decode steps) rather than infrastructure-bound. The middle of the pack is where infrastructure investment shows up.

Practical implication: the latency case for routing easy work to a cheap OSS path is much stronger in 2026 than it was in 2025. A year ago you'd pay 5-25 seconds for that hop. Today you pay sub-second.
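What "routing easy work to a cheap OSS path" can look like in practice, as a deliberately naive sketch. The route tags and the length threshold are illustrative, not a Requesty feature:

```python
# Naive cost-aware router sketch. Route tags and the threshold are
# illustrative; real routing would also weigh price, context length
# and observed p50s like the table above.
EASY_MAX_CHARS = 2000

def pick_route(prompt: str, needs_tools: bool) -> str:
    # Short, tool-free prompts go to a sub-second OSS route (the 2026
    # latencies make this hop nearly free); everything else stays on
    # the frontier/agentic tier.
    if not needs_tools and len(prompt) <= EASY_MAX_CHARS:
        return "novita"
    return "anthropic"
```

In 2025 the same split would have added 5-25 seconds of latency to the easy path; in 2026 it adds well under a second.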

4. Operational metrics: success rate, streaming, caching, BYOK

Same providers, four different per-provider metrics. Each bar is "% of that provider's own traffic". None of these are share-of-platform.

Operational metrics per provider, April 2026

Share of requests that completed without an upstream error. Moonshot at 6% is a real reliability outlier. Source: Requesty production gateway, April 2026.

Things to call out:

  • Success rate is bimodal. The big six (OpenAI, Anthropic, Vertex, Bedrock, DeepSeek, Novita) sit between 95% and 99%. Azure is the underperformer among the well-known providers at 78%. Mistral is at 86%. Moonshot is at 6.2%, a genuine, customer-visible reliability problem on that route. If you are routing to Moonshot in production, route around it.
  • Streaming adoption is sharply bimodal too. Azure (68%) and Anthropic (57%) are streaming-heavy. Everyone else is below 10%. Streaming on Anthropic tracks with the agentic IDE workloads in section 1; the Azure number tracks with chat-style enterprise apps that haven't migrated to non-streaming responses-style endpoints yet.
  • BYOK is asymmetric. 18% of OpenAI-direct traffic is "bring-your-own-key", but only 3% of Anthropic-direct, 0% of Vertex / Bedrock / Azure. Customers BYOK on the most commodity-priced API and pay through the gateway on the strategic ones, which is exactly the pattern you'd predict.
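"Route around it" can be as simple as an ordered fallback chain. A sketch, where call_provider is a stand-in for whatever client you actually use, not a real API:

```python
# Ordered fallback sketch: try reliable routes first, fall through on
# upstream errors. `call_provider` is a placeholder, not a real client.
ROUTES = ["openai", "vertex", "bedrock"]  # Moonshot deliberately absent

def complete_with_fallback(prompt, call_provider):
    last_err = None
    for route in ROUTES:
        try:
            return call_provider(route, prompt)
        except RuntimeError as err:  # stand-in for an upstream error type
            last_err = err
    # Every route failed: surface the last upstream error to the caller.
    raise last_err
```

Ordering the chain by observed April success rate means a 6%-success route never strands a request that a 99%-success route could have served.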

5. Prompt caching is the single biggest cost lever

Cache hit rate (cached_tokens / input_tokens) is the operational lever that separates "we ship to production" from "we have a credit problem". The April distribution by provider:

Cache hit rate per provider, April 2026

cached_tokens divided by input_tokens. Higher is cheaper and faster.

Anthropic-direct (77%) is the cache-hit leader. Vertex Claude (14%) is the surprise: same model family, ~5× lower cache hit, almost certainly a configuration gap. Moonshot's ~88% reading is a measurement artefact: cached_tokens still records on partial streams that the gateway later marks as failed at a 6% success rate.

  • Anthropic-direct at 77% cache hit is the best on the platform, by a wide margin. Combined with the 52% tool_calls share, the picture is clear: agentic workloads on Claude are highly repetitive (long shared system prompt, similar context per turn) and the prompt cache is doing exactly what it was designed to do.
  • Bedrock Claude at 57%, OpenAI at 36%, DeepSeek at 48%, healthy. These are all in the range where prompt caching is meaningfully reducing token spend.
  • Mistral at 4% is roughly the floor. Prompt caching is not currently a meaningful lever on that route.
  • We are showing the Moonshot bar at the top of section 4 only, not here, because its cache-hit reading (~88%) is a measurement artefact. The upstream records cached_tokens on partial streams that the gateway then marks as failed (success rate 6%), which inflates the cache-hit reading. Don't quote that number.
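The artefact in that last bullet is worth guarding against in your own accounting: filter failures out before dividing. A sketch over an assumed token-accounting schema (column names are illustrative):

```python
import pandas as pd

# Assumed per-request token accounting; column names are illustrative.
df = pd.DataFrame({
    "provider":      ["anthropic", "anthropic", "moonshot"],
    "input_tokens":  [1000, 1000, 500],
    "cached_tokens": [800, 740, 440],
    "successful":    [True, True, False],
})

# Drop failed requests *first*, so partial streams that still recorded
# cached_tokens cannot inflate the hit rate.
ok = df[df["successful"]]
sums = ok.groupby("provider")[["cached_tokens", "input_tokens"]].sum()
hit_rate = sums["cached_tokens"] / sums["input_tokens"]
```

In this toy frame the failed Moonshot row vanishes from the denominator entirely; on real traffic the corrected figure would come from whatever successful requests remain.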

6. Reasoning is real, but concentrated in a few providers

In April 2026, the share of each provider's own output that is reasoning varies enormously:

Reasoning-token share of provider output, April 2026

reasoning_tokens / output_tokens within each provider. Pure reasoning routes versus mixed routes.

High-share providers (Groq, Coding, Google, xAI) almost exclusively serve thinking models. Frontier providers are around a third reasoning. Anthropic / Bedrock / Mistral / Moonshot are at zero: either no thinking models routed, or thinking content surfaced differently. Providers with negligible reasoning output in April are omitted (ratio too noisy).

  • Groq, Coding, Google direct, xAI, zai are 50-82% reasoning. Routes that primarily serve a small set of reasoning-heavy models (Gemini 3 thinking variants, Grok thinking, GLM thinking, etc.). Almost everything they emit is in the reasoning / chain-of-thought stream.
  • Vertex and OpenAI are ~36% reasoning. A meaningful and growing share, mostly Gemini 2.5 Flash / 3.x previews on Vertex and the GPT-5 family on OpenAI.
  • Azure is at 18%. The lower end of the frontier group, consistent with Azure customers leaning on GPT-4.1-class models more than the latest reasoning checkpoints.
  • Anthropic, Bedrock, Mistral, Moonshot are at 0%. Anthropic does not report reasoning tokens separately: extended thinking output is delivered inline. Mistral and Moonshot have no reasoning models routed through the gateway in this period.

The headline narrative in the broader industry is "everything is reasoning now". That is not what the data says. Reasoning is concentrated in a specific subset of providers and models, and even on the providers that emit it, the absolute volume is dwarfed by regular completion output. The interesting workload dimension is not "is this a reasoning model"; it is "is this an agent" (section 1).
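The ratio itself is trivial to compute; the care is in the caveats above (inline thinking reads as zero, low-volume routes are too noisy to trust). A sketch with assumed column names and made-up totals:

```python
import pandas as pd

# Assumed per-provider token totals for the month (illustrative numbers).
totals = pd.DataFrame({
    "provider":         ["groq", "vertex", "anthropic", "tiny-route"],
    "reasoning_tokens": [820, 360, 0, 9],
    "output_tokens":    [1000, 1000, 1000, 10],
}).set_index("provider")

share = totals["reasoning_tokens"] / totals["output_tokens"]

# Drop negligible-volume providers: a 9/10 ratio is noise, not signal.
MIN_OUTPUT_TOKENS = 100
share = share[totals["output_tokens"] >= MIN_OUTPUT_TOKENS]
```

Note that Anthropic's zero here means "not reported separately", not "no reasoning": extended thinking arrives inline in the completion stream.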

Related reading