
Agentic routing, benchmarked: Requesty adds 16ms of overhead, OpenRouter adds 55ms

Thibault Jaigu
CEO & Co-Founder

Agentic routing is the decision layer inside a multi-agent LLM system that classifies each input and directs it to the right downstream specialist — a tool, sub-agent, model, or human. It is pattern #2 in Anthropic's five canonical workflow patterns for effective AI agents, and — per the 2026 benchmarks below — the single highest-leverage decision you can make about LLM cost.

This post is the short version: what agentic routing is, what it costs, what it saves, how the gateways compare on overhead, and where it breaks. Every claim is linked to its source.

The 150-token answer

An agentic router classifies a request, picks a model (or sub-agent), and handles what happens on failure. Most implementations combine four techniques in one policy: LLM-based classification, embedding-based similarity, rule-based if/else, and a small ML classifier. Production gateways expose these as declarative policies you reference by name, not code. The point is cost: route easy work to cheap models, reserve frontier models for hard work, and let the 80/20 split do the math. Benchmarks (below) show ~50% cost cuts at ~98% quality retention. Gateway overhead varies from ~16ms (Requesty) to ~124ms (LiteLLM).
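
To see why the split works, here is a back-of-envelope sketch. The traffic share and price ratio are illustrative assumptions (using the 10–15× price gap cited further down), not benchmark figures.

Python
# Back-of-envelope for the 80/20 split. Shares and the price ratio are
# illustrative assumptions, not benchmark figures.
frontier_price = 1.00              # relative cost per request on a frontier model
cheap_price = frontier_price / 12  # roughly the 10-15x gap cited below

easy_share = 0.80                  # fraction of traffic a cheap model can handle
blended = easy_share * cheap_price + (1 - easy_share) * frontier_price
print(f"{1 - blended:.0%} cheaper than frontier-only")
# -> 73% cheaper on paper; real benchmarks land lower (~50%) because the
#    hard 20% of requests tends to consume a disproportionate share of tokens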

Overhead: the gateway benchmark

Gateway                 Overhead per request    Notes
Requesty                ~16 ms                  Hosted, includes fallback, LB, caching, and observability
OpenRouter (managed)    ~55 ms                  Hosted, comparable feature footprint to Requesty
LiteLLM (Python)        ~124 ms                 Self-hosted, narrow feature set

Requesty routes ~3.4× faster than OpenRouter and ~7.8× faster than LiteLLM on the same class of workload.

Why the gap? Two reasons. First, Requesty's hot path is written to the OpenAI shape end-to-end, so there's no translation layer on the majority of requests. Second, policy evaluation is precompiled — when you call model: "policy/prod", the router doesn't re-parse the policy on every request.

What routing actually saves

The reason teams adopt a router at all is cost. Four independent 2026 data points:

  • Orq.ai's Auto Router benchmarks (Feb 2026): ~98% of quality retained at roughly 50% of the cost.
  • RouteLLM benchmarks: 30–80% savings depending on workload.
  • IBM: up to 85% inference cost reduction by diverting queries to smaller models.
  • Gateway-level prompt caching on Anthropic and Gemini: up to 90% lower input-token cost on cache hits.

One finding is worth flagging because it's counterintuitive. A public audit of LangGraph's default patterns scored token efficiency at 39/100. The single biggest issue: binary classification calls (~50–100 tokens in, one word out) were running on the same frontier model used for synthesis — costing 10–15× more than running them on Haiku- or Flash-class models. (DEV Community audit, Mar 2026)

That single fix — classify with a cheap model, synthesise with a capable one — is what a routing gateway makes a config change instead of a code change.
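
In application code, the fix looks roughly like this. A minimal sketch: the model identifiers are placeholders, and with a gateway the same split becomes a policy name in the model field rather than hard-coded IDs.

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key",
)

def classify(ticket: str) -> str:
    # Binary classification: ~50-100 tokens in, one word out.
    # Runs on a Haiku/Flash-class model instead of the synthesis model.
    r = client.chat.completions.create(
        model="cheap-classifier-model",  # placeholder identifier
        messages=[{"role": "user",
                   "content": f"Answer with exactly one word, BUG or FEATURE:\n\n{ticket}"}],
        max_tokens=5,
    )
    return r.choices[0].message.content.strip().upper()

def synthesize(ticket: str) -> str:
    label = classify(ticket)  # the cheap call decides the framing
    # Only the synthesis step pays frontier-model prices.
    r = client.chat.completions.create(
        model="frontier-synthesis-model",  # placeholder identifier
        messages=[{"role": "user",
                   "content": f"[{label}] Draft a response:\n\n{ticket}"}],
    )
    return r.choices[0].message.content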

The four routing techniques

Per the NivaLabs taxonomy, production routing composes from four primitives:

Technique           How it decides                            Latency    Flexibility
LLM-based           Ask a model to classify                   High       High
Embedding-based     Vector similarity to categories           Medium     High
Rule-based          Deterministic if/else on metadata         Zero       Low
ML classifier       Small dedicated model trained offline     Low        Medium

Most mature systems layer these. A rule-based filter catches obvious cases first (free plan → cheap model, region=EU → EU model), falls through to an embedding or classifier for fuzzy routing, and only reaches for an LLM-based router when the other three can't decide.
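
A minimal sketch of that layering. The thresholds, category centroids, and model names are illustrative assumptions, and the embedding and LLM calls are stubbed.

Python
import math

# Illustrative category centroids and model map; in practice these come from
# an offline embedding pass over labelled examples.
CATEGORY_CENTROIDS = {"billing": [0.9, 0.1, 0.0], "technical": [0.1, 0.9, 0.1]}
MODEL_FOR_CATEGORY = {"billing": "cheap-model", "technical": "mid-tier-model"}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def embed(text):
    raise NotImplementedError  # call an embedding model here

def llm_classify(text):
    raise NotImplementedError  # ask a small LLM to pick a route here

def pick_model(request: dict) -> str:
    # 1. Rule-based: deterministic metadata checks, effectively zero latency.
    if request.get("plan") == "free":
        return "cheap-model"
    if request.get("region") == "EU":
        return "eu-model"
    # 2. Embedding-based: vector similarity to known categories.
    vec = embed(request["text"])
    best_cat, best_score = max(
        ((cat, cosine(vec, c)) for cat, c in CATEGORY_CENTROIDS.items()),
        key=lambda kv: kv[1],
    )
    if best_score >= 0.80:  # confident enough to route directly
        return MODEL_FOR_CATEGORY[best_cat]
    # 3. LLM-based: only the ambiguous remainder pays for a classification call.
    return llm_classify(request["text"])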

Routing in the bigger picture — Anthropic's five patterns

Anthropic's December 2024 essay Building Effective Agents established the canonical taxonomy the industry has standardised on. Routing is pattern #2 of five — and most production agent systems compose several of these together.

#   Pattern                 What it does
1   Prompt Chaining         Sequential LLM calls where each step processes the previous output
2   Routing                 Classifies input and directs it to a specialised follow-up task
3   Parallelisation         Runs LLMs concurrently (sectioning or voting)
4   Orchestrator–Workers    A central LLM dynamically delegates subtasks to workers
5   Evaluator–Optimizer     One LLM generates, another critiques and iterates

The practical implication: routing usually sits inside one of the other patterns. A prompt chain routes between steps. An orchestrator routes subtasks to specialist workers. An evaluator-optimiser routes requests back to a critic agent. That's why gateway-layer routing is the most reusable surface — one policy primitive serves all five.

How Requesty composes this in production

In practice you declare a policy once and reference it as policy/<name> in the model field. No application code changes when you tweak the strategy.

Python
from openai import OpenAI
 
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key",
)
 
response = client.chat.completions.create(
    model="policy/eu-claude-resilient",  # fallback + region + caching
    messages=[{"role": "user", "content": "Summarise this."}],
    extra_body={"requesty": {"auto_cache": True}},
)

That single call composes a fallback chain (fallback policy docs), weighted load balancing (LB policy docs), latency-aware selection (latency routing docs), prompt caching, and EU data residency — all evaluated in roughly 16ms on top of the provider call.

Policies compose. A latency policy can sit inside a fallback chain. A load-balancing policy can split traffic between two fallback chains to canary a new strategy. That composition is the whole point of a policy-based router: you stop writing retry and failover logic in application code.

EU data residency is a routing primitive, not an afterthought

For GDPR-bound workloads, swap the base URL to https://router.eu.requesty.ai/v1 and every router-side operation — request processing, logging, caching, analytics — stays inside AWS eu-central-1 (Frankfurt). Combined with the Model Library's region filter, you can restrict inference to EU-region models only (Bedrock @eu-central-1 / @eu-west-1 / @eu-north-1, Vertex @europe-west1, Azure @francecentral, Mistral). Non-approved models are rejected by design, not by convention. See the EU routing docs for the full region list.
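
A minimal sketch of the swap, reusing the policy name from the earlier example:

Python
from openai import OpenAI

# EU-resident routing: same OpenAI-compatible client, different base URL.
# Request processing, logging, caching, and analytics stay in eu-central-1.
eu_client = OpenAI(
    base_url="https://router.eu.requesty.ai/v1",
    api_key="your-requesty-api-key",
)

response = eu_client.chat.completions.create(
    model="policy/eu-claude-resilient",  # policy restricted to EU-region models
    messages=[{"role": "user", "content": "Summarise this."}],
)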

That turns "are we GDPR-compliant?" from a 40-page architecture review into a config flag.

Where routing breaks

Three failure modes, ranked by impact:

  1. Misclassified input cascade. Wrong agent gets the request, consumes tokens, produces a plausible-looking wrong answer, hands it to the next step. Patronus documents the compounding-hallucination case (AI Agent Routing, Patronus). Mitigation: small specialist classifiers with high precision, not free-form LLM routers.
  2. Hallucinated route names. LLM returns a route that doesn't exist. LivePerson has an explicit fallback in their router for exactly this — "if a route is made up... the fallback flow for failures begins" (LivePerson docs). Mitigation: constrained generation or a validator that rejects unknown labels (see the sketch below).
  3. Provider outage confused with routing error. Your app sees a 500 and doesn't know whether the router misrouted or the model failed. Mitigation: fallback policies handle provider errors transparently so the only failure your app sees is a real routing problem.

One more data point on the hallucination failure modes: in 2025 the Financial Times reported agent hallucinations causing misrouted customer-service escalations at large banks. AgilePoint's enterprise pilots reduced hallucinations by 40% by layering business rules on top of RAG grounding. (AgilePoint / Medium)
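
A minimal sketch of the route-name validator from failure mode #2, assuming a fixed route registry and a default fallback flow (the route names are illustrative):

Python
VALID_ROUTES = {"billing", "technical", "account", "human_handoff"}  # illustrative
FALLBACK_ROUTE = "human_handoff"

def validate_route(raw_label: str) -> str:
    label = raw_label.strip().lower()
    if label in VALID_ROUTES:
        return label
    # The router LLM made up a route name: trigger the fallback flow
    # instead of dispatching to an agent that does not exist.
    return FALLBACK_ROUTE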

The market context

For teams deciding whether to invest in routing now vs later:

  • Global agentic AI market 2026: $9.14B, projected $139.19B by 2034 at 40.5% CAGR (Fortune Business Insights)
  • 51% of enterprises run AI agents in production; another 23% are actively scaling (Ringly 2026 summary)
  • 70,000+ developers route through Requesty today, processing 90+ billion tokens per day across 300+ models (Requesty Quickstart)
  • The adoption-vs-production gap: "Almost four in five enterprises have adopted AI agents in some form, yet only one in nine runs them in production" — a 68-percentage-point gap, the largest deployment backlog in enterprise tech history (Digital Applied, Mar 2026)
  • Over 40% of agentic AI projects are at risk of cancellation by 2027 without governance, observability, and ROI clarity (Gartner, via Salesmate)
  • The Model Context Protocol (MCP) reached 97M downloads with 1,000+ servers in its ecosystem within months of release — now the de facto agent-interoperability standard (Digital Applied). Gateway-level MCP management (auth, tool whitelisting, per-server analytics) is emerging as a control point — see the Requesty MCP Gateway.

The last three stats are the interesting cluster. 80% adopted, 11% in production, 40% of those at risk of being cancelled — and MCP is eating the interop layer underneath all of it. Routing is a governance and observability surface, not just a cost-optimiser. It's the single choke point where every LLM call and every MCP tool call in your org is classified, tagged, logged, and budgeted. Teams that set up a gateway early ship agents. Teams that skip it are in the 40%.

The one-line summary

Agentic routing is the cheapest performance and governance win in modern LLM stacks. A gateway that adds 16ms of overhead can return 50–85% of inference spend, prevent misrouting cascades, and give you one place to control cost, compliance, and failover. The only wrong answer is not having one.

Frequently asked questions

What is agentic routing?
Agentic routing is the decision layer inside a multi-agent or compound LLM system that classifies each input and directs it to the most appropriate downstream specialist — a tool, a sub-agent, a different model, or a human. It is pattern #2 in Anthropic's five canonical workflow patterns for effective AI agents.
How much overhead does Requesty add compared to calling an LLM provider directly?
Requesty adds roughly 16ms of overhead per request in production measurements, versus ~55ms for OpenRouter and ~124ms for self-hosted LiteLLM. That latency buys you failover retries, load balancing, caching, observability, and multi-region routing without touching application code.
How much can routing reduce LLM costs?
Well-tuned routing retains ~98% of quality at roughly 50% of the cost according to Orq.ai's Auto Router benchmarks (Feb 2026). RouteLLM benchmarks report 30–80% savings depending on workload. IBM estimates up to 85% inference cost reduction by diverting queries to smaller models. Gateway-level prompt caching on Anthropic and Gemini can reduce input-token cost by up to 90% on cache hits.
What are the main routing techniques?
Four mainstream techniques: LLM-based (ask a model to classify), embedding-based (vector similarity to categories), rule-based (deterministic if/else), and ML-classifier (small dedicated model trained for routing). In production gateways these compose into declarative routing policies — fallback chains, weighted load balancing, and latency-based selection.
What is the biggest failure mode in agentic routing?
A single misclassification cascades: the wrong agent consumes tokens, produces a wrong output, and hands that to the next step. The gateway mitigation is to decouple misrouting (classification error) from outage (provider error) via explicit fallback policies so only one kind of routing failure reaches the application.
How does routing fit with Anthropic's five agent workflow patterns?
Routing is pattern #2 in Anthropic's taxonomy (Prompt Chaining, Routing, Parallelisation, Orchestrator-Workers, Evaluator-Optimizer). In practice routing sits inside the other patterns — chains route between steps, orchestrators route subtasks, evaluators route to critics. That's why a gateway-layer policy primitive is reusable across all five.