
Routing policies 101: fallback, load balancing, and latency in production

Thibault Jaigu
CEO & Co-Founder

A routing policy is a named, reusable rule that tells an LLM gateway how to pick a model, how to handle failures, and how to split traffic — accessed by name in your app code (model="policy/prod") so strategy changes never require a redeploy. Requesty's gateway exposes three policy types: fallback for resilience, load balancing for weighted splits, and latency for speed optimisation. This post covers what each one does, when to use it, and how they compose.

Most production setups use all three in one stack.

The three policy types, at a glance

Policy         | Picks by                            | Best for                           | Docs
---------------|-------------------------------------|------------------------------------|------------------------
Fallback       | Order + failure detection           | Resilience, cost cascades          | fallback-policies
Load balancing | Weighted split (deterministic hash) | A/B tests, canaries, multi-vendor  | load-balancing-policies
Latency        | Rolling-window speed measurement    | Latency-sensitive traffic          | latency-routing

1. Fallback — resilience and cost cascades

Fallback is the one you implement first. The primary model is tried first; on a timeout, rate limit (429), or 5xx response, the router moves to the next model in the chain. Each step can have 0–10 retries with exponential backoff (500ms → 1s → 2s → 4s, ±10% jitter) before the router gives up and escalates.
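
A minimal sketch of that schedule (the function name and defaults are illustrative, not Requesty's actual code):

Python
import random

def backoff_delays(retries: int, base: float = 0.5, cap: float = 4.0, jitter: float = 0.10):
    """Yield exponential backoff delays: 500ms -> 1s -> 2s -> 4s, each with ±10% jitter."""
    for attempt in range(retries):
        delay = min(base * (2 ** attempt), cap)
        yield delay * random.uniform(1.0 - jitter, 1.0 + jitter)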

Two canonical uses:

  • Multi-provider redundancy. Claude on Bedrock → Claude on Vertex → Claude direct from Anthropic. One provider goes down, you never see the outage.
  • Cost-tier cascade. Haiku → Sonnet → Opus. Cheap model handles 90% of traffic; only the hard 10% escalates.

A subtle detail worth naming: you only pay for the call that actually returns tokens. Retried-and-failed attempts incur zero token cost — so a wide fallback chain isn't expensive, it's defensive.

More on the retry schedule: Designing fallback retries — why Requesty uses 500ms → 4s with jitter.

2. Load balancing — weighted splits, done right

Load balancing sends traffic across models by weight (70/20/10). The gateway's job is to split deterministically: Requesty uses xxhash on the request's trace_id so the same user reliably hits the same model on every turn. That preserves multi-turn conversation context — you don't have Claude mid-conversation suddenly become GPT-5.
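
In code, the deterministic split looks roughly like this (a toy sketch; the helper and its signature are assumptions, only the xxhash-on-trace_id detail comes from the gateway):

Python
import xxhash

def pick_target(trace_id: str, targets: list[tuple[str, int]]) -> str:
    """Hash trace_id into [0, total_weight) so the same id always maps to the same target."""
    total = sum(weight for _, weight in targets)
    bucket = xxhash.xxh64_intdigest(trace_id.encode("utf-8")) % total
    for name, weight in targets:
        if bucket < weight:
            return name
        bucket -= weight
    return targets[-1][0]

# Same trace_id, same model, every turn:
pick_target("trace-42", [("claude-sonnet", 70), ("gpt-5", 20), ("haiku", 10)])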

Three canonical uses:

  • A/B testing between models — is Sonnet or GPT-5 better on your support traffic? Split 50/50, read the metrics, decide with data.
  • Canary rollouts — 95% to the stable policy, 5% to the new one. Weights are live-editable, so you ramp as confidence grows.
  • Vendor diversification — if one provider is 30% cheaper this quarter and good enough for your workload, split 60/40 and capture the saving without locking in.

Policies compose. A load-balancing policy can split between two other policies, not just models — that's how canary rollouts of entire routing strategies work without orchestration code.

3. Latency — route to whoever's fastest right now

Latency policies measure model performance on a rolling ~1-hour window and route each request to the fastest model currently available. For streaming calls the metric is time-to-first-token; for non-streaming calls it's total response time. New or cold-start models get 5–10% of traffic to collect data, then join the ranking.
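
A toy version of the rolling-window ranking (the class, names, and eviction logic are assumptions; only the ~1-hour window and the speed metrics come from the policy description):

Python
import time
from collections import defaultdict, deque

class LatencyRanker:
    """Per-model latency samples in a rolling window, ranked by mean."""
    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.samples: dict[str, deque] = defaultdict(deque)  # model -> (timestamp, latency_s)

    def record(self, model: str, latency_s: float) -> None:
        self.samples[model].append((time.monotonic(), latency_s))

    def fastest(self) -> str | None:
        cutoff = time.monotonic() - self.window_s
        best, best_mean = None, float("inf")
        for model, q in self.samples.items():
            while q and q[0][0] < cutoff:  # evict samples older than the window
                q.popleft()
            if q:
                mean = sum(s for _, s in q) / len(q)
                if mean < best_mean:
                    best, best_mean = model, mean
        return best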

When to use it: interactive traffic where perceived speed matters more than cost savings — chat, autocomplete, reactive UIs. Not great for batch or background jobs where latency doesn't move the product metric.

When not to use it: when model quality varies significantly between the candidates. A latency policy will happily route to the fastest model even if it's worse at your task. Pair it with a quality guardrail or constrain the candidate list.

Composing the three

Here's a production setup that uses all three at once. One policy, one line of application code, three layers of behaviour:

YAML
# Policy: prod
# Type: load-balancing
# Weights:
#   90% → policy/prod-stable
#   10% → policy/prod-experimental
 
# Policy: prod-stable  (fallback)
#   ├─ policy/fastest-claude          (latency-based within Claude variants)
#   └─ anthropic/claude-sonnet-4-5    (retries: 2)
 
# Policy: fastest-claude  (latency)
#   candidates:
#     bedrock/claude-sonnet-4-5@us-east-1
#     bedrock/claude-sonnet-4-5@eu-west-1
#     vertex/claude-sonnet-4-5@europe-west1

Your application just calls:

Python
from openai import OpenAI

# Requesty is OpenAI-compatible; the router URL here is an assumption — check your dashboard.
client = OpenAI(base_url="https://router.requesty.ai/v1", api_key="<REQUESTY_API_KEY>")

client.chat.completions.create(
    model="policy/prod",
    messages=[...],
)

90% of traffic goes to the stable chain, which picks the fastest Claude region, with a direct Anthropic fallback. 10% of traffic goes to the experimental chain — which you can rewrite at any time without touching the application.

That's the whole point of policy-based routing: your retry and failover logic stops being application code. It becomes config you edit in a dashboard.

FAQ on when to pick which

  • "I want the cheapest setup that still works when a provider goes down" → fallback, Haiku → Sonnet → Opus.
  • "I want to measure whether Sonnet or GPT-5 is better for my workload" → load balancing, 50/50, read the analytics.
  • "My users complain chat feels slow" → latency, candidate list of roughly equivalent models.
  • "I'm rolling out a new system prompt and want to hedge" → load balancing between two policies, 95/5 then ramp.
  • "I'm GDPR-bound and need region-locked failover" → fallback chain of EU-region models, served from the EU endpoint.

A well-designed gateway makes every one of those a config change, not a code change. That's what policies are for.

Frequently asked questions

What is a routing policy?
A routing policy is a named, reusable rule that tells an LLM gateway how to pick a model, what to do on failure, and how to split traffic. Application code references it by name (e.g. model='policy/prod') so you can change strategy without redeploying.
When should I use a fallback policy vs load balancing?
Use fallback when you care about surviving failures: primary model first, backups in order, automatic retry with exponential backoff and jitter. Use load balancing when you want to split traffic between models by weight (e.g. 70/20/10 for A/B tests or canaries). They compose — you can load-balance between two fallback chains.
How does latency-based routing pick the fastest model?
Requesty's latency policy measures a rolling ~1-hour window of time-to-first-token for streaming calls or total response time for non-streaming calls, and routes to whichever model is fastest right now. Cold-start models get 5–10% of traffic to gather initial data, and stale entries age out so the router adapts as providers degrade or improve.
Can policies reference other policies?
Yes. A fallback policy can have another policy as one of its steps. A load-balancing policy can split between two fallback chains. That's how you build canary rollouts — 90% stable policy / 10% experimental — without writing orchestration code.
Do failed fallback attempts cost money?
No. You only pay for the model call that actually returned tokens. Retries on a model that timed out, rate-limited, or returned a 5xx are free — the router only charges for the successful response at the end of the chain.