Requesty | Routing / Best Practices | Jan '26 | 2 min read

Designing fallback retries: why Requesty uses 500ms → 4s with jitter

Thibault Jaigu
CEO & Co-Founder

When a model provider returns a 429 or times out mid-stream, the instinct is to retry immediately. Don't. Under real traffic that's how you turn a short provider blip into a self-inflicted outage: every client in your fleet hammers the recovering endpoint at the same instant, and nothing comes back.

This is what our fallback policies were designed to avoid. The retry schedule inside Requesty is boring on purpose: 500ms → 1s → 2s → 4s, plus jitter, up to ten attempts, before we give up on a model and move to the next one in the chain. Here's why each piece is shaped the way it is.

The schedule itself

Exponential backoff is the default for a reason. If a provider's issue lasts 200ms, you catch it on the first retry at 500ms. If it lasts 3 seconds, you catch it at the 4s mark. The windows grow with the problem.
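
The doubling schedule can be sketched in a few lines. This illustrates the 500ms → 1s → 2s → 4s sequence and is not Requesty's implementation: the function name `backoff_delays` and its `base` and `cap` parameters are hypothetical, and holding the delay at 4s beyond the fourth retry is an assumption.

```python
from itertools import accumulate

def backoff_delays(retries, base=0.5, cap=4.0):
    """Un-jittered wait in seconds before each retry: doubles from base, capped."""
    return [min(cap, base * 2 ** i) for i in range(retries)]

delays = backoff_delays(4)
print(delays)                     # [0.5, 1.0, 2.0, 4.0]
print(list(accumulate(delays)))   # [0.5, 1.5, 3.5, 7.5]
```

The second line is elapsed time since the first failure, ignoring jitter and the time each attempt itself takes. A 200ms blip is over before the first retry fires at 0.5s, and a 3-second one is over by the retry that fires at 3.5s.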

What matters more is what we don't do:

  • We don't retry at 50ms. That's below the latency floor of almost every LLM provider and guarantees you miss the recovery window while adding request load.
  • We don't retry forever. Ten attempts, with the delay capped at 4s, is a hard ceiling of roughly 15 seconds before the next model in the chain takes over. Users aren't waiting a minute for a spinner.
  • We don't retry non-retryable errors. Auth failures and malformed requests skip straight to the next policy rung. Retrying a 401 is just a slower 401.

Jitter is the quiet hero

A fleet of 5,000 clients following the same exponential schedule in lockstep produces a thundering herd: tightly synchronized retry waves that all land on the provider at the same instant. The provider starts to recover, gets hit by the next wave, and falls over again.

Jitter decorrelates those waves. Each request's backoff gets a small random offset, so when the provider comes back, requests arrive smeared across a window rather than in a single spike. Same total requests, flatter curve, actual recovery.
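
A small simulation makes the smearing visible. The article does not specify the jitter formula, so everything here is an assumption for illustration: the helper name `jittered_delay`, the `spread` parameter, and the choice of a uniform offset of up to ±25% of the delay.

```python
import random

def jittered_delay(delay, rng, spread=0.25):
    """Shift a backoff delay by a random offset of up to +/- spread * delay."""
    offset = rng.uniform(-spread, spread) * delay
    return delay + offset

rng = random.Random(42)  # seeded only so the demo is repeatable

# 5,000 clients all waiting out the same 2s backoff step.
lockstep = [2.0 for _ in range(5000)]
smeared = [jittered_delay(2.0, rng) for _ in range(5000)]

print(len(set(lockstep)))          # 1: every retry lands at the same instant
print(min(smeared), max(smeared))  # arrivals spread across roughly 1.5s to 2.5s
```

Without jitter the whole fleet shares a single arrival time. With it, the same 5,000 retries are spread over about a second.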

What it protects against

The triggers that flip a request onto the retry path, and what they usually mean:

  • Timeout — the provider took too long. Could be capacity pressure or a single slow instance. Usually transient.
  • Rate limit — you're over quota for a few seconds. A 1-2s retry resolves most of these.
  • Generic 5xx — provider side error, no attribution. Worth one or two retries before bailing out.
  • Non-retryable (401, 400) — don't retry. Skip to the fallback model.
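
The list above maps onto a small classifier. This is a sketch, not Requesty's code: the function name `is_retryable` and its parameters are hypothetical, and treating every status other than 429 and 5xx as non-retryable is an assumption (the article names only 401 and 400).

```python
def is_retryable(status_code=None, timed_out=False):
    """Decide whether a failed call should go onto the retry path."""
    if timed_out:
        return True   # provider took too long, usually transient
    if status_code == 429:
        return True   # rate limit, usually clears within seconds
    if status_code is not None and 500 <= status_code <= 599:
        return True   # generic provider-side error
    return False      # 400, 401 and anything else: skip to the fallback model

for code in (429, 503, 400, 401):
    print(code, is_retryable(code))
```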

If all ten attempts fail on the first model, we move on. Your request doesn't die — it just ends up at the next model in your policy, and you're only charged for the one that actually returned tokens.
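
Put together, the retry-then-fallback loop looks roughly like this. It is a client-side sketch of the idea under stated assumptions, not the code that runs inside Requesty: `ModelCallError`, `call_with_fallback`, the model names, and the jitter of up to +10% are all hypothetical.

```python
import random
import time

class ModelCallError(Exception):
    """Hypothetical error type; `retryable` is False for things like 401 or 400."""
    def __init__(self, message, retryable):
        super().__init__(message)
        self.retryable = retryable

def call_with_fallback(models, call, max_attempts=10, base=0.5, cap=4.0,
                       sleep=time.sleep, rng=None):
    """Try each model in order; retry retryable failures with capped backoff."""
    rng = rng or random.Random()
    for model in models:
        for attempt in range(max_attempts):
            try:
                return model, call(model)
            except ModelCallError as err:
                last_attempt = attempt == max_attempts - 1
                if not err.retryable or last_attempt:
                    break  # give up on this model, move to the next one
                delay = min(cap, base * 2 ** attempt)
                sleep(delay + rng.uniform(0, 0.1 * delay))  # assumed jitter
    raise RuntimeError("every model in the chain failed")

# Demo with a fake provider call and a recording sleep, so nothing really waits.
attempts = []
waits = []

def fake_call(model):
    attempts.append(model)
    if model == "primary-model":
        raise ModelCallError("simulated 503", retryable=True)
    return f"response from {model}"

served_by, result = call_with_fallback(
    ["primary-model", "backup-model"], fake_call, sleep=waits.append
)
print(served_by, len(attempts), len(waits))  # backup-model 11 9
```

The primary model gets its ten attempts with nine waits between them, then the backup answers on its first try. A non-retryable error breaks out of the inner loop immediately, so a 401 costs one attempt instead of ten.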

The shortest version

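```python
from openai import OpenAI

# Point the OpenAI SDK at the Requesty router instead of a provider directly.
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key",
)

# Address a policy rather than a raw model.
response = client.chat.completions.create(
    model="policy/sonnet",   # your policy name
    messages=[{"role": "user", "content": "Hello!"}],
)
```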

Point at a policy instead of a raw model. The retries, the jitter, the fallback chain — all of it runs on our side. You get one shape of response and you only pay for the one that worked.

That's the whole philosophy: retries should make failures quieter, not louder. Boring schedules, random offsets, strict ceilings. Nothing clever. See the full config in the routing policies docs.

Frequently asked questions

What is exponential backoff with jitter for LLM retries?
Exponential backoff with jitter is a retry strategy where each retry waits longer than the last (e.g. 500ms, 1s, 2s, 4s) with a random offset added to each delay. The jitter prevents all clients from retrying at the same instant, which would overwhelm a recovering provider. Requesty uses this pattern with up to 10 attempts before falling back to the next model in the chain.

Why not retry LLM requests immediately after a failure?
Immediate retries under real traffic turn a short provider blip into a self-inflicted outage. Every client hammers the recovering endpoint at the same instant, creating a thundering herd problem. Exponential backoff spaces out retries and jitter randomizes their timing, giving the provider time to recover.

How does Requesty handle LLM provider outages?
Requesty uses fallback policies with a retry schedule of 500ms to 4s with jitter, up to 10 attempts per model. If a model fails all retries, Requesty automatically moves to the next model in the fallback chain. This cross-provider failover means a single provider outage does not affect your application.