
Designing fallback retries: why Requesty uses 500ms → 4s with jitter

Thibault Jaigu
CEO & Co-Founder

When a model provider returns a 429 or times out mid-stream, the instinct is to retry immediately. Don't. Under real traffic that's how you turn a short provider blip into a self-inflicted outage: every client in your fleet hammers the recovering endpoint at the same instant, and nothing comes back.

This is what our fallback policies were designed to avoid. The retry schedule inside Requesty is boring on purpose: 500ms → 1s → 2s → 4s, plus jitter, up to ten attempts, before we give up on a model and move to the next one in the chain. Here's why each piece is shaped the way it is.

The schedule itself

Exponential backoff is the default for a reason. If a provider's issue lasts 200ms, you catch it on the first retry at 500ms. If it lasts 3 seconds, you catch it at the 4s mark. The windows grow with the problem.
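
To make the shape concrete, here's a minimal sketch of a capped exponential schedule. It's illustrative rather than Requesty's actual code, and the 2x growth factor is inferred from the 500ms → 1s → 2s → 4s sequence:

Python
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 4.0) -> float:
    # attempt 0 -> 0.5s, 1 -> 1s, 2 -> 2s, then capped at 4s
    return min(base * (2 ** attempt), cap)

# First ten delays: 0.5, 1.0, 2.0, 4.0, 4.0, 4.0, ...
delays = [backoff_delay(n) for n in range(10)]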

What matters more is what we don't do:

  • We don't retry at 50ms. That's below the latency floor of almost every LLM provider and guarantees you miss the recovery window while adding request load.
  • We don't retry forever. Ten attempts, with the delay capped at 4s, puts a hard ceiling of roughly 15 seconds on the wait before the next model in the chain takes over. Users aren't waiting a minute for a spinner.
  • We don't retry non-retryable errors. Auth failures and malformed requests skip straight to the next policy rung. Retrying a 401 is just a slower 401.

Jitter is the quiet hero

A fleet of 5,000 clients hitting the same exponential schedule in lockstep produces a thundering herd — tightly synchronized retry waves. Recovery works right up until the moment it doesn't.

Jitter decorrelates those waves. Each request's backoff gets a small random offset, so when the provider comes back, requests arrive smeared across a window rather than in a single spike. Same total requests, flatter curve, actual recovery.
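
In code, the idea is just a random term added to the capped delay. A sketch, with the caveat that the exact jitter distribution here is an assumption, not Requesty's documented behavior:

Python
import random

def jittered_delay(attempt: int, base: float = 0.5, cap: float = 4.0) -> float:
    delay = min(base * (2 ** attempt), cap)
    # per-request random offset: two clients on the same attempt number
    # no longer wake up at the same instant
    return delay + random.uniform(0, delay * 0.5)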

What it protects against

The triggers that flip a request onto the retry path, and what they usually mean:

  • Timeout — the provider took too long. Could be capacity pressure or a single slow instance. Usually transient.
  • Rate limit — you're over quota for a few seconds. A 1-2s retry resolves most of these.
  • Generic 5xx — provider side error, no attribution. Worth one or two retries before bailing out.
  • Non-retryable (401, 400) — don't retry. Skip to the fallback model.

If all ten attempts fail on the first model, we move on. Your request doesn't die — it just ends up at the next model in your policy, and you're only charged for the one that actually returned tokens.
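
Put together, the control flow looks roughly like this. A sketch under stated assumptions: send and the status-code sets are stand-ins, and a real implementation would catch timeouts and treat them like the retryable branch:

Python
import random
import time

RETRYABLE = {408, 429, 500, 502, 503, 504}   # timeouts, rate limits, 5xx
NON_RETRYABLE = {400, 401, 403}              # bad request, auth: fail fast

def call_with_fallbacks(models, send, max_attempts=10):
    for model in models:
        for attempt in range(max_attempts):
            status, body = send(model)
            if status == 200:
                return body
            if status in NON_RETRYABLE:
                break  # don't burn attempts on a guaranteed failure
            # retryable: sleep the jittered backoff, then try again
            delay = min(0.5 * (2 ** attempt), 4.0)
            time.sleep(delay + random.uniform(0, delay * 0.5))
        # all attempts failed on this model; fall through to the next one
    raise RuntimeError("all models in the policy chain failed")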

The shortest version

Python
from openai import OpenAI

# Point the standard OpenAI client at the Requesty router
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-api-key",
)

response = client.chat.completions.create(
    model="policy/sonnet",   # a policy name, not a raw model ID
    messages=[{"role": "user", "content": "Hello!"}]
)

Point at a policy instead of a raw model. The retries, the jitter, the fallback chain — all of it runs on our side. You get one shape of response and you only pay for the one that worked.

That's the whole philosophy: retries should make failures quieter, not louder. Boring schedules, random offsets, strict ceilings. Nothing clever. See the full config in the routing policies docs.