Rate Limits for LLM Providers: working with rate limits from OpenAI, Anthropic, and DeepSeek

Feb 14, 2025

Rate limits are restrictions on how frequently you can call an API or how much data you can use within a given time frame. LLM providers impose these limits to ensure their services remain stable and fair for all users, and to prevent individual users from overloading the system. In practice, a rate limit might cap the number of requests per minute or the number of tokens processed per minute for each user or organization.

For example, an API might allow only X requests per minute (RPM) or Y tokens per minute (TPM) for a given account. These limits help balance demand vs. available computational resources (like GPUs) so that no single user degrades performance for others.

Different providers structure their rate limits in similar ways, usually measured as a certain number of requests or tokens in a time window. Common metrics include requests per minute (RPM), tokens per minute (TPM), and sometimes tokens per day (TPD). Limiting tokens per minute effectively controls how much text you can send/receive each minute. Some providers also enforce requests per day or have rolling 24-hour quotas to limit total daily usage. These limits are typically applied at the organization or account level (meaning all usage from your account's API keys counts towards the same quota). Without rate limits, a buggy script or malicious actor could spam the API with thousands of calls per second, so providers use these caps to protect the service.

The Requesty Router makes it easier for you to work with rate limits by implementing fallback mechanisms. Looking for a custom or more complex setup? Send us a message, and we're happy to help.

OpenAI Rate Limits per model

OpenAI uses a combination of request and token limits that vary by model and by account tier. By default, heavier models like o1 have stricter limits than lighter models like GPT-3.5. For instance, during the initial GPT-4 beta, OpenAI set a default limit of about 200 requests per minute and 40,000 tokens per minute for the model, while GPT-3.5-turbo had a much higher default allowance (e.g. thousands of requests per minute) in comparison.

OpenAI organizes customers into usage tiers that determine their rate limits. Higher tiers – usually reached by increased monthly spending or by requesting an upgrade – come with higher RPM and TPM allowances, and these limits have generally risen as the API has matured. A typical entry-level paid account (OpenAI usage Tier 1) might allow around 3,500 requests per minute and 200,000 tokens per minute for gpt-3.5-turbo. The more resource-intensive gpt-4 model might be limited to roughly 500 requests/minute and a lower token throughput (e.g. 10,000 tokens/min) at that same tier. For the free tier, OpenAI often imposes a cap on requests per day for certain models; for example, at one point GPT-4 was capped at about 10,000 requests per day. See their Rate Limit documentation for the latest limits.

Anthropic Rate Limits per model

Anthropic’s Claude API uses a tiered rate limit system that is closely tied to your account’s usage level. New accounts start with relatively low limits, which then increase as you move to higher tiers by spending more on the API. At the lowest tier, the limits are very modest – on the order of 50 requests per minute and tens of thousands of tokens per minute, depending on the model.

These limits apply per organization and per model type. Notably, Anthropic defines limits for both input tokens and output tokens; for instance, a model might allow 40k input tokens and 8k output tokens per minute for a total of 48k TPM in a given tier (the exact breakdown varies by model and tier).

Tier 2 or 3 can mean hundreds of requests per minute and higher token throughput. Anthropic’s highest “Custom” tier has no fixed limits; instead, it’s a custom agreement where the limits are negotiated (essentially bound by the infrastructure capacity you arrange with Anthropic). If a request exceeds any of your Claude API limits, you’ll get a 429 error with a message like “Your account has hit a rate limit.” See their Rate Limits documentation for the latest limits.

DeepSeek Rate Limits per model

DeepSeek takes a different approach: it does not enforce explicit rate limit quotas on its users. According to DeepSeek’s API documentation, they do not constrain how many requests per minute you can send. In theory, this means you could send an unlimited number of queries without receiving a "rate limit exceeded" error. DeepSeek will “try its best” to serve every request. However, this doesn’t mean infinite instant capacity…

Instead of rejecting requests, DeepSeek handles high load by slowing down responses when necessary. If their servers are under heavy traffic, your requests may simply take longer to get a response. The API will keep the HTTP connection open and send periodic keep-alive signals (for example, empty lines for non-streaming requests, or special : keep-alive comments in the event stream for streaming requests) to let your client know the request is still being processed.

This mechanism prevents timeouts while effectively throttling the response speed. In other words, DeepSeek has implicit limits: under extreme load you might experience significant latency. If a DeepSeek request hasn’t completed within 30 minutes, the server will close the connection. So, while there is no defined “requests per minute” cap, practical usage is limited by server capacity, and very high concurrency could lead to long waits. This design is still aimed at preventing abuse: a user sending too many simultaneous requests will simply find them all running very slowly, which naturally discourages excessive load.
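
If you consume DeepSeek's streaming API, your client should tolerate these keep-alive signals rather than treating them as errors. Below is a minimal sketch using the requests library; the endpoint URL, model name, and exact response shape are assumptions based on DeepSeek's OpenAI-compatible interface, so check their documentation before relying on it.

    import json
    import requests

    API_URL = "https://api.deepseek.com/chat/completions"  # assumed endpoint
    API_KEY = "sk-..."  # your DeepSeek API key

    def stream_chat(messages, read_timeout=1800):
        """Stream a chat completion while tolerating keep-alive signals under load."""
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": "deepseek-chat", "messages": messages, "stream": True},
            stream=True,
            timeout=(10, read_timeout),  # (connect, read) timeouts in seconds
        )
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            # Empty lines and ": keep-alive" comments just mean "still working".
            if not raw or raw.startswith(":"):
                continue
            if raw.startswith("data: "):
                payload = raw[len("data: "):]
                if payload == "[DONE]":
                    break
                chunk = json.loads(payload)
                if chunk.get("choices"):
                    delta = chunk["choices"][0].get("delta", {}).get("content", "")
                    if delta:
                        yield delta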

Managing LLM Rate Limits in Your Codebase

Even with higher provider limits, it's crucial to manage how your application uses the API to avoid exceeding the allowed rates. Here are some strategies for controlling API calls and token usage within your codebase:

  • Introduce Delays and Scheduling: A straightforward tactic is to sleep or wait between API calls when approaching limits. For non-urgent tasks (like background processing jobs), schedule them during off-peak times or space them out. For example, if you need to process 10,000 records through an LLM, don't send 10,000 calls in a tight loop. Instead, schedule batches of requests and insert small delays (a few hundred milliseconds or seconds) between each call or batch. By pacing your workload, you stay comfortably under the rate limits and also reduce stress on the API.


  • Restrict Certain High-Volume Operations: Analyze your application for features that could generate a lot of automated calls. For instance, if you allow users to connect the app to their social media for automatic posting or allow uploading a large dataset for processing, put some safeguards in place. You might limit the frequency of automated posts (e.g. at most one post generation per minute per user), or throttle bulk processing by chunking it over time. If you have an API endpoint in your service that in turn calls the LLM API, you might enforce an app-level rate limit on that endpoint for each API consumer. By doing so, you prevent abuse such as someone writing a script to spam your service (and thus spam the underlying LLM API). Essentially, set rules for programmatic access: if a single user tries to, say, generate content in a loop or a script tries to process 1000 items in one go, your code can detect this pattern and either deny the request or queue it for gradual processing. This not only protects you from hitting the provider’s limits but also from incurring unexpectedly high costs.

  • Implement a Rolling Window Counter: Instead of firing API requests as fast as possible, implement a rate limiter in your application. For example, keep track of how many requests you've sent in the last 60 seconds (a rolling time window). If you're nearing the limit (say your limit is 50 RPM and you've sent 50 requests in the last minute), have your code pause or delay further requests until the window refreshes. This ensures you never go beyond the allowed requests per minute. Many developers use token bucket or leaky bucket algorithms or simply a queue system to spread out API calls. Remember that some providers enforce sub-minute limits as well – for instance, an official 60 RPM limit might internally be enforced as "1 request per second" evenly. So, it's wise to distribute calls rather than sending a burst of 50 all at once. Smoothing out your requests over time will prevent hitting burst limits (a minimal limiter sketch follows this list).


  • Batch or Queue Requests from Multiple Users: If you have multiple end-users in your application all making requests, their combined usage could break the limit. Create a central request queue that all user requests feed into. The system can then dequeue and send requests at a steady rate. This multi-tenant rate limiting ensures that even if 10 users all try to do something simultaneously, the backend will serialize or throttle those API calls to stay within, say, your 100 requests/min organization limit. You might process, for example, 10 requests every 6 seconds rather than 60 all at once.


  • Use provider feedback: In implementing these strategies, make use of provider feedback. Many APIs return headers with your remaining quota (for example, Anthropic’s response includes headers like anthropic-ratelimit-requests-remaining and a reset timestamp). You can read these and adjust your app’s behavior dynamically — e.g., if only 1 request is remaining for the minute, hold off new requests until the reset time. Proactively managing the flow of API calls in your codebase is key to running a smooth, resilient application.
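
To make the rolling-window and provider-feedback ideas concrete, here is a minimal sketch in Python. The call_llm() helper is hypothetical (it stands in for whatever client call your app makes), the 50 RPM figure is illustrative, and the header names follow the Anthropic example above.

    import time
    from collections import deque

    class RollingWindowLimiter:
        """Allow at most max_requests calls in any rolling window_seconds window."""

        def __init__(self, max_requests=50, window_seconds=60.0):
            self.max_requests = max_requests
            self.window_seconds = window_seconds
            self.sent_at = deque()  # monotonic send times of recent requests

        def acquire(self):
            """Block until one more request fits inside the window."""
            now = time.monotonic()
            while self.sent_at and now - self.sent_at[0] >= self.window_seconds:
                self.sent_at.popleft()  # forget requests that left the window
            if len(self.sent_at) >= self.max_requests:
                time.sleep(self.window_seconds - (now - self.sent_at[0]))
                return self.acquire()
            self.sent_at.append(time.monotonic())

    limiter = RollingWindowLimiter(max_requests=50, window_seconds=60)

    def rate_limited_call(payload):
        limiter.acquire()
        response = call_llm(payload)  # hypothetical helper that performs the API request

        # Optionally react to provider feedback headers (Anthropic-style names shown).
        remaining = response.headers.get("anthropic-ratelimit-requests-remaining")
        reset_at = response.headers.get("anthropic-ratelimit-requests-reset")
        if remaining is not None and int(remaining) == 0 and reset_at:
            print(f"Provider quota exhausted; next reset at {reset_at}")
        return response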

The Requesty Router makes it easier for you to work with rate limits by implementing fallback mechanisms. Looking for a custom or more complex setup? Send us a message, and we're happy to help.

Best Practices to Prevent Rate Limit Errors

Beyond just technical rate limiting, it's good to establish usage policies in your application to prevent hitting the API limits. Here are some best practices:

  • Set Per-User Usage Limits: Define how much API usage each end-user of your application is allowed, and enforce it. For example, you might allow each user 100 completions per day or 1000 per week. By capping at the user level, you reduce the chance that one heavy user drives you over the provider’s limit. These limits can reset daily, weekly, or monthly, and you should communicate them to your users. It not only helps avoid rate limit errors but also helps control your costs. If feasible, let users know their current usage and remaining quota (just like providers do) so they are aware of their limits. A minimal enforcement sketch follows this list.

  • Implement Hard Caps and Monitoring: It’s wise to have a safety cutoff – for instance, if normally users stay under 100 requests/day but someone suddenly has 10x that usage, you might temporarily block further requests from that user pending review. This could catch inadvertent infinite loops or malicious usage before it triggers a flood of rate limit errors. Set up alerts for unusual spikes in usage. If a certain user or process is consuming far more than average, it might be a sign of a bug or abuse. You can require a manual review or intervention when thresholds are exceeded. By curbing excessive usage proactively, you’ll rarely hit the provider’s hard limits. Essentially, treat the provider rate limit as the absolute last resort: your own application’s limits should typically be stricter so that you never actually hit the provider ceiling except in exceptional cases.

  • Use Multiple API Keys or Projects (cautiously, and only if allowed): This is a more advanced strategy and must be within the provider's terms of service (always check!). Some providers allow creating multiple API keys or projects under the same account, each with its own rate limit. In certain cases, splitting traffic across keys (for example, one key per user or per module of your app) might increase the overall throughput available to you.

    However, be careful: often the rate limit is shared at the org level, so multiple keys won’t help if they all count toward the same org limit. And creating fake accounts to bypass limits is against terms. Only use this approach if, say, you have legitimate separate projects or you’ve cleared it with the provider that you can have distinct limits per workspace or key. When used appropriately, this can isolate usage and avoid one service impacting another’s quota.
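
As a rough illustration of per-user caps with a hard safety cutoff, the sketch below keeps counters in memory; a production setup would more likely use Redis or a database, and the limits and the alert_ops() hook are placeholders.

    import time
    from collections import defaultdict

    DAILY_LIMIT = 100   # normal per-user allowance (placeholder)
    HARD_CAP = 1000     # safety cutoff that should trigger a manual review (placeholder)

    _usage = defaultdict(lambda: {"day": None, "count": 0})

    def check_and_record(user_id):
        """Return True if this user may make another request today."""
        today = time.strftime("%Y-%m-%d")
        record = _usage[user_id]
        if record["day"] != today:       # reset the counter at the day boundary
            record["day"], record["count"] = today, 0
        if record["count"] >= HARD_CAP:
            alert_ops(user_id)           # hypothetical hook to flag unusual usage
            return False
        if record["count"] >= DAILY_LIMIT:
            return False                 # deny politely, or queue for later processing
        record["count"] += 1
        return True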

By instituting these best practices, you create a buffer between your app’s usage and the provider’s limits, making it far less likely to ever see a 429 "Too Many Requests" error in production. It's all about being proactive: monitor usage, communicate limits, and stop problems before they start.

Retrying with Exponential Backoff

Even with careful planning, you may occasionally hit a rate limit (network hiccups or concurrency spikes happen). When a rate limit error occurs (HTTP 429 from OpenAI or Anthropic), your client should not immediately slam the server with retries. Instead, implement exponential backoff for retries.

Exponential backoff means that after encountering an error, you wait for a short random delay and retry the request; if it fails again, wait longer, then retry; and so on, increasing the wait time between each attempt. For example, on the first 429 error you might wait 1 second, if it fails again wait 2 seconds, then 4 seconds, etc., perhaps with some randomness (jitter) added to avoid synchronized retries. This approach is recommended because it gives the rate limit window time to reset and alleviates stress on the API. Many HTTP clients and libraries have built-in support or plugins for exponential backoff to handle 429/503 errors.

The benefits of using exponential backoff for LLM API calls are significant:

  1. Automatic Recovery: Your application can recover from transient rate limit hits without manual intervention. The first few requests might get delayed, but eventually the call goes through once the rate window opens up. This way, users might experience a slight delay rather than a hard failure or crash. The process is invisible to the end-user aside from the extra wait time.


  2. Efficient Use of Retries: By spacing out retries (longer and longer), you avoid wasting calls during the period you're still over the limit. Quick, repeated retries with no delay would likely all fail and count against your limit. Exponential backoff tries again quickly at first, but then backs off to longer waits, which increases the chance that the retry lands after your limit has reset. This maximizes the chance the retry succeeds without needlessly hammering the API.


  3. Randomized Delay (Jitter): Introducing randomness in wait times (e.g., one client waits 2.1 seconds, another waits 2.7 seconds) prevents multiple simultaneous clients from retrying in lockstep. If all clients waited exactly 2 seconds and retried together, you could get a thundering herd that causes another immediate limit hit. Randomized exponential backoff staggers these retries, smoothing out the traffic.

When implementing backoff, also set a reasonable limit on retries. For example, you might give up after some time (say, double delays up to 30 seconds max, or stop after N attempts) so that your system doesn’t hang forever if something is persistently wrong. Logging the 429 errors and the retry attempts is useful for monitoring how often you hit limits.

One important caution: remember that failed requests still count toward your rate limit in many cases. For OpenAI, for instance, if you send a request that gets rejected with a 429, that request still consumed one request slot (and whatever tokens were in it) from your quota. So blindly retrying very fast can actually make the problem worse by eating into your limit even as you hit it. Exponential backoff mitigates this by minimizing the number of attempts. Always honor any Retry-After header if the API provides one; it will tell you exactly how many seconds to wait before retrying. Both OpenAI and Anthropic might include a Retry-After header or a specific error message indicating when you can resume. In summary, do not continuously resend requests without delay when a rate limit is reached. Back off, wait, and try again gradually. This strategy will make your integration much more robust under heavy load.
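
Here is a minimal sketch tying these pieces together: exponential backoff with jitter, a cap on attempts, and honoring Retry-After when the server provides it. The send_request() helper is hypothetical and simply represents your HTTP call; it is assumed to return an object exposing status_code and headers.

    import random
    import time

    MAX_ATTEMPTS = 6
    BASE_DELAY = 1.0   # seconds
    MAX_DELAY = 30.0   # cap on any single wait

    def call_with_backoff(payload):
        """Retry on 429/503 with exponential backoff, jitter, and Retry-After support."""
        for attempt in range(MAX_ATTEMPTS):
            response = send_request(payload)  # hypothetical helper wrapping the API call
            if response.status_code not in (429, 503):
                return response

            retry_after = response.headers.get("Retry-After")
            if retry_after is not None:
                delay = float(retry_after)  # assumes seconds, not an HTTP date
            else:
                delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            delay *= random.uniform(0.8, 1.2)  # jitter to avoid synchronized retries

            print(f"Rate limited (attempt {attempt + 1}); sleeping {delay:.1f}s")
            time.sleep(delay)

        raise RuntimeError("Giving up after repeated rate limit errors")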

Optimizing Token Usage

Another angle for avoiding rate limit issues is to optimize how many tokens you send and receive, since token-based limits (TPM, etc.) are often the first constraint you hit with LLMs. If your application can be smart about token usage, you can stay under limits and also reduce costs. Here are some tips:

  • Adjust the max_tokens Setting: The max_tokens parameter (for OpenAI; Anthropic’s Messages API uses a similar max_tokens parameter) controls the maximum length of the model’s response. Setting this value appropriately can prevent overuse of tokens. For instance, if you expect an answer to be a couple of sentences (say ~50 tokens), don’t set max_tokens to 1000. A too-high limit not only potentially wastes tokens if the model starts rambling, but it can count against your token-per-minute quota. In fact, OpenAI counts tokens toward your limit based on the maximum you allow. The API will consider the sum of your input tokens plus max_tokens as the potential tokens for the request when applying TPM limits. That means even if the model doesn’t use all those tokens, you needed to have that capacity available. By lowering max_tokens to a sensible ceiling, you reduce the token budget each request requires. This helps avoid hitting the TPM cap.


  • Match Token Limits to Expected Completion Size: As a best practice, try to estimate how long of a response you really need from the model and set the limits accordingly. If you are summarizing a short article, maybe 200 tokens is enough for the summary. If you are answering a yes/no question, maybe 20 tokens is enough. Giving the model a token budget that’s just above what you need creates a natural cutoff so it won’t generate unnecessary text. It also makes each response faster. Additionally, consider instructing the model to be concise if that suits your application (for example, you can prompt with "Answer in one paragraph."). A concise response uses fewer tokens, letting you serve more requests within the same token-per-minute quota.


  • Optimize Prompts and Context: Long prompts or conversations consume input tokens. If you have a chatbot with a conversation history, be mindful of how much of the history you send each time. Maybe you don’t need to send the entire chat history on each request, perhaps just the last few relevant turns. Trimming unnecessary tokens from prompts (like extra whitespace, irrelevant data, or overly verbose instructions) can cut down token usage significantly. Every token saved is room for another request. Using tools like OpenAI’s tiktoken library can help you programmatically count tokens so you know how big your prompts and outputs are.
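
For instance, a small helper built on tiktoken can count prompt tokens and pick a max_tokens value just above the expected answer length. The encoding name and the TPM figure below are illustrative, not tied to a specific model.

    import tiktoken

    TPM_LIMIT = 200_000  # illustrative tokens-per-minute quota
    enc = tiktoken.get_encoding("cl100k_base")

    def budget_request(prompt, expected_answer_tokens=200, margin=50):
        """Count prompt tokens and choose a max_tokens value just above what is needed."""
        prompt_tokens = len(enc.encode(prompt))
        max_tokens = expected_answer_tokens + margin
        # TPM limits are applied against prompt tokens + max_tokens, so this sum is
        # roughly the capacity each request reserves from your quota.
        reserved = prompt_tokens + max_tokens
        print(f"{prompt_tokens} prompt tokens; reserving ~{reserved} of {TPM_LIMIT} TPM")
        return {"prompt_tokens": prompt_tokens, "max_tokens": max_tokens}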

By optimizing token usage, you not only avoid rate limit issues but also reduce cost (since billing is also per token). In short: be frugal with tokens. Treat the token budget like a valuable resource. This will pay off when you are operating at scale under strict token-per-minute limits.

Batching Requests for Efficiency

When dealing with large volumes of tasks, batching can be an effective way to improve throughput without hitting hard per-minute caps. Batching means handling multiple requests together either in a single API call or in an asynchronous job, rather than one-by-one in rapid succession.

  • Batching for Asynchronous Processing: OpenAI has introduced a Batch API specifically for high-volume processing jobs. This is a special endpoint where you can submit a large batch of requests (up to 50,000 requests in one job) to be processed asynchronously. You provide a file of requests, and OpenAI will handle them in the background, returning the results once all are processed (which could be up to 24 hours later, depending on size). The advantage is that batch jobs do not count against your normal per-minute rate limits in the same way. They are executed offline with a separate capacity allocation. In fact, OpenAI offers about a 50% cost reduction for requests made via the Batch API because it allows the work to be scheduled more efficiently on their side. This is ideal for use cases like processing a large dataset (e.g., translating thousands of documents or generating embeddings for a big text corpus). By using the Batch API, you can effectively bypass the strict request-per-minute limits and let OpenAI handle the throughput asynchronously. Of course, the trade-off is latency: you don’t get results in real-time. But if you don’t need immediate responses, this is a great way to avoid rate limits while maximizing throughput.


    To use OpenAI’s Batch API, you typically prepare a JSONL file of all your requests and upload it, then initiate a batch job. The API will process them and allow you to download the results when ready. The batch job ensures the requests are handled as quickly as possible under the hood, and you can monitor its status. This is much more efficient than trying to script 50k sequential calls yourself, which would definitely hit your API limits or take a long time. A minimal sketch of this workflow follows this list.

  • Batching Multiple Tasks in One Request (Synchronous): Sometimes you need results immediately (synchronously), but you still want to reduce the number of separate API calls. In such cases, you can batch tasks by combining them into a single request. For example, OpenAI’s completion and chat APIs allow you to send one prompt and get one response. But that prompt could instruct the model to do multiple things at once. You might send a prompt like: "1. Translate the following sentence to French. 2. Summarize the following paragraph. ..." etc., essentially requesting multiple outputs in one go. The model can return a compound answer. This way, you use one request (counting as 1 RPM) to accomplish what might have taken 2 or 3 requests otherwise. Another example: if you need embeddings for 100 sentences, OpenAI’s embeddings endpoint lets you submit all 100 in a single API call (it returns an array of embeddings). This is built-in batching support that greatly improves throughput.


    However, be mindful of token limits when batching in a single request. If you cram too many tasks into one prompt, the prompt itself becomes very large, and the response might be large too, possibly hitting token limits or context length limits. So there’s a balance. For moderate batch sizes, this technique can be a big win. It reduces overhead from HTTP requests and helps stay under any requests-per-minute limit since you’re making fewer calls. Just ensure the tasks are logically related enough that combining them doesn’t confuse the model or degrade the quality of the response.

  • Asynchronous Scheduling and Parallelism: If your environment supports concurrency, you can also dispatch multiple requests in parallel up to the limit. For example, if you’re allowed 50 RPM, you might run 5 requests in parallel, each second, for 10 seconds, to maximize usage. This isn’t exactly batching in one request, but it’s a way to utilize your quota efficiently. Just be cautious not to exceed the token limit: five simultaneous large requests could collectively count toward your TPM, so you might need a token-based throttle as well.
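
To make the Batch API workflow concrete, here is a sketch using the official OpenAI Python SDK (v1.x). The file name, model, prompts, and single status check are placeholders; in practice you would poll until the job reports completed.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # 1. Write one request per line to a JSONL file (placeholder tasks shown).
    tasks = ["Summarize document 1 ...", "Summarize document 2 ..."]
    with open("batch_requests.jsonl", "w") as f:
        for i, prompt in enumerate(tasks):
            f.write(json.dumps({
                "custom_id": f"task-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 200,
                },
            }) + "\n")

    # 2. Upload the file and start the asynchronous batch job.
    batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
    job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

    # 3. Later, check the status and download results once the job finishes.
    job = client.batches.retrieve(job.id)
    if job.status == "completed":
        results = client.files.content(job.output_file_id).text
        print(results)

For the synchronous case, the embeddings endpoint already accepts a list of inputs in one call (for example, client.embeddings.create(model="text-embedding-3-small", input=sentences)), which is the built-in batching mentioned above.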

In summary, batching is about working smarter within the limits. Use provider-specific batch features for large-scale jobs to sidestep normal rate limits on a separate channel. And for real-time needs, consider combining tasks or sending requests in bundles so you get more done per API call. Both approaches will help you increase throughput without hitting that dreaded "Too Many Requests" error as quickly.

The Requesty Router makes it easier for you to work with rate limits by implementing fallback mechanisms. Looking for a custom or more complex setup? Send us a message, and we're happy to help.

By understanding each provider’s rate limit structure and employing these strategies (from efficient coding practices to leveraging special APIs), you can manage rate limits effectively rather than being blocked by them. In practice, a combination of upgrading your tier, coding defensively, and optimizing your usage will allow you to scale your use of LLM APIs like OpenAI’s and Anthropic’s while minimizing errors. Remember that rate limits exist to protect both you and the provider. By respecting and planning for them, you ensure your application runs smoothly and your users stay happy. Effective rate limit management is now a key skill for developers integrating AI APIs, and with the tips above, you’ll be well-equipped to handle it.

© Requesty Ltd 2025