You ship an LLM feature. It works in staging. It works for the first hundred users. Then someone reports that the chatbot "feels slow." Someone else says it "stopped working." Your cost dashboard shows a spike but you have no idea which feature caused it.
This is the observability gap that catches most teams. They instrument LLM calls the way they instrument REST APIs: count the calls, measure the latency, move on. But LLM calls are not REST calls. A 200 response can contain a useless answer. A "fast" response can still feel slow if the first token takes 8 seconds. And a single prompt change can double your token spend without changing anything in your infrastructure metrics.
This post covers what to actually measure in production, how to debug the most common issues, and what a useful LLM observability setup looks like in practice.
The five metrics that matter
Most teams start with two metrics: total cost and call count. These are fine for a monthly report. They are useless for debugging.
Here are the five metrics that actually help you operate an LLM feature:
1. Time to first token (TTFT)
TTFT is the time between sending a request and receiving the first token back from the model. For streaming applications, this is the metric your users feel. A 200ms TTFT feels instant. A 5 second TTFT feels broken, even if the total response only takes 8 seconds.
TTFT is also your best early warning signal for provider degradation. When a provider starts having issues, TTFT climbs before error rates do. The provider is not failing yet; it is just getting slow. If you only watch error rates, you will miss this.
Track TTFT at p50, p95, and p99. The median tells you the typical experience. The tail tells you how bad it gets for unlucky users.
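Measuring TTFT and summarizing it at percentiles can be sketched as follows. This is a minimal illustration, not any particular SDK's API: `open_stream` is a hypothetical callable that sends the request and returns an iterator of tokens, and the percentile uses the simple nearest-rank method.

```python
import math
import time
from typing import Callable, Iterable, List, Tuple


def measure_ttft(
    open_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, List[str]]:
    """Send a request and return (seconds until the first token, all tokens)."""
    start = clock()  # start timing just before the request is sent
    ttft = None
    tokens: List[str] = []
    for token in open_stream():
        if ttft is None:
            ttft = clock() - start  # first token arrived
        tokens.append(token)
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, tokens


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Collect one TTFT sample per request, then report `percentile(samples, 50)`, `percentile(samples, 95)`, and `percentile(samples, 99)` per model over a rolling window.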
2. Tokens per second (TPS)
TPS measures how fast the model generates output once it starts. This matters most for long responses: code generation, document summarization, multi step reasoning.
A sudden drop in TPS on the same model usually means one of two things: the provider is under load, or you changed something in the prompt that triggered heavier reasoning. If you are using a model with extended thinking (like Claude with thinking tokens), TPS on the visible output will look artificially low because the model is spending time on internal reasoning tokens that you may not see in the stream.
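To keep TPS separate from TTFT, measure it only over the window between the first and the last token. A minimal sketch, assuming you record a timestamp for each of those two events (the function and parameter names are illustrative):

```python
def tokens_per_second(output_tokens: int, first_token_at: float, last_token_at: float) -> float:
    """Generation speed once output has started; the wait for the first token is excluded."""
    elapsed = last_token_at - first_token_at
    if output_tokens < 2 or elapsed <= 0:
        return 0.0  # not enough data to compute a rate
    # The first token opens the window, so only the tokens after it are counted.
    return (output_tokens - 1) / elapsed
```

For example, 101 tokens with the first at t=1.0s and the last at t=3.0s gives 50 tokens per second, regardless of how long the first token took to arrive.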
3. Cost per session
Total cost per day is a billing metric. Cost per session (or per user action, or per agent run) is an operational metric. The difference matters.
If your daily cost doubles, the question is whether usage doubled (fine) or cost per session doubled (investigate). Without per session cost, you cannot tell.
Label your API keys by feature or team. If your chatbot, your code assistant, and your document processor all use the same key, you are flying blind when costs spike. Requesty lets you label API keys for exactly this reason.
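If you log one record per call, rolling cost up per session and per key label takes only a few lines. A sketch under assumed field names (`session_id`, `key_label`, and `cost_usd` are hypothetical; use whatever your logs actually contain):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    session_id: str
    key_label: str   # hypothetical label such as "chatbot" or "docs"
    cost_usd: float


def cost_per_session(records: List[CallRecord]) -> Dict[str, float]:
    """Total cost of each session across all of its calls."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.session_id] += r.cost_usd
    return dict(totals)


def mean_session_cost_by_label(records: List[CallRecord]) -> Dict[str, float]:
    """Average cost per session, grouped by API key label."""
    sessions: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for r in records:
        sessions[r.key_label][r.session_id] += r.cost_usd
    return {label: sum(s.values()) / len(s) for label, s in sessions.items()}
```

The per label average is a starting point; the full per session distribution is more useful once you have it, because a few runaway sessions can hide behind a healthy mean.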
4. Error rate by type
A single "error rate" metric hides everything useful. Break it down:
- Rate limit errors (429): You are sending too much traffic for your tier. Either upgrade, add retry logic with backoff, or route overflow to an alternative provider.
- Timeout errors: The model is taking too long. Check whether your timeout is appropriate for the model and task. Frontier reasoning models routinely take 30 seconds or more.
- Content filter errors: The model refused to respond. This could be a prompt injection attempt, or it could be your prompt accidentally triggering a safety filter. Log the prompt (or a hash of it) so you can investigate.
- Server errors (500/502/503): The provider is having issues. If this happens on one provider, your gateway should fail over to another. If you do not have failover, you should.
Each of these has a completely different remediation path. Lumping them into one number helps no one.
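The breakdown above can be sketched as a small classifier. How a timeout or a content filter refusal shows up differs by provider and client library, so the `timed_out` and `content_filtered` flags here are assumptions that your own code would have to set:

```python
from collections import Counter
from typing import Iterable, Optional


def classify_error(
    status_code: Optional[int],
    timed_out: bool = False,
    content_filtered: bool = False,
) -> str:
    """Map a failed call to one of the error buckets described above."""
    if timed_out:
        return "timeout"
    if content_filtered:
        return "content_filter"
    if status_code == 429:
        return "rate_limit"
    if status_code is not None and 500 <= status_code <= 599:
        return "server_error"
    return "other"


def error_breakdown(categories: Iterable[str]) -> Counter:
    """Count failures per category so each one can be charted and alerted on separately."""
    return Counter(categories)
```

Emit the category as a label on your error counter rather than as a separate metric, so one query gives you both the total and the split.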
5. Cache hit rate
If you are using prompt caching (and you should be for any application with a stable system prompt), cache hit rate is a cost and latency multiplier.
Anthropic's prompt caching gives you a 90% discount on cached input tokens and faster TTFT. But caches expire. If your cache hit rate drops from 95% to 40%, your effective input cost on those requests rises severalfold (more than 4x when nearly all input tokens are cacheable) and your TTFT gets slower. This can happen silently when you deploy a new system prompt or when traffic patterns change.
Track cache hit rate per model and per route. A global average will hide a feature that lost its cache.
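The cost effect of a falling hit rate is simple arithmetic. The sketch below assumes the hit rate is the share of input tokens served from cache, that cached tokens get the 90% discount, and that any surcharge for writing to the cache is ignored; the `(model, route, cache_hit)` tuple shape is likewise an assumption about your logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def effective_input_cost(hit_rate: float, cached_discount: float = 0.90) -> float:
    """Input cost relative to a fully uncached request (1.0 means no savings)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * (1.0 - cached_discount) + (1.0 - hit_rate)


def hit_rate_by_group(
    requests: Iterable[Tuple[str, str, bool]],
) -> Dict[Tuple[str, str], float]:
    """Cache hit rate per (model, route) from (model, route, cache_hit) tuples."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for model, route, cache_hit in requests:
        totals[(model, route)] += 1
        hits[(model, route)] += int(cache_hit)
    return {key: hits[key] / totals[key] for key in totals}
```

Dividing `effective_input_cost(0.40)` by `effective_input_cost(0.95)` gives the cost multiplier for that drop under these assumptions. If only part of each prompt is cacheable, the multiplier is smaller.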
Debugging common production issues
With these five metrics in place, here is how to investigate the issues that actually come up in production.
"The chatbot feels slow"
- Check TTFT. If TTFT is above 3 seconds for a streaming app, the user is staring at a blank screen. This is a provider issue, a model selection issue (reasoning models are slower to start), or a prompt size issue (larger prompts take longer to process before generation starts).
- Check TPS. If TTFT is fine but the response is streaming slowly, the model is generating slowly. This often happens during peak hours on shared inference endpoints.
- Check finish reason. If finish reason is length, the model hit the max token limit. The response may be truncated, and the user is seeing an incomplete answer that "feels" slow because it just stops.
"Costs spiked overnight"
- Check cost per session. Did per session cost increase, or did volume increase? Volume increases are usually good news.
- Check cache hit rates. A new deployment might have changed the system prompt, invalidating the cache. This alone can 3x your input token cost.
- Check token counts. Are output tokens per request higher? Maybe a prompt change is making the model more verbose. Are input tokens higher? Maybe your RAG pipeline is injecting more context than before.
- Filter by API key label. Which feature or team caused the spike?
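The first check in this list, volume versus per session cost, can be sketched as a ratio decomposition between two comparable periods. Parameter names are illustrative:

```python
from typing import Dict


def diagnose_cost_change(
    prev_cost: float, prev_sessions: int, curr_cost: float, curr_sessions: int
) -> Dict[str, float]:
    """Split a change in spend into a volume factor and a per session cost factor.

    total cost ratio = volume_ratio * per_session_ratio
    """
    if prev_cost <= 0 or prev_sessions <= 0 or curr_sessions <= 0:
        raise ValueError("both periods need positive cost and session counts")
    volume_ratio = curr_sessions / prev_sessions
    # Equivalent to (curr_cost / curr_sessions) / (prev_cost / prev_sessions).
    per_session_ratio = (curr_cost * prev_sessions) / (prev_cost * curr_sessions)
    return {"volume_ratio": volume_ratio, "per_session_ratio": per_session_ratio}
```

A result like `{"volume_ratio": 1.0, "per_session_ratio": 2.0}` says usage was flat and each session got twice as expensive, which points you at cache hit rates and token counts rather than at traffic. Run it per API key label to find the feature responsible.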
"The model keeps returning errors"
- Break down by error type. 429s, 500s, and content filters have completely different root causes.
- Check if it is provider specific. If errors only happen on one provider, your failover should be handling it. If it is not, check your failover configuration.
- Check the time pattern. Errors that spike at specific hours usually correlate with rate limits being hit during peak usage. Errors that spike after a deployment usually correlate with a prompt change triggering content filters.
"The agent is stuck in a loop"
- Count tool calls per session. A healthy agent run might make 5 to 15 tool calls. If you see 50+, the agent is likely stuck.
- Check finish reason distribution. If you see a high proportion of tool_calls finish reasons followed by the same tool being called repeatedly, the agent is not making progress.
- Check cost per session. Agent loops are the single fastest way to burn money on LLM calls. A budget cap per session is essential.
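These checks can be turned into a guard that runs inside the agent loop. A minimal sketch: the 50 call ceiling comes from the rule of thumb above, while the repeat threshold of 3 and the `(tool name, serialized arguments)` representation are assumptions to tune for your agent.

```python
from typing import List, Tuple


def looks_stuck(
    tool_calls: List[Tuple[str, str]], max_calls: int = 50, max_repeats: int = 3
) -> bool:
    """Flag a session with too many tool calls or one identical call repeated back to back.

    tool_calls holds (tool name, serialized arguments) pairs in call order.
    """
    if len(tool_calls) >= max_calls:
        return True
    run = 1  # length of the current streak of identical consecutive calls
    for prev, curr in zip(tool_calls, tool_calls[1:]):
        run = run + 1 if curr == prev else 1
        if run > max_repeats:
            return True
    return False


def over_budget(session_cost_usd: float, cap_usd: float) -> bool:
    """Per session budget cap: stop the agent once the cap is reached."""
    return session_cost_usd >= cap_usd
```

Check both after every tool call and abort the run when either returns `True`. Comparing arguments as well as tool names avoids flagging an agent that legitimately calls the same tool on different inputs.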
What a useful dashboard looks like
A dashboard that actually helps you operate an LLM feature has three panels:
Real time panel (last hour):
- TTFT p50 and p95, by model
- Error rate by type
- Active sessions count
- Cost accumulation rate (dollars per minute)
Operational panel (last 24 hours):
- Success rate by provider
- Cache hit rate by model
- Cost per session distribution (histogram, not average)
- Token count trends (input vs output)
Business panel (last 30 days):
- Cost per user action trend
- Total spend by API key label
- Model usage distribution
- Cost efficiency (useful completions per dollar)
The first panel is for incidents. The second is for daily operations. The third is for planning.
If you are running through Requesty, the live logs and analytics dashboard give you most of this out of the box without custom instrumentation. You get per request TTFT, cost, token counts, finish reasons, cache hit status, and provider attribution for every call.
The instrumentation trap
The biggest mistake teams make is building custom observability infrastructure before they have a gateway. You end up writing middleware to capture timing, parse streaming responses for token counts, correlate requests with costs, and forward everything to your metrics stack. That is weeks of engineering for something that should be a feature of your routing layer.
A gateway like Requesty sits in the request path by design. It already knows the TTFT, the token counts, the cost, the provider, the cache status, and the finish reason. Instrumenting at the gateway level gets you everything above with zero application code changes.
If you are currently calling LLM providers directly and struggling with observability, adding a gateway is the highest leverage move you can make. Not for the routing, not for the failover (although those help too), but because you get comprehensive telemetry as a side effect of routing.
Getting started
If you are starting from zero, here is the shortest path to useful LLM observability:
- Route through a gateway. This gets you per request telemetry without writing instrumentation code.
- Label your API keys by feature. This lets you attribute costs and debug issues per feature rather than per organization.
- Set up three alerts: TTFT p95 above your threshold, error rate above 5%, and daily cost above 120% of your trailing average.
- Review cost per session weekly. This catches prompt bloat, agent loops, and cache misses before they become expensive.
That is it. Four steps, no custom dashboards, no metrics pipeline. You can add sophistication later, but these four will catch 90% of production LLM issues before your users report them.
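If your gateway or metrics stack does not evaluate the three alert conditions for you, they are easy to express directly. A sketch with illustrative names; the 5% and 120% thresholds are the ones from the list above, and the TTFT threshold is yours to choose:

```python
from typing import List


def check_alerts(
    ttft_p95_s: float,
    ttft_threshold_s: float,
    error_rate: float,
    daily_cost_usd: float,
    trailing_avg_cost_usd: float,
) -> List[str]:
    """Return the names of the alert conditions that are currently firing."""
    alerts: List[str] = []
    if ttft_p95_s > ttft_threshold_s:
        alerts.append("ttft_p95")
    if error_rate > 0.05:  # error rate above 5%
        alerts.append("error_rate")
    if daily_cost_usd > 1.20 * trailing_avg_cost_usd:  # above 120% of trailing average
        alerts.append("daily_cost")
    return alerts
```

For example, a p95 TTFT of 4 seconds against a 3 second threshold, a 2% error rate, and $130 of spend against a $100 trailing average would fire the TTFT and cost alerts but not the error rate alert.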
You can get started with Requesty's observability features with $10 in free credits and zero configuration beyond changing your base URL.
Frequently asked questions
- What is LLM observability?
- LLM observability is the practice of collecting, correlating, and acting on operational telemetry from your LLM calls in production. It goes beyond basic logging to include per request metrics like time to first token, tokens per second, cost per session, error classification, and model level success rates. Good LLM observability lets you answer why a specific user had a bad experience, not just whether your system is up.
- What metrics should I track for LLM calls in production?
- The five essential metrics are time to first token (TTFT) for streaming UX, tokens per second (TPS) for throughput, cost per session or per user action, error rate broken down by type (rate limit, timeout, content filter, server error), and cache hit rate if you use prompt caching. Total call count and average latency are vanity metrics without these breakdowns.
- How do I debug a slow LLM response in production?
- Start with TTFT versus total latency. If TTFT is high but total is normal, the provider is slow to start generating. If TTFT is normal but total is high, the response is unusually long or the model is generating excessive reasoning tokens. Check the finish reason. If it is length, the model hit the max token limit and you may be truncating useful output. Check whether prompt caching was active for that request. A cache miss on a large system prompt can add seconds.
- How does Requesty help with LLM observability?
- Requesty provides built in observability for every request routed through the gateway. You get per token cost tracking, TTFT and total latency metrics, finish reason classification, cache hit rates, and live log streaming. You can filter by model, provider, API key label, and time range without adding any instrumentation to your application code.

