
Agent Harness: Why Your LLM Gateway Is the Backbone of Production Agents

Thibault Jaigu
CEO & Co-Founder
Published May 2026

The model is the brain. The harness is the body. If you have been building AI agents in 2026, you have already noticed: the LLM call is 5% of the work. The other 95% is orchestration, tool access, memory, permissions, observability, and routing. That surrounding infrastructure is what the industry now calls the agent harness.

This post breaks down the agent harness stack, shows where an LLM gateway fits, and walks through real code examples using Requesty as the routing layer. Every snippet below runs against the live API with the OpenAI SDK you already know.

What is an agent harness?

An agent harness is the runtime infrastructure that wraps an LLM and turns raw token generation into reliable, observable, production-grade actions. The term gained traction in early 2026 after Anthropic's agentic coding report and a survey paper on harness engineering formalized the architecture.

The harness handles:

  • Tool access via MCP (Model Context Protocol) for agent to tool communication
  • Inter-agent delegation via A2A (Agent to Agent Protocol) for horizontal coordination
  • Routing to pick the right model per subtask based on cost, latency, and capability
  • Memory for working context, episodic experience, and long-term knowledge
  • Permissions and sandboxing so agents cannot exceed their scope
  • Observability to track cost, latency, success rate, and failure mode per agent per task

Without the harness, you have a chatbot. With it, you have a production system.
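
The breakdown above is easier to reason about as code. Here is a minimal, illustrative sketch of how those layers might hang together; the class and field names are our own shorthand, not a Requesty or framework API.

Python
# Illustrative sketch only: these names are hypothetical, not a Requesty or framework API.
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Tool(Protocol):
    name: str
    def run(self, **kwargs) -> str: ...


@dataclass
class Harness:
    """One possible shape for the layers listed above."""
    llm_call: Callable[[str, str], str]                     # routing: (model_or_policy, prompt) -> text, via the gateway
    tools: dict[str, Tool] = field(default_factory=dict)    # MCP layer: tool discovery and actions
    peers: dict[str, str] = field(default_factory=dict)     # A2A: agents this agent may delegate to
    memory: list[str] = field(default_factory=list)         # working context / episodic experience
    allowed_tools: set[str] = field(default_factory=set)    # permissions and sandboxing
    trace: list[dict] = field(default_factory=list)         # observability: one record per action

    def use_tool(self, name: str, **kwargs) -> str:
        if name not in self.allowed_tools:
            raise PermissionError(f"agent is not allowed to call {name}")
        result = self.tools[name].run(**kwargs)
        self.trace.append({"tool": name, "kwargs": kwargs})
        return result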

The 2026 agent harness stack

Here is how the pieces compose in a typical production deployment:

Layer          Components
Orchestrator   LangGraph, CrewAI, or custom DAG
Memory         Redis, vector store, episodic context
MCP (Tools)    Tool discovery, actions, resources
A2A (Agents)   Inter-agent delegation, handoffs
Permissions    Scopes, approval loops, sandboxing
LLM Gateway    Policy routing, failover, caching, analytics, load balancing
Providers      OpenAI, Anthropic, Google, DeepSeek, Mistral, xAI, Groq, Together...

The gateway sits between your orchestration logic and the providers. It is the nervous system of the harness: every LLM call flows through it, every response comes back through it, and every metric is captured there.

Why the gateway layer matters for agents

Three reasons this layer is non-negotiable for production agents:

1. Cost control. Agents make 10x to 100x more LLM calls than a simple chatbot. A code review agent might classify a diff (cheap model), generate a review (frontier model), and summarize findings (mid-tier model) all in one pass. Without routing, you burn frontier-model tokens on one-word classifications.

2. Reliability. Agents run autonomously. If Claude goes down at 3am, your agent should not crash. It should failover to the next model in the policy chain, log the event, and keep working.

3. Observability. When your agent costs $47 on Tuesday instead of $12 on Monday, you need to know which subtask spiked, on which branch, for which user. Analytics headers make this trivial.

Code: Discovering available models

Requesty exposes an OpenAI compatible /v1/models endpoint. Your agent can query it at startup to discover what is available:

Shell
curl https://router.requesty.ai/v1/models \
  -H "Authorization: Bearer $REQUESTY_API_KEY" \
  | jq '.data | length'

Response:

JSON
487

That is 487 models across 25 providers (OpenAI, Anthropic, Google, DeepSeek, xAI, Mistral, Groq, Together, Azure, Bedrock, and more) available through a single API key and base URL.

In Python with the OpenAI SDK:

Python
from openai import OpenAI
 
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_KEY"
)
 
models = client.models.list()
providers = set(m.id.split("/")[0] for m in models.data if "/" in m.id)
print(f"{len(models.data)} models across {len(providers)} providers")

Code: Tagging requests with agent metadata

Every LLM call your agent makes can carry metadata via X-Requesty-* headers. These are stripped before reaching the provider and stored in your analytics dashboard. The pattern is completely extensible: any header matching X-Requesty-<Name> is captured automatically.

Python
from openai import OpenAI
 
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_KEY"
)
 
response = client.chat.completions.create(
    model="openai/gpt-4.1-nano",
    messages=[
        {"role": "system", "content": "Classify: bug-fix, feature, refactor, or docs."},
        {"role": "user", "content": "Renamed getUserById to fetchUserById, added error handling."}
    ],
    max_tokens=10,
    extra_headers={
        "X-Requesty-Agent": "code-review-agent",
        "X-Requesty-Branch": "feat/auth-refactor",
        "X-Requesty-Environment": "staging",
        "X-Requesty-Task": "classify"
    }
)
 
print(response.choices[0].message.content)  # "refactor"
print(f"Cost: ${response.usage.cost:.6f}")  # Cost: $0.000006

That classification cost $0.000006. Six millionths of a dollar. Now compare that to running it on a frontier model where it would cost 100x more for the same one-word answer.
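
Scaled across an agent fleet, that gap compounds. A quick back-of-the-envelope using only the figures quoted above (the call volume is hypothetical):

Python
# Uses the figures quoted above; real prices vary by model and prompt size.
nano_cost_per_call = 0.000006    # observed cost of the one-word classification
frontier_multiplier = 100        # "100x more for the same one-word answer"
calls_per_day = 50_000           # hypothetical fleet-wide classification volume

nano_daily = nano_cost_per_call * calls_per_day
frontier_daily = nano_daily * frontier_multiplier
print(f"nano: ${nano_daily:.2f}/day vs frontier: ${frontier_daily:.2f}/day")
# nano: $0.30/day vs frontier: $30.00/day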

The headers you can attach include:

Header                   Purpose
X-Requesty-Agent         Which agent made the call
X-Requesty-Branch        Git branch context
X-Requesty-Environment   prod, staging, dev
X-Requesty-Task          Subtask type (classify, synthesize, summarize)
X-Requesty-Customer      End customer attribution
X-Requesty-Team          Team or department
X-Requesty-Repo          Repository context
X-Requesty-User          Developer or operator

All of these become filterable dimensions in the usage analytics.
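
Rather than repeating the same dictionary at every call site, it helps to centralize header construction so every subtask carries identical base metadata. The helper below is our own convenience sketch; only the header names come from the table above.

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_KEY"
)

BASE_HEADERS = {
    "X-Requesty-Agent": "code-review-agent",
    "X-Requesty-Branch": "feat/auth-refactor",
    "X-Requesty-Environment": "staging",
}

def tagged_headers(task: str, **extra: str) -> dict:
    """Merge the per-call task tag (and any ad-hoc keys) into the base metadata."""
    return {**BASE_HEADERS, "X-Requesty-Task": task, **extra}

response = client.chat.completions.create(
    model="openai/gpt-4.1-nano",
    messages=[{"role": "user", "content": "Classify this diff: ..."}],
    extra_headers=tagged_headers("classify"),
)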

Code: Policy based routing

Hardcoding model names in your agent is fragile. If anthropic/claude-sonnet-4-5 has an outage, your agent stops. Instead, reference a named policy:

Python
response = client.chat.completions.create(
    model="policy/claude-4",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this authentication refactor..."}
    ],
    extra_headers={
        "X-Requesty-Agent": "code-review-agent",
        "X-Requesty-Task": "review"
    }
)
 
print(response.model)  # "claude-sonnet-4-20250514" (resolved by the policy)

The policy claude-4 is configured in your Requesty dashboard with a fallback chain. If Claude Sonnet is down, it falls back to Claude Haiku. If Anthropic is entirely down, it falls back to GPT-4.1. Your agent code never changes. The gateway handles reliability.

You can create policies for different agent roles:

  • policy/classifier resolves to the cheapest model that handles classification well
  • policy/synthesizer resolves to the best available frontier model
  • policy/embedder resolves to the fastest embedding model

This pattern decouples your agent logic from provider availability and pricing changes.
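
In code, that separation can be as simple as a role-to-policy lookup. The sketch below reuses the hypothetical policy names from the list above; each must already exist in your Requesty dashboard.

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_KEY"
)

# Policy names are illustrative; create them in your dashboard before referencing them here.
POLICIES = {
    "classify": "policy/classifier",
    "synthesize": "policy/synthesizer",
}

def run_task(role: str, prompt: str) -> str:
    """Dispatch a subtask to whatever model the role's policy currently resolves to."""
    response = client.chat.completions.create(
        model=POLICIES[role],
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"X-Requesty-Task": role},
    )
    return response.choices[0].message.content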

Code: Tracking agent spend

After your agents run, you need to understand where the money went. The usage endpoint lets you group by any dimension you tagged:

Shell
curl -X GET "https://api-v2.requesty.ai/v1/manage/apikey/{api_key_id}/usage" \
  -H "Authorization: Bearer $REQUESTY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "start": "2026-05-01T00:00:00Z",
    "end": "2026-05-07T23:59:59Z",
    "resolution": "day",
    "group_by": ["model_requested", "extra.agent", "extra.task"]
  }'

This returns spend, token counts, and request counts grouped by model, agent name, and task type. You can answer questions like:

  • "Which agent spent the most this week?"
  • "What percentage of our budget goes to classification vs synthesis?"
  • "Did the feat/auth-refactor branch cost more than main?"

Group by extra.branch to see per-branch costs:

Shell
curl -X GET "https://api-v2.requesty.ai/v1/manage/apikey/{api_key_id}/usage" \
  -H "Authorization: Bearer $REQUESTY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "start": "2026-05-01T00:00:00Z",
    "end": "2026-05-07T23:59:59Z",
    "resolution": "day",
    "group_by": ["extra.branch"]
  }'

The extra.* pattern is unlimited. Any custom metadata you attach via headers becomes a groupable dimension in analytics.
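
If you prefer to slice the data in code, the same query works from Python. This sketch mirrors the curl call above and assumes the third-party requests package; inspect the returned JSON yourself, since the exact response schema is not shown here.

Python
import os
import requests

# Mirrors the curl request above; replace {api_key_id} and the date range with your own values.
url = "https://api-v2.requesty.ai/v1/manage/apikey/{api_key_id}/usage"
payload = {
    "start": "2026-05-01T00:00:00Z",
    "end": "2026-05-07T23:59:59Z",
    "resolution": "day",
    "group_by": ["extra.agent", "extra.task"],
}

resp = requests.get(
    url,
    headers={"Authorization": f"Bearer {os.environ['REQUESTY_API_KEY']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # grouped spend, token, and request counts per agent and task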

Code: A complete multi-model agent harness

Here is a full working example of an agent harness that uses cheap models for classification and frontier models for synthesis. This is the pattern that saves 50% or more on LLM costs:

Python
from openai import OpenAI
from typing import Literal
 
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_KEY"
)
 
AGENT_NAME = "code-review-agent"
BRANCH = "feat/auth-refactor"
 
def classify(diff: str) -> Literal["bug-fix", "feature", "refactor", "docs"]:
    """Classify a code change. Uses a nano model because this is a one-word answer."""
    response = client.chat.completions.create(
        model="openai/gpt-4.1-nano",
        messages=[
            {"role": "system", "content": "Classify: bug-fix, feature, refactor, or docs. One word only."},
            {"role": "user", "content": diff}
        ],
        max_tokens=10,
        extra_headers={
            "X-Requesty-Agent": AGENT_NAME,
            "X-Requesty-Branch": BRANCH,
            "X-Requesty-Task": "classify"
        }
    )
    return response.choices[0].message.content.strip().lower()
 
 
def review(diff: str, category: str) -> str:
    """Generate a detailed code review. Uses a frontier model for quality."""
    response = client.chat.completions.create(
        model="policy/claude-4",
        messages=[
            {"role": "system", "content": f"You are a senior code reviewer. This is a {category}. "
                                          "Provide actionable feedback in 3 bullet points."},
            {"role": "user", "content": diff}
        ],
        max_tokens=500,
        extra_headers={
            "X-Requesty-Agent": AGENT_NAME,
            "X-Requesty-Branch": BRANCH,
            "X-Requesty-Task": "review"
        }
    )
    return response.choices[0].message.content
 
 
def summarize(review_text: str) -> str:
    """One-line summary for the PR comment. Mid-tier model is fine."""
    response = client.chat.completions.create(
        model="google/gemini-2.5-flash",
        messages=[
            {"role": "system", "content": "Summarize this code review in one sentence."},
            {"role": "user", "content": review_text}
        ],
        max_tokens=100,
        extra_headers={
            "X-Requesty-Agent": AGENT_NAME,
            "X-Requesty-Branch": BRANCH,
            "X-Requesty-Task": "summarize"
        }
    )
    return response.choices[0].message.content
 
 
diff = """
- def getUserById(id):
-     return db.query(User).filter(User.id == id).first()
+ def fetchUserById(id):
+     try:
+         return db.query(User).filter(User.id == id).first()
+     except DatabaseError as e:
+         logger.error(f"Failed to fetch user {id}: {e}")
+         raise
"""
 
category = classify(diff)
review_text = review(diff, category)
summary = summarize(review_text)
 
print(f"Category: {category}")
print(f"Review:\n{review_text}")
print(f"Summary: {summary}")

This agent makes three LLM calls to three different models through a single gateway. The classification call costs fractions of a cent. The review call uses the best available Claude model via policy routing. The summary uses a fast mid-tier model. All three calls carry the same agent metadata so you can track total cost per run in your dashboard.

Why bounded workflows beat swarms

Despite the theoretical appeal of autonomous multi-agent swarms, production data tells a clear story. Teams shipping real products prefer:

  • Single, well-scoped agents with explicit boundaries over chaotic multi-agent coordination
  • Human-in-the-loop checkpoints at critical decision points rather than fully autonomous chains
  • Deterministic orchestration (if/else, DAGs) over LLM-decided routing for high-stakes paths
  • Cost caps per agent run so a misrouting loop cannot burn $500 in tokens overnight

The agent above is a bounded workflow. It has three steps, each with a clear model choice, and the total cost is predictable and observable via the gateway analytics. If the classify step returns an unexpected value, the harness catches it at the orchestration layer before spending money on a review.
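
One way to enforce that boundary in code is to validate the cheap step's output before paying for the expensive one. The guard below is our own sketch layered on top of the classify, review, and summarize functions from the harness example above; the budget constant is hypothetical.

Python
VALID_CATEGORIES = {"bug-fix", "feature", "refactor", "docs"}
MAX_RUN_COST_USD = 0.50  # hypothetical per-run cap; accumulate response.usage.cost against it for hard enforcement

def run_bounded_review(diff: str) -> str:
    """Bounded workflow: classify -> validate -> review -> summarize, cheapest step first."""
    category = classify(diff)
    if category not in VALID_CATEGORIES:
        # Stop before the frontier-model call instead of letting a bad value propagate.
        raise ValueError(f"unexpected classification {category!r}; aborting before review")
    review_text = review(diff, category)
    return summarize(review_text)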

Gartner projects 40% of enterprise applications will include task-specific AI agents by end of 2026. "Task-specific" is the key word. Not autonomous swarms. Bounded, observable, cost-controlled agents with a proper harness around them.

Building your agent harness with Requesty

Here is a practical checklist for shipping production agents on Requesty:

  1. Set up model policies in your dashboard for each agent role (classifier, synthesizer, summarizer)
  2. Tag every LLM call with X-Requesty-Agent, X-Requesty-Task, and X-Requesty-Branch
  3. Use nano/flash models for classification and frontier models for synthesis
  4. Monitor daily with the usage endpoint grouped by extra.agent and extra.task
  5. Set budget alerts so runaway agents get caught before they cost real money
  6. Enable auto-caching for repeated system prompts (up to 90% input token savings)

The gateway is not the whole harness. But it is the layer that makes every other layer observable, reliable, and cost-efficient. Memory, MCP, A2A, and orchestration all sit above it. The gateway is what keeps the lights on when providers have outages and what tells you where your money is going when agents run autonomously.

Start building at requesty.ai. The OpenAI SDK you already use works out of the box. Change two lines (base URL and API key) and your agents get routing, failover, caching, and per-call analytics for free.

Frequently asked questions

What is an agent harness?
An agent harness is everything around the LLM that makes it useful in production: orchestration, memory, tool access (MCP), inter-agent communication (A2A), permissions, observability, and routing. The model generates tokens. The harness turns tokens into actions.
Where does an LLM gateway fit in the agent harness stack?
The LLM gateway is the routing and observability layer of the harness. It decides which model handles each request, tracks cost and latency per agent, provides failover when providers go down, and feeds analytics back into the orchestration layer.
How do analytics headers help with agent observability?
By attaching metadata like X-Requesty-Agent, X-Requesty-Branch, and X-Requesty-Task to every LLM call, you get per-agent, per-task, per-branch cost and latency breakdowns without changing your application code.
What is policy-based routing?
Instead of hardcoding a model name, you reference a named policy (e.g. model: policy/prod). The gateway resolves this to the best available model based on your configured fallback chain, latency targets, and cost constraints. This decouples your agent code from provider availability.
Should I use multi-agent swarms or bounded workflows?
Production data overwhelmingly favors bounded workflows: single, well-scoped agents with explicit human-in-the-loop checkpoints. Swarms are interesting for research but introduce cascading failures, unpredictable cost, and debugging nightmares at scale.