May 2026  ·  AI Engineering  ·  Deep Dive

Prompting to Harness Engineering

The model is commodity. The harness is moat. 88% of enterprise AI agents fail in production — not because the model is wrong, but because the environment around it is broken.

Haroon K M · haroontrailblazer.vercel.app
Agent = Model + Harness
~26 min read
TL;DR — The Five Claims of This Article
This is a technical deep-dive into harness engineering — the discipline formalized in February 2026 that treats the environment around a language model as the primary engineering variable for AI agents.
  • The model accounts for at most 15% of production performance variance. The harness accounts for the rest.
  • All major agent failure modes (victory bias, context anxiety, doom loops, grep-spree, goal drift) are harness-solvable problems — not model limitations.
  • A production harness has six layers: tool orchestration, verification loops, context & memory, guardrails & permissions, observability, and state persistence. Missing any one creates systematic failures.
  • LangChain's Terminal Bench 2.0 result — +25 rank positions, zero model changes — is the controlled proof. Every other benchmark confirms it.
  • The next frontier is Environment Engineering: redesigning internal systems to be inherently AI-legible, reducing the harness burden rather than building around complexity debt.

88% of enterprise AI agents never make it to production.

That number, cited across multiple 2026 AI engineering reports, is jarring. Not 20%. Not 40%. Nearly nine out of ten. And the failure isn't what you'd expect. It isn't the model. The model can write code, plan tasks, use tools. The model isn't the problem.

The problem is everything else. Context rot. Absent guardrails. No state persistence. Feedback loops that never close. Agents that declare victory without checking their work. Systems that run beautifully in a demo and catastrophically in production.

The dominant assumption from 2023 through most of 2025 was that the model itself was the determining variable. Teams optimized around the next model release. By late 2025, that assumption had quietly collapsed. The performance gap between frontier models from different labs had narrowed substantially. The gap between the same model with and without a well-designed harness had widened dramatically. The harness became the actual variable.

In February 2026, three converging publications crystallized what practitioners had been building without a shared vocabulary: Mitchell Hashimoto's personal blog post, Ryan Lopopolo's field report from OpenAI, and a cascade from LangChain, Anthropic, and Thoughtworks. A discipline was named: Harness Engineering.

What changed in 2025–2026

Frontier model capability gaps narrowed. GPT-4-class intelligence became commoditized. The marginal return on a model upgrade dropped; the marginal return on harness investment rose sharply.

Result: engineering effort shifted from prompt crafting to environment design — and teams that made that shift earlier outperformed those still optimizing prompts.

What "harness" means precisely

Everything in an AI agent except the model itself. The tools it can use, the permissions it's granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures.

Source: Birgitta Böckeler, Thoughtworks / Martin Fowler, April 2026.

Three eras. One trajectory.

AI engineering maturity has moved through three distinct phases, each one revealing a new bottleneck that the previous phase couldn't solve.

💬
2022 – 2023 · Bottleneck: Language

Prompt Engineering: Talking to the Model

The early discovery: phrasing matters enormously. "Write a function that..." produces different output than "You are a senior engineer. Implement a function that...". The whole discipline focused on language as the lever.

What it is
Crafting the instruction

Single-turn optimization. One question, one answer. Role-playing, few-shot examples, chain-of-thought cues. Engineers competed to find the magic phrase.

What it produced
Prompt libraries & templates

Teams maintained "golden prompts." Performance was fragile — a model update could break a carefully tuned prompt. The knowledge was linguistic, not architectural.

Where it worked
Single-turn tasks

Drafting, summarizing, simple Q&A. Tasks that fit in one exchange with stable, small context worked well. Multi-step, state-dependent, and domain-specific tasks broke it.

Where it broke down
Context became the ceiling

Once models got more capable, the bottleneck shifted from how you ask to what information the model has. Better prompts on the wrong context = better-worded wrong answers.

🧠
2024 – 2025 · Bottleneck: Information

Context Engineering: Teaching the Model What to Know

Popularized by Andrej Karpathy in mid-2025, though practised much earlier. The focus shifted from wording to curation: what files, rules, schemas, and constraints go into the model's context window?

What it is
Curating what the model sees

RAG pipelines, MCP servers, structured documentation, codebase maps. Engineers stopped asking "how do I say this?" and started asking "what does the model need to know?"

Key techniques
Dynamic context loading

Semantic retrieval, structured system prompts, CLAUDE.md / AGENTS.md files, per-tool schema injection. Context became a curated artifact, not an afterthought.

Where it worked
Domain-aware reasoning

Coding assistants with codebase context, customer service bots with product knowledge, research agents with document retrieval. Quality improved dramatically with the right context.

Where it broke down
Autonomy without control

Once models had the right information, the challenge became autonomy and reliability. Agents knowing the right thing didn't mean they did the right thing — or stopped when they should.

🏗️
2026 → · Bottleneck: Control

Harness Engineering: Designing the Environment

Formalized by Mitchell Hashimoto in February 2026. The model is taken as given. The question is: what is the full system of constraints, feedback loops, tools, and verification that turns raw capability into reliable, autonomous production work?

What it is
Engineering the environment

Every agent failure becomes a permanent engineering fix. AGENTS.md grows. CI gates harden. Linters enforce. Verification loops close. The harness gets smarter each time an agent breaks.

The formula
Agent = Model + Harness

Coined by LangChain after OpenAI's field report: the model supplies intelligence, the harness makes that intelligence useful and reliable. Swap the harness; transform the agent.

Primary outputs
Autonomous end-to-end tasks

Full PRs, multi-session refactors, cross-system migrations. Work that previously required sustained human attention, delegated to agents with structured accountability.

The meta-skill
Harness reasoning

When an agent fails, the question is never "which prompt fixes this?" but "which layer is absent? Which guardrail is missing? Which loop doesn't close?" This is the reasoning mode of harness engineering.

| Phase | Period | Core Question | Bottleneck Solved | Primary Output |
| --- | --- | --- | --- | --- |
| Prompt Engineering | 2022 – 2023 | How do I phrase this? | Language | Code snippets, boilerplate |
| Context Engineering | 2024 – 2025 | What should the model know? | Information | Context-aware feature code |
| Harness Engineering | 2026 → | What can the model do, and not do? | Control | Autonomous end-to-end tasks |

The model is not the variable.

This is the most important reframe in AI engineering in 2026. Swap the model for a competitor and output quality shifts perhaps 10–15%. Change the harness and you change whether the system works at all.

Model + Harness = Agent
The equation that defined AI engineering in 2026 — coined by LangChain, formalized from OpenAI's field report

Anytime you find an agent makes a mistake, you take the time to engineer a solution so that the agent never makes that mistake again.

— Mitchell Hashimoto, February 2026

Days later, Ryan Lopopolo from OpenAI's technical staff published the most cited field report in AI engineering that year. His team had spent five months building a production product with a single hard constraint: zero manually-written lines of code. Every function, every test, every CI script, every piece of documentation — all generated by Codex agents.

  • 1M+ lines of code, zero written by hand
  • 1,500 merged pull requests in 5 months
  • 3–10× engineer throughput multiplier

The tagline: "Humans steer. Agents execute." Engineers stopped writing code. They started designing the environment that made reliable code generation possible. When something failed, the fix was almost never "write a better prompt." It was: "What capability, context, or structure is missing from the harness?"

The harness engineering definition from Martin Fowler's site (Birgitta Böckeler, April 2026) captures it precisely: "Everything in an AI agent except the model itself. The tools it can use, the permissions it's granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures."

Inside the harness: components, patterns, requirements.

A harness is not a monolithic object. It is a layered system of components, each with a distinct responsibility. The sections below detail what each component does, what patterns it uses, and what metrics signal its health.

Harness components

  • L1 Orchestration: Tool Registry
  • L2 Correctness: Verification
  • L3 Memory: Context Management
  • L4 Safety: Guardrails
  • L5 Visibility: Observability
  • L6 Persistence: State Engine

Tool Registry & Orchestration

The tool registry is the complete manifest of capabilities available to the agent. It determines what the agent can do. Orchestration governs how those capabilities get chained into coherent workflows.

Health metrics: 4–5 tools per bounded task · 3 retry tiers before escalation · <200ms tool dispatch overhead

  • [Required] Atomic tool design: each tool does one thing. Composite tools hide failure modes.
  • [Required] Typed tool schemas with JSON Schema validation. Reject malformed calls before execution.
  • [Recommended] Tool selection rationale logging: record why each tool was called for trace analysis.
  • [Recommended] Exponential backoff on transient failures with structured error payloads returned to the agent.
  • [Advanced] Dynamic tool loading: inject domain-specific tools only for the task type detected, not all tools for all tasks.
  • [Advanced] Tool budgets: cap maximum calls per tool per session to prevent runaway loops.
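
A minimal sketch of the schema-validation and tool-budget items above. The `ToolRegistry`, `Tool`, and `ToolCallRejected` names are hypothetical, and a plain parameter-to-type map stands in for real JSON Schema validation to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised when a call fails validation or exceeds its budget."""


@dataclass
class Tool:
    name: str
    params: dict[str, type]      # parameter name -> required Python type
    run: Callable[..., Any]
    max_calls: int = 20          # per-session budget (illustrative default)


@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)
    calls: dict[str, int] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool
        self.calls[tool.name] = 0

    def dispatch(self, name: str, args: dict[str, Any]) -> Any:
        tool = self.tools.get(name)
        if tool is None:
            # Tools outside the registry cannot be called at all.
            raise ToolCallRejected(f"unknown tool: {name}")
        if set(args) != set(tool.params):
            raise ToolCallRejected(f"{name}: expected params {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(args[key], expected):
                raise ToolCallRejected(f"{name}.{key}: expected {expected.__name__}")
        if self.calls[name] >= tool.max_calls:
            raise ToolCallRejected(f"{name}: budget of {tool.max_calls} calls exhausted")
        self.calls[name] += 1
        return tool.run(**args)


registry = ToolRegistry()
registry.register(
    Tool("read_file", {"path": str}, lambda path: f"<contents of {path}>", max_calls=2)
)
```

Malformed calls are rejected before the tool runs, and rejected calls do not count against the budget. In a real harness the rejection message would be returned to the agent as a structured error rather than raised.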

Verification Loops

Verification that fires during execution — not just at the end. Intermediate checkpoints with structured pass/fail signals are what make self-correction practical. End-of-task verification alone is nearly useless for multi-step work.

Health metrics: ±40% pass rate improvement (intermediate vs. terminal verification) · 3 verification tiers (syntax, semantic, integration)

  • [Required] Completion criterion check: the agent must verify the stated objective before declaring done.
  • [Required] Syntax-level verification (linters, type checkers) as blocking gates, not advisory warnings.
  • [Recommended] Semantic verification: test suite execution with structured output fed back as agent context.
  • [Recommended] Regression lock: before a task completes, verify all pre-existing tests still pass.
  • [Advanced] Differential verification: compare the agent's stated change against actual diff to detect undisclosed modifications.
  • [Advanced] Cross-agent verification: route high-stakes outputs to a separate evaluator agent before committing.

Context Management

Context is finite and expensive. A harness engineers context as a first-class resource. The two failure modes are symmetric: too little (agent lacks what it needs) and too much (attention diluted, context anxiety sets in).

Health metrics: ~100 lines AGENTS.md target size · 75% context fill threshold for compaction · todo.md as the goal re-anchor pattern

  • [Required] AGENTS.md / CLAUDE.md: a structured entry-point (~100 lines), not an encyclopedia. Table of contents with deep-link pointers.
  • [Required] Proactive compaction: summarize context at 75% fill threshold. Don't wait for anxiety to set in.
  • [Recommended] todo.md pattern: the agent re-reads its plan at each checkpoint, preventing goal drift over long tasks.
  • [Recommended] Tiered context loading: minimal system context always-on; domain context loaded on demand; session state reconstructed from checkpoints.
  • [Advanced] Semantic deduplication: detect and collapse redundant context before injection using embedding similarity.
  • [Advanced] Conversational memory distillation: after each session, extract and persist a structured summary for multi-session continuity.

Guardrails & Permissions

The core architectural distinction: telling an agent "follow our coding standards" in a prompt relies on probabilistic compliance. Wiring a linter that blocks the PR when standards are violated enforces deterministic constraints. One is advisory. The other is structural.

Health metrics: 2-stage classifier design (fast gate + CoT) · <7% target human approval rate for non-trivial actions

  • [Required] Least-privilege tool access: the harness defines what tools exist, not the agent. Agent cannot self-provision capabilities.
  • [Required] Feedforward constraints: rules files, type systems, architectural lint, all reducing the solution space before generation.
  • [Required] Feedback enforcement: post-generation violation detection that triggers structured error recovery loops.
  • [Recommended] Two-stage approval gates: fast single-token classifier first; chain-of-thought reasoning only on flagged actions. Avoids approval fatigue.
  • [Recommended] Reversibility scoring: rate each planned action on a reversibility scale; require higher approval thresholds for irreversible operations.
  • [Advanced] Dynamic permission tiers: escalate permission requirements based on task complexity, session history, and detected confidence signals.

Observability

Full session replay, per-turn token attribution, subagent execution trees, and complete cost tracking. Every tool call, every context load, every verification result, every guardrail trigger — in a structured, queryable trace. Without this, debugging is archaeology.

Health metrics: 5 planes (trace, eval, cost, guardrail, perf) · P50/P95 latency SLOs per tool type

  • [Required] Structured trace format: every turn emits a structured log with turn ID, tool calls, token counts, and outcome.
  • [Required] Failure attribution: each failure is tagged to a layer (orchestration, verification, context, guardrail), not just flagged as "agent error."
  • [Recommended] Session replay: ability to replay any session from trace data, substituting different harness parameters to test fixes.
  • [Recommended] Cost attribution per task type: understand token consumption patterns across your task distribution to optimize context loading.
  • [Advanced] Automated harness improvement: the observability layer feeds a harness-improvement agent that proposes AGENTS.md additions from failure patterns.
  • [Advanced] Cross-session aggregate analysis: identify patterns across sessions (not just within) to catch systemic harness gaps invisible in individual traces.

State Engine & Persistence

Context windows are ephemeral. A state engine persists what matters across turns and sessions — structured checkpoints, decision logs, task state, and environmental snapshots — making multi-session autonomous work tractable.

Health metrics: 3 memory layers (working / session / long-term) · checkpoint: state snapshot at phase gates

  • [Required] Phase-gate checkpointing: persist structured state at each phase boundary, enabling resume after interruption.
  • [Required] Decision log: every significant architectural decision is persisted with rationale, preventing the agent from re-litigating settled choices.
  • [Recommended] Three-tier memory: working memory (current turn), session memory (current task), long-term memory (cross-session knowledge).
  • [Recommended] Environmental snapshot: capture file system state, dependency versions, and tool configurations at session start for reproducibility.
  • [Advanced] Anthropic Managed Agents pattern: virtualize session (durable event log), sandbox (disposable container), and harness loop as separate components with independent lifecycle management.

A production harness has six layers.

A production-grade harness is a layered system. Missing any layer doesn't degrade performance gracefully; it creates systematic failure modes that look like model problems but are harness deficits.

L1
Tool Orchestration
The central nervous system

Tool orchestration is what transforms the model from a passive text generator into an autonomous actor capable of multi-step workflows. It governs how the agent accesses environments (secure file systems, shells, internal APIs) and how intelligently it chains these utilities together.

Robust orchestration includes dynamic error handling — the ability to recognize when something went wrong, pivot strategy, and recover without human intervention. This is where most naive agent implementations fail: no error recovery, no retry logic, no graceful degradation. The agent hits a wall and either halts or spirals.

Pattern: Error Recovery Hierarchy

  • Tier 1: retry with the same tool and parameters (transient failures).
  • Tier 2: retry with modified parameters or an alternative tool.
  • Tier 3: escalate to a human with structured error context.
  • Tier 4: checkpoint state and halt cleanly.

Agents without this hierarchy either give up immediately or loop forever.
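
The four tiers can be sketched as a single control-flow function. This is an illustration under stated assumptions: `run_with_recovery`, `TransientError`, and `Outcome` are hypothetical names, backoff delays are left out, and Tier 4's checkpointing is reduced to a returned status.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TransientError(Exception):
    """A failure worth retrying unchanged (timeouts, rate limits)."""


@dataclass
class Outcome:
    tier: int                    # tier that produced the final outcome
    status: str                  # "ok", "escalated", or "halted"
    value: Optional[str] = None


def run_with_recovery(
    primary: Callable[[], str],
    fallback: Optional[Callable[[], str]] = None,
    ask_human: Optional[Callable[[str], bool]] = None,
    retries: int = 2,
) -> Outcome:
    last_error = ""
    # Tier 1: retry the same call, but only for transient failures.
    for _ in range(retries + 1):
        try:
            return Outcome(1, "ok", primary())
        except TransientError as exc:
            last_error = str(exc)
        except Exception as exc:
            last_error = str(exc)
            break  # non-transient: repeating the same call will not help
    # Tier 2: modified parameters or an alternative tool.
    if fallback is not None:
        try:
            return Outcome(2, "ok", fallback())
        except Exception as exc:
            last_error = str(exc)
    # Tier 3: escalate to a human with the error context.
    if ask_human is not None and ask_human(last_error):
        return Outcome(3, "escalated")
    # Tier 4: checkpoint and halt cleanly (checkpoint write omitted here).
    return Outcome(4, "halted")
```

The point of the shape is that every path terminates: the function can neither give up on the first transient error nor retry without bound.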

The tool registry defines what the agent can do. Making it narrow and atomic by default — 4–5 tools per bounded task — is standard practice. Scope creep in tool access is as dangerous as scope creep in code. Tool proliferation creates combinatorial planning complexity that degrades agent decision quality.

L2
Verification Loops
Catching failures before they become facts

Verification loops are automated quality-assurance steps that fire during execution, not just at the end. This distinction matters enormously. A final verification that fires after the agent has made 40 interdependent changes gives almost no useful signal and is nearly impossible to recover from. Intermediate verification at each meaningful checkpoint is what makes self-correction practical.

Anthropic's research identified "victory declaration bias" as one of the most consistent failure modes: agents frequently mark tasks complete without verifying the outcome. A verification loop that actually checks the stated completion criterion — and returns a structured error if it fails — closes this loop deterministically.

Implementation: Three-Tier Verification

  • Tier 1 (Syntax): linters, type checkers, formatters as blocking gates, not warnings.
  • Tier 2 (Semantic): test suite execution, with structured pass/fail fed back as agent context.
  • Tier 3 (Integration): end-to-end smoke tests verifying the change works in a real environment.

Each tier gates progression to the next.
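
A minimal sketch of tier gating, assuming each tier is wrapped as a callable that returns a structured result. `CheckResult`, `run_gates`, and `may_complete` are hypothetical names, and the three stub checks stand in for a real linter, test runner, and smoke test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    tier: str          # "syntax", "semantic", or "integration"
    passed: bool
    detail: str = ""   # structured output to feed back to the agent


Check = Callable[[], CheckResult]


def run_gates(checks: list[Check]) -> list[CheckResult]:
    """Run tiers in order; a failing tier blocks every later tier."""
    results: list[CheckResult] = []
    for check in checks:
        result = check()
        results.append(result)
        if not result.passed:
            break
    return results


def may_complete(results: list[CheckResult], tiers_required: int = 3) -> bool:
    """A task may be marked done only if every required tier ran and passed."""
    return len(results) == tiers_required and all(r.passed for r in results)


# Stub checks standing in for a linter, a test runner, and a smoke test.
def syntax_ok() -> CheckResult:
    return CheckResult("syntax", True)


def tests_fail() -> CheckResult:
    return CheckResult("semantic", False, "2 tests failed: test_login, test_logout")


def smoke_ok() -> CheckResult:
    return CheckResult("integration", True)
```

With `run_gates([syntax_ok, tests_fail, smoke_ok])` the smoke test never runs, and the failing tier's `detail` is what the harness would hand back to the agent as context.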

Spotify's Honk system and Apiiro's 2025 analysis both converge on the same conclusion: without active verification loops, AI-generated code volume scales faster than quality can be maintained. The verification layer is what makes volume sustainable.

L3
Context & Memory
Treating context as a curated, perishable resource

Context is finite and expensive. A harness engineers context as a first-class resource, not an afterthought. The two failure modes are symmetric: too little context (the agent lacks what it needs) and too much context (the agent's attention is diluted, or "context anxiety" sets in as the window fills).

Anthropic identified "context anxiety" as a distinct failure mode: as the context window approaches its limit, models begin rushing toward completion, cutting corners, and making lower-quality decisions to avoid running out of space. The harness fix is proactive context compaction and structured checkpointing — not just hoping the model stays rational.

Critical: The 75% Rule

Trigger proactive context compaction at 75% fill, not 95%. At 95%, context anxiety is already affecting output quality. Compaction at 75% gives the model enough breathing room to reason clearly about summarization — and produces better summaries as a result.
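
As a rough illustration of the rule, the trigger is a one-line ratio check owned by the harness. The function names, the `summarize` callback, and the `keep_recent` parameter are assumptions for this sketch, not part of any published harness.

```python
from typing import Callable


def should_compact(used_tokens: int, window_tokens: int, threshold: float = 0.75) -> bool:
    """Return True once the context window is at or past the fill threshold."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens >= threshold


def compact(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
) -> list[str]:
    """Replace older messages with one summary; keep recent ones verbatim."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent
```

For a 200,000-token window the check first fires at 150,000 tokens used. In practice `summarize` would be a model call; here it is any function from a list of messages to a string.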

The canonical pattern is a short AGENTS.md or CLAUDE.md file (~100 lines) as the entry point — not an encyclopedia but a table of contents, with pointers to deeper docs that get pulled on demand. The todo.md pattern has the agent re-read its plan at each step, preventing goal drift across a long task. These aren't prompt tricks. They're harness conventions.

L4
Guardrails & Permissions
Deterministic constraints over probabilistic compliance

The core architectural distinction of harness engineering: telling an agent "follow our coding standards" in a prompt is fundamentally different from wiring a linter that blocks the PR when standards are violated. The first relies on probabilistic compliance. The second enforces deterministic constraints.

Guardrails operate at two levels. Feedforward controls (constraint harnesses) reduce the agent's solution space before generation begins — rules files, type systems, architectural lint configurations, permission boundaries. Feedback controls (enforcement harnesses) detect violations after the fact and trigger structured error recovery loops.

Design: Reversibility-Weighted Approval

Every planned action gets a reversibility score (1–5). Score 1 (writing a local file) requires no approval. Score 3 (modifying a shared service config) requires soft confirmation. Score 5 (deleting data, making external API calls with side effects) requires explicit human approval regardless of context. The harness enforces this scoring, not the model's self-assessment.
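
A small sketch of how the harness, rather than the model, might own this mapping. The action names in `ACTION_SCORES` are hypothetical. The text defines only scores 1, 3, and 5, so treating 2 like 3 and 4 like 5 is an assumption of this example, chosen to err on the side of more oversight.

```python
from enum import Enum


class Approval(Enum):
    NONE = "none"
    SOFT = "soft_confirmation"
    HUMAN = "explicit_human_approval"


# Hypothetical action catalogue; scores are assigned by the harness, not the model.
ACTION_SCORES = {
    "write_local_file": 1,
    "modify_shared_config": 3,
    "delete_data": 5,
}


def required_approval(score: int) -> Approval:
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if score >= 4:      # assumption: 4 rounds up to the score-5 rule
        return Approval.HUMAN
    if score >= 2:      # assumption: 2 rounds up to the score-3 rule
        return Approval.SOFT
    return Approval.NONE


def approval_for(action: str) -> Approval:
    # Unknown actions are treated as irreversible until someone scores them.
    return required_approval(ACTION_SCORES.get(action, 5))
```

Defaulting unscored actions to the highest tier means a new tool cannot silently bypass approval just because nobody has classified it yet.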

Anthropic's two-stage classifier design for Claude Code is instructive: a fast single-token gate first, chain-of-thought reasoning only on flagged actions. This avoids "approval fatigue" (users approving 93% of prompts reflexively makes approvals meaningless) while maintaining real oversight for genuinely high-risk actions.

L5
Observability
You cannot improve what you cannot see

Observability in a harness context means full session replay, per-turn token attribution, subagent execution trees, and complete cost tracking. It spans all other layers — every tool call, every context load, every verification result, every guardrail trigger — in a structured, queryable trace.

Without this, debugging is archaeology. Teams know something went wrong; they don't know which layer failed, which tool call was the inflection point, or why the agent took the path it did. With it, failure analysis becomes systematic.
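
To make "which layer failed" answerable, each turn can emit one structured record. The sketch below is illustrative: `TurnTrace` and its field names are hypothetical, and the layer set mirrors the four layers named for failure attribution in the component list earlier.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional

LAYERS = {"orchestration", "verification", "context", "guardrail"}


@dataclass
class TurnTrace:
    turn_id: int
    tool_calls: list[str]
    tokens_in: int
    tokens_out: int
    outcome: str                          # "ok" or "failed"
    failure_layer: Optional[str] = None   # required when outcome is "failed"

    def __post_init__(self) -> None:
        # Refuse bare "agent error" records: a failure must name its layer.
        if self.outcome == "failed" and self.failure_layer not in LAYERS:
            raise ValueError("failed turns must name the layer that failed")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def failures_by_layer(traces: list[TurnTrace]) -> dict[str, int]:
    """Aggregate failures per layer across a session or many sessions."""
    return dict(Counter(t.failure_layer for t in traces if t.outcome == "failed"))
```

One JSON line per turn is enough to support both per-session replay and the cross-session aggregation that surfaces systemic harness gaps.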

Emerging Standard: The Unified Observability Plane

AgentOps, Future AGI, and Anthropic's managed agents all converge on the same architecture: a unified plane that collapses tracing, evals, cost tracking, and guardrail telemetry into a single feedback loop. The goal is to close the distance from "agent failure" to "harness fix" from days to hours to (eventually) minutes.

LangChain's improvement from Top 30 to Top 5 on Terminal Bench 2.0 was driven significantly by using LangSmith tracing at scale to identify failure modes and iteratively optimize the harness — not by intuition, but by data from the observability layer.

L6
State Engine & Persistence
Making multi-session work tractable

Context windows are ephemeral. A state engine persists what matters across turns and sessions — structured checkpoints, decision logs, task state, and environmental snapshots. Without it, every new session starts from scratch, making long-horizon autonomous work impossible.

The three-tier memory model is emerging as the standard: working memory (current turn context), session memory (structured state for the current task), and long-term memory (cross-session knowledge about the environment, conventions, and past decisions).

Pattern: Decision Log Anti-Pattern

One of the most underappreciated failure modes: an agent makes an architectural decision in session one, and in session three revisits and reverses it — because the decision wasn't persisted. Decision logs that record not just what was decided but why prevent this regression and eliminate wasted cycles relitigating settled choices.
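
A decision log can be as simple as an append-only JSON Lines file that the agent consults before reopening a question. The `DecisionLog` class, the file name, and the sample entry below are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class DecisionLog:
    """Append-only log of decisions and their rationale, one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, topic: str, decision: str, rationale: str) -> None:
        entry = {"topic": topic, "decision": decision, "rationale": rationale}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def lookup(self, topic: str) -> Optional[dict]:
        """Return the most recent decision on a topic, or None if never decided."""
        if not self.path.exists():
            return None
        latest = None
        for line in self.path.read_text(encoding="utf-8").splitlines():
            entry = json.loads(line)
            if entry["topic"] == topic:
                latest = entry
        return latest


# Session one records a choice; a later session checks before revisiting it.
log = DecisionLog(Path(tempfile.mkdtemp()) / "decisions.jsonl")
log.record("data-access", "use raw SQL", "ORM hid N+1 queries during session 1")
```

Because the rationale is stored next to the decision, a later session can tell whether the original reasoning still holds instead of reversing the choice blindly.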

Anthropic's Managed Agents (April 2026) virtualizes three components: the session (a durable event log outside the context window), the sandbox (a disposable container where code runs), and the harness loop itself. This pattern is quickly becoming the reference architecture for stateful agent systems.

Named failure modes. Harness-level fixes.

Anthropic's research and OpenAI's field report both identified systematic failure modes that are inherent to how language models reason — but solvable at the harness level. Understanding them by name is the first step to fixing them architecturally.

FAILURE 01
● Model Bias

Victory Declaration Bias

Agents frequently mark tasks complete without verifying the outcome. Estimated to account for 30–40% of agent failures in production code tasks.

Harness Fix

Mandatory completion criterion check: a verification loop that inspects the actual artifact against the stated objective before the task is marked done. Blocked by CI if verification is skipped.

FAILURE 02
● Context Failure

Context Anxiety

As context fills toward the limit, models rush toward completion, cutting corners, making lower-quality decisions, and hallucinating resolutions.

Harness Fix

Proactive context compaction at the 75% fill threshold with structured checkpoint injection. The harness, not the model, decides when to compact.

FAILURE 03
● Planning Failure

One-Shotting Overreach

Agents try to solve entire problems in a single execution, producing undocumented tangles of interdependent changes that are impossible to review or roll back.

Harness Fix

Phase gates enforced by the harness: design phase → implementation phase → test phase → review phase. Phase transitions require explicit verification outputs.
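
A sketch of phase gating as a tiny state machine. The `PhaseGate` class is hypothetical, and "explicit verification output" is reduced here to a non-empty string; a real harness would check an actual artifact such as a test report.

```python
PHASES = ["design", "implementation", "test", "review"]


class PhaseGate:
    """Tracks the current phase and refuses transitions without evidence."""

    def __init__(self) -> None:
        self.index = 0
        self.evidence: dict[str, str] = {}

    @property
    def current(self) -> str:
        return PHASES[self.index]

    @property
    def done(self) -> bool:
        # Complete only when every phase has recorded verification output.
        return len(self.evidence) == len(PHASES)

    def advance(self, verification_output: str) -> str:
        if not verification_output.strip():
            raise PermissionError(
                f"cannot leave phase '{self.current}' without verification output"
            )
        self.evidence[self.current] = verification_output
        if self.index < len(PHASES) - 1:
            self.index += 1
        return self.current
```

The agent cannot jump from design to a finished change in one step: each transition is a separate, reviewable event with its own recorded evidence.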

FAILURE 04
● Loop Failure

Doom Loops

Agents get stuck in repetitive cycles — trying the same broken approach over and over. Token budgets exhaust. No useful work is produced.

Harness Fix

Middleware loop-detection hooks that identify identical or near-identical tool call sequences across turns, inject a circuit breaker, and redirect strategy before budget is exhausted.
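
One way such a hook might detect repetition is to look for a short pattern of tool calls repeated back to back. The `LoopDetector` class and its defaults are assumptions for this sketch, and it matches only exact repeats; near-identical calls would need a similarity measure on top.

```python
class LoopDetector:
    """Trips when the most recent calls are one short pattern repeated in a row."""

    def __init__(self, max_pattern: int = 3, max_repeats: int = 3) -> None:
        self.max_pattern = max_pattern    # longest cycle length to look for
        self.max_repeats = max_repeats    # repetitions that count as a loop
        self.history: list[tuple[str, str]] = []

    def observe(self, tool: str, args: str) -> bool:
        """Record a tool call; return True when the circuit breaker should trip."""
        self.history.append((tool, args))
        for size in range(1, self.max_pattern + 1):
            needed = size * self.max_repeats
            if len(self.history) < needed:
                continue
            tail = self.history[-needed:]
            pattern = tail[:size]
            # The tail is a loop if it is exactly `pattern` repeated end to end.
            if all(tail[i] == pattern[i % size] for i in range(needed)):
                return True
        return False
```

When `observe` returns True, the harness would stop dispatching, inject a message describing the repeated pattern, and ask the agent for a different strategy before the token budget is spent.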

FAILURE 05
● Search Failure

Grep-Spree

Without a structured environment map, agents burn massive tokens blindly searching for files or data — often finding what they need on the 15th attempt.

Harness Fix

A well-structured AGENTS.md with explicit directory conventions and a codebase map. Agents can navigate purposefully instead of exploring exhaustively.

FAILURE 06
● Drift Failure

Goal Drift

Over long multi-session tasks, agents lose track of the original objective, gradually drifting toward sub-goals that feel locally important but diverge from the actual target.

Harness Fix

The todo.md pattern: the agent re-reads its stated plan at each checkpoint, re-anchoring to the primary objective before proceeding to the next phase.

FAILURE 07
● Decision Failure

Autonomous Irreversibility

Agents make irreversible architectural decisions without the context to evaluate long-term consequences — decisions that look locally correct but create large downstream technical debt.

Harness Fix

Reversibility scoring applied to every planned action. Actions scored as hard to reverse trigger a decision.md checkpoint and human review before execution.

FAILURE 08
● Scope Failure

Helpful Overreach

Agents, in attempting to be maximally helpful, "fix" things beyond the stated scope — introducing unreviewed changes that break other functionality.

Harness Fix

Explicit scope boundaries in AGENTS.md combined with diff-gating: any change outside declared scope files triggers a scope violation warning and human confirmation.
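
Diff-gating reduces to comparing the changed file list against the declared scope. In this sketch the `out_of_scope` helper and the example patterns are hypothetical, and the changed-file list is assumed to come from the version control diff.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the declared scope patterns.

    Note: fnmatch-style `*` also matches `/`, so "src/auth/*" covers nested paths.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


def needs_confirmation(changed_files: list[str], allowed_patterns: list[str]) -> bool:
    """True when any change falls outside scope and a human should confirm."""
    return bool(out_of_scope(changed_files, allowed_patterns))
```

For a task scoped to `["src/auth/*", "tests/auth/*"]`, a diff that also touches `vendor/lib.c` returns that path as a violation, which is the signal for the warning and confirmation step.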

These aren't model quality problems. They're architectural absences. Better prompting addresses none of them. Each has a deterministic harness-level fix that makes the failure mode structurally harder to repeat — not just less likely in any given run.

Decision architecture: feedforward and feedback control.

Harness engineering borrows a precise framing from control theory. A well-designed harness uses both feedforward and feedback control — two complementary mechanisms that together produce robust agent behavior.

⬆️

Feedforward Control

Constraint Harnesses — Before generation

Reduce the agent's solution space before any output is generated. The agent cannot produce certain classes of error because the harness prevents those paths from being accessible.

  • Rules files and architectural lint configurations (hard constraints on patterns the agent may use)
  • Type systems and schema validation (invalid types become unrepresentable)
  • Permission boundaries (tools not in the registry cannot be called)
  • AGENTS.md conventions (directory, naming, and scope rules)
  • Scope boundary declarations (explicit allowed-file manifests)
🔁

Feedback Control

Enforcement Harnesses — After generation

Detect violations in the agent's output and trigger structured correction loops. The agent can still make mistakes; the harness ensures those mistakes are caught and corrected before they become permanent.

  • Post-generation linting with structured error payloads fed back to the agent
  • Test suite execution with failure context as agent input
  • Scope violation detection with diff analysis
  • Loop detection middleware that identifies repetition patterns
  • Victory declaration interceptors that require actual verification

The optimal harness uses both. Feedforward controls are more efficient — they prevent errors entirely. But they can't cover every case. Feedback controls are your safety net — they catch what feedforward missed. Neither alone is sufficient. Together, they create defense in depth against every known agent failure mode.

The OWASP LLM06:2025 "Excessive Agency" risk framing provides the checklist for feedforward control: over-provisioned functions, unnecessary permissions, and missing approval mechanisms are the attack surface. The harness is the defense. Every entry in that checklist maps directly to a harness engineering decision.

Same model. Radically different agent.

The most cited empirical proof of harness engineering's impact is LangChain's Terminal Bench 2.0 results in early 2026. They ran a controlled experiment: improve the agent's performance without changing the underlying model. The result settled the debate.

Terminal Bench 2.0: LangChain deepagents-cli (controlled comparison)
Model: gpt-5.2-codex (unchanged across both runs)

| Run | Score | Rank |
| --- | --- | --- |
| Before harness work | 52.8% | Top 30 |
| After harness work | 66.5% | Top 5 (+25 positions) |

Harness changes:
  • System prompts emphasizing self-verification loops — agents check their own work before declaring completion (targets Victory Declaration Bias)
  • Enhanced context injection — structured environment maps so agents understand their working directory without blind searching (targets Grep-Spree)
  • Middleware loop-detection hooks — circuit breakers that fire when the agent enters a repetitive pattern and redirect strategy (targets Doom Loops)
Notable: running at maximum reasoning budget scored lower (53.9%) than the tuned harness run (66.5%) because of timeout failures. More model compute does not substitute for harness quality.

The independent validation from Ewan Mak's team reinforces this: working with a financial services client's codebase, the same Claude Sonnet 4.6 model went from a 58% pass rate to 81% — purely through harness changes. No weight updates. No model swap. Two weeks of harness engineering: rewriting the system prompt for monorepo layout, adding subagent delegation, wiring linter output back as middleware observations.

  • Model swap (same harness): ~15%
  • Harness optimization only: ~70%
  • Model + harness together: ~92%
  • More reasoning budget (no harness): ~8%

Indicative ranges from 2026 practitioner reports. Exact gains depend on baseline harness quality and task domain.

Every line of this file is a past failure.

Mitchell Hashimoto's Ghostty terminal repository became the canonical reference implementation. Its AGENTS.md file is a living artifact: each rule corresponds to a specific past agent failure that's now prevented. The file grows incrementally. Each addition makes the failure mode structurally harder to repeat.

# AGENTS.md — Ghostty pattern (reconstructed with annotations)
# This file is version-controlled. Every rule maps to a specific past failure.
# Rule additions go through PR review. Deletions require explicit justification.

## Environment Setup
- Build with: `zig build` (not cmake, not make, not any other build tool)
  # ← Agent used cmake on first unguided run. Now physically impossible to mistake.
- Test with: `zig build test` before marking any task complete
  # ← Agent skipped tests on sessions 3 and 7. Now CI gate — PR cannot merge without test proof.
- Never modify files in `vendor/` or `deps/` directories
  # ← Agent modified vendored dep. Broke reproducibility for the whole team.
- If build fails, read the error fully before retrying. Do not retry without understanding.
  # ← Agent blind-retried 11 times on a config error that the first error message described clearly.

## Code Conventions
- All public functions must have doc comments — no exceptions
  # ← Agent shipped 14 undocumented public APIs in one PR. Team couldn't onboard to them.
- `catch unreachable` is only valid in test files. Never in production paths.
  # ← Agent used catch unreachable in a production error handler. Panic in prod.
- Do not use `@import("std").log` in library code. Use the logger abstraction.
  # ← Added after agent bypassed the structured logging layer. Lost structured fields.

## Scope Control (Failure: One-Shotting Overreach)
- Maximum 300 lines changed per PR. Split larger changes.
  # ← Agent one-shotted a 2,100-line change. Physically impossible to review meaningfully.
- Structure every feature as: design PR → implementation PR → test PR
  # ← Agent mixed design decisions and implementation. Created unmaintainable coupling.
- If a change touches more than 5 files, create an architecture note first.
  # ← Agent made broad structural changes across 23 files without a design doc.

## Verification (Failure: Victory Declaration Bias)
- Run `zig build test` AND `zig build lint` before marking any task done.
  # ← Victory declaration bias. Now a blocking CI requirement, not a request.
- If a test fails, diagnose before fixing. Write the diagnosis as a comment.
  # ← Agent pattern-matched fixes without understanding failures. Masked bugs.
- If unsure about a decision, create `docs/decisions/DECISION-YYYYMMDD.md` and stop.
  # ← Agent made irreversible architectural choices. Decision log now prevents this.

## Context (Failure: Goal Drift)
- Re-read `todo.md` at the start of each session and before each phase transition.
  # ← Agent drifted from primary objective across 3 sessions. todo.md re-anchor prevents this.
- If context is > 75% full, summarize state to `checkpoints/` before continuing.
  # ← Context anxiety caused agent to hallucinate a resolution at 92% context fill.

This isn't a system prompt. It's a harness configuration document. The difference matters: a system prompt is probabilistic guidance. An AGENTS.md wired to CI gates is a deterministic constraint. One is advisory. The other is structural.
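To make "wired to CI gates" concrete, the two scope rules from the file (300 changed lines per PR, an architecture note above 5 files) could be enforced by a script that reads `git diff --numstat` output, which lists added lines, deleted lines, and path separated by tabs, with `-` in both count columns for binary files. The sketch below is illustrative only: `check_scope` and the `has_architecture_note` flag are hypothetical, and a real gate would exit non-zero when violations are returned.

```python
from typing import List

MAX_CHANGED_LINES = 300      # from the Scope Control rule in the file above
MAX_FILES_WITHOUT_NOTE = 5   # beyond this, an architecture note is required


def check_scope(numstat: str, has_architecture_note: bool = False) -> List[str]:
    """Return scope violations for the output of `git diff --numstat`."""
    files = 0
    changed = 0
    for line in numstat.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        files += 1
        # Binary files report "-" for both counts; count them as files only.
        if added != "-":
            changed += int(added) + int(deleted)
    violations = []
    if changed > MAX_CHANGED_LINES:
        violations.append(f"{changed} lines changed; limit is {MAX_CHANGED_LINES}")
    if files > MAX_FILES_WITHOUT_NOTE and not has_architecture_note:
        violations.append(f"{files} files touched without an architecture note")
    return violations


# Sample numstat text; in CI this would come from the actual diff.
sample = "120\t30\tsrc/a.zig\n200\t10\tsrc/b.zig\n-\t-\tassets/logo.png"
violations = check_scope(sample)
print(violations)
```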

The pattern scales. OpenAI's team enforced "taste invariants" — a small set of rules encoding team engineering standards — as hard CI failures, not warnings. The harness grows each time an agent makes a novel mistake. The team gets smarter without the model changing.

An AGENTS.md that hasn't changed in two weeks isn't mature — it's either perfect (unlikely) or the team isn't learning from agent failures (more likely). A healthy AGENTS.md grows with every novel failure. The rate of new rule additions is a leading indicator of how actively the team is practicing harness engineering.

Building a production harness: the implementation sequence.

Harness engineering is not a monolithic project. It's an incremental discipline. The following sequence reflects what teams building production agent systems in 2026 found actually works — ordered by leverage, not complexity.

WEEK 1–2 (Start here)
Foundation: AGENTS.md + Basic Verification. Write the initial AGENTS.md from your first agent session. Every observed failure becomes a rule. Wire basic linting as a blocking CI gate. These two steps alone eliminate 40–50% of recurring failures.

WEEK 3–4 (L4)
Control: Guardrails + Permission Tiers. Define the tool registry explicitly. Implement reversibility scoring. Set up the two-stage approval classifier for high-stakes actions. Map your compliance requirements to feedforward constraints.

WEEK 5–8 (L3+L6)
Memory: Context Management + State Engine. Implement the 75% compaction trigger. Set up todo.md re-anchor pattern. Build phase-gate checkpointing. Add decision log infrastructure. Begin tiered context loading.

WEEK 9–12 (L5)
Visibility: Observability Plane. Instrument structured traces across all layers. Build session replay capability. Set up cost attribution per task type. Connect the failure-to-harness-improvement feedback loop.

MONTH 4+ (Advanced)
Scale: Multi-Agent + Domain Evals. Build orchestrator patterns for parallel work. Develop custom evaluation datasets from your actual task distribution. Begin automated harness improvement from observability signals.
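The 75% compaction trigger and the todo.md re-anchor from the Week 5–8 step can be sketched together. Everything here is a simplifying assumption: `Session` is hypothetical, token counts are approximated by whitespace-separated words, and the "summary" is a placeholder string where a real harness would ask the model to summarize state.

```python
from dataclasses import dataclass, field
from typing import List

COMPACTION_THRESHOLD = 0.75  # compact once the context is more than 75% full


@dataclass
class Session:
    context_limit: int
    messages: List[str] = field(default_factory=list)
    checkpoints: List[str] = field(default_factory=list)

    def used_tokens(self) -> int:
        # Crude stand-in for a tokenizer: one token per whitespace-separated word.
        return sum(len(m.split()) for m in self.messages)

    def add(self, message: str, todo: str) -> bool:
        """Append a message; compact and re-anchor on the todo if over threshold."""
        self.messages.append(message)
        if self.used_tokens() / self.context_limit <= COMPACTION_THRESHOLD:
            return False
        summary = f"summary of {len(self.messages)} messages"  # placeholder summary
        self.checkpoints.append(summary)
        # Re-anchor: the todo goes first, followed by the checkpoint summary.
        self.messages = [todo, summary]
        return True


todo = "TODO: ship feature X"
session = Session(context_limit=20)
compacted = [session.add("one two three four five six", todo) for _ in range(3)]
print(compacted, session.messages)
```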

The most common mistake: trying to build all six layers simultaneously before any layer works well. A deeply engineered L1–L2 (orchestration + verification) produces more reliable agents than a thin implementation of all six. Depth on the first two layers is the highest-leverage starting point.

A practical harness audit checklist — the questions to ask before declaring a harness production-ready:

Multi-agent harnesses: orchestrating agent fleets.

Single-agent harnesses are the foundation. The emerging frontier is harnesses that coordinate multiple specialized agents working in parallel on shared tasks — and the harness complexity grows non-linearly with agent count.

🎯

Orchestrator Pattern

One agent manages many

An orchestrator agent breaks down complex tasks, delegates subtasks to specialized workers, and aggregates results. The harness must manage inter-agent communication, state sharing, and conflict resolution.

Anthropic's three-agent harness study — Planner, Generator, Evaluator — on a 2D retro game engine task produced significantly better results than a solo unconstrained agent on the same task and model. The structured hand-offs and explicit evaluator role removed the victory declaration bias from the generation agent.

  • Planner receives the goal and produces a structured subtask manifest
  • Generator executes individual subtasks against the manifest
  • Evaluator verifies each output before the Planner advances
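The hand-off structure in the three bullets above can be sketched as a plain control loop. This is not Anthropic's implementation: `run_pipeline` and the three stub roles are hypothetical stand-ins for model-backed agents, and the retry limit is an assumed parameter.

```python
from typing import Callable, List


def run_pipeline(goal: str,
                 plan: Callable[[str], List[str]],
                 generate: Callable[[str, int], str],
                 evaluate: Callable[[str, str], bool],
                 max_attempts: int = 2) -> List[str]:
    """Advance to the next subtask only after the evaluator accepts an output."""
    accepted = []
    for subtask in plan(goal):
        for attempt in range(1, max_attempts + 1):
            output = generate(subtask, attempt)
            if evaluate(subtask, output):
                accepted.append(output)
                break
        else:
            # No attempt was accepted, so the pipeline stops instead of moving on.
            raise RuntimeError(f"subtask not verified: {subtask}")
    return accepted


# Stub roles for illustration only.
def plan(goal: str) -> List[str]:
    return [f"{goal}: step {i}" for i in (1, 2)]


def generate(subtask: str, attempt: int) -> str:
    return f"{subtask} (attempt {attempt})"


def evaluate(subtask: str, output: str) -> bool:
    # Step 1 passes immediately; step 2 only passes on its second attempt.
    return "step 1" in subtask or "attempt 2" in output


results = run_pipeline("game", plan, generate, evaluate)
print(results)
```

Because the generator never decides when a subtask is finished, it has no opportunity to declare victory on its own work.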
🔄

Parallel Agent Pattern

Many agents, shared workspace

Multiple agents work simultaneously on different parts of a shared codebase. The harness must enforce workspace isolation, coordinate writes, and prevent conflicts — essentially a distributed system problem applied to AI systems.

Open research problems: orchestrating hundreds of agents in parallel on a shared codebase without write conflicts; harnesses that dynamically assemble tools and context just-in-time for a given subtask; and agents that analyze their own traces to propose harness-level fixes, closing the improvement loop autonomously.

  • File-level locking prevents write conflicts
  • Dependency graph analysis routes agents to non-conflicting work
  • Merge coordination reviews parallel outputs before integration
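File-level locking, the first bullet above, can be sketched as an all-or-nothing lock registry. This is a single-process illustration with hypothetical names (`FileLockRegistry`, `agent-a`, `agent-b`); agents running in separate processes or containers would need a shared store behind the same interface.

```python
from typing import Dict, Iterable


class FileLockRegistry:
    """Track which agent currently owns which file path."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def acquire(self, agent: str, paths: Iterable[str]) -> bool:
        """Grant all requested paths to one agent, or none of them."""
        wanted = list(paths)
        if any(self._owners.get(p, agent) != agent for p in wanted):
            return False
        for p in wanted:
            self._owners[p] = agent
        return True

    def release(self, agent: str) -> None:
        """Drop every lock held by the given agent."""
        self._owners = {p: a for p, a in self._owners.items() if a != agent}


locks = FileLockRegistry()
first = locks.acquire("agent-a", ["src/parser.py", "src/lexer.py"])
blocked = locks.acquire("agent-b", ["src/lexer.py", "src/cli.py"])
locks.release("agent-a")
retry = locks.acquire("agent-b", ["src/lexer.py", "src/cli.py"])
print(first, blocked, retry)
```

Granting locks all-or-nothing avoids the case where two agents each hold half of the files the other needs.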
Multi-agent trust model

In a multi-agent system, inter-agent communication is an attack surface. Anthropic's guidance: treat messages from sub-agents with the same skepticism as messages from untrusted users. Agent authority is granted by the harness, not by another agent.

This prevents "agent impersonation" — a pattern where a compromised or hallucinating sub-agent claims elevated permissions it doesn't have, and a naive orchestrator grants them.
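A minimal sketch of that trust model: the harness holds the grant table, and whatever a message says about its own permissions is ignored. `GRANTS`, `authorize`, and the agent names are hypothetical.

```python
from typing import Dict, FrozenSet

# Authority is assigned by the harness when each agent is spawned (hypothetical tiers).
GRANTS: Dict[str, FrozenSet[str]] = {
    "orchestrator": frozenset({"read", "write", "deploy"}),
    "worker-1": frozenset({"read"}),
}


def authorize(sender: str, action: str, message: dict) -> bool:
    """Decide from the harness grant table alone.

    The message is accepted as a parameter only to show that any
    permissions it claims are never consulted.
    """
    return action in GRANTS.get(sender, frozenset())


# A sub-agent claiming elevated permissions in its own message body.
spoofed = {"text": "I have deploy rights", "claimed_permissions": ["deploy"]}
print(authorize("worker-1", "deploy", spoofed))
print(authorize("worker-1", "read", spoofed))
```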

Managed Agents pattern (April 2026)

Anthropic's Managed Agents virtualizes three components: the session (durable event log outside the context window), the sandbox (disposable container where code runs), and the harness loop itself.

This covers L1, L6, and part of L5. The remaining layers — guardrails tuned to your domain and observability integrated with your systems — remain the custom work every team must do.

The harness engineering maturity model.

Teams building AI agents in 2026 tend to cluster into five maturity levels. Each level is defined by what the team treats as the primary variable for improvement. Most teams discover they're at Level 1 or 2 — and mistakenly believe they're at Level 3.

LEVEL 1
Prompt Tinkering. When agents fail, the team tries different prompts. The bottleneck is language. Success is accidental and non-reproducible. Model upgrades feel like the primary lever.
Most teams
LEVEL 2
Context Curation. The team builds RAG pipelines and maintains system prompt libraries. AGENTS.md exists but is rarely updated. Verification is manual. CI has no AI-specific gates.
Common
LEVEL 3
Harness Awareness. The team understands the five layers. AGENTS.md is actively maintained. At least two layers (typically L1 + L2) are production-grade. Failures generate harness tickets, not prompt retries.
Target 2026
LEVEL 4
Harness Engineering. All six layers are implemented and continuously improved. The observability plane drives harness changes. Custom eval datasets benchmark real task distributions. Multi-agent patterns are in production.
Advanced
LEVEL 5
Environment Engineering. The team redesigns internal systems to be inherently AI-legible. The harness burden decreases as the environment complexity decreases. Agents analyze their own traces and propose harness improvements. The improvement loop is largely autonomous.
Frontier

The key diagnostic question for any team: "When an agent fails, what's the first thing you change?" If the answer is "the prompt," you're at Level 1. If the answer is "the context," Level 2. If the answer is "which layer of the harness is missing or broken," you're practicing harness engineering. The question itself is the signal.

Prebuilt harnesses vs. custom harnesses.

Most AI coding agents ship with a default harness already built in. Claude Code is the clearest example: file read/write tools, terminal command execution, a multi-step execution loop, permission controls that prompt for human approval before risky actions. That default harness is what makes it an agent rather than a chatbot.

📦

Prebuilt Harness

Claude Code, Cursor, Copilot Workspace

General-purpose execution capability out of the box. Tools, basic orchestration, a default permission model, standard verification. This is the commodity layer.

Strengths:
  • Production-ready for general tasks immediately
  • MCP extension points for custom tools
  • Maintained and updated by the platform
  • Integrated observability (LangSmith, Anthropic traces)
  • Battle-tested across millions of agent sessions

Limitations:
  • No knowledge of your domain constraints or conventions
  • No enforcement of your compliance requirements
  • No integration with your internal audit systems
  • Generic verification that doesn't know your test suite structure
🔧

Custom Harness Layer

Your organizational accountability layer

Domain-specific tools, custom evals, permission models tuned to your risk posture. Built on top of the prebuilt platform, not instead of it. This is the moat.

What it adds:
  • Proprietary compliance linters enforced as CI gates
  • Audit logging to internal regulatory systems
  • Domain-specific tool registry and permission tiers
  • Custom evals benchmarked on your actual task distribution
  • AGENTS.md encoding your team's accumulated failure-learning

Costs:
  • Requires 4–12 weeks to build a production-grade layer
  • Ongoing maintenance as the prebuilt platform evolves
  • Requires dedicated harness engineering investment
Decision                         | Buy (Prebuilt) | Build (Custom)
Tool orchestration runtime       | ✓ Buy          |
Basic permission controls        | ✓ Buy          |
Managed agent sandboxing         | ✓ Buy          |
General observability telemetry  | ✓ Buy          |
Domain-specific tool registry    |                | ✓ Build
Compliance-specific guardrails   |                | ✓ Build
Internal audit log integration   |                | ✓ Build
AGENTS.md (living document)      |                | ✓ Build
Custom evaluation datasets       |                | ✓ Build

The strategic recommendation from the 2026 landscape: buy the commodity plumbing (managed runtimes, basic telemetry, control planes) and build the proprietary integrations — domain-specific tools, custom evaluation datasets, internal environment maps, compliance-specific guardrails. The prebuilt harness provides general-purpose reliability. The custom layer provides organizational accountability. Both are necessary. Neither replaces the other.

Environment Engineering: inverting the problem.

The strategic horizon emerging in mid-2026 inverts the entire framing. Instead of building smarter harnesses to navigate messy legacy systems, forward-looking organizations are re-architecting their internal APIs, codebases, and databases to be inherently legible to AI agents.

The harness cost reveals the environment cost. If your codebase requires a sophisticated harness to navigate, that's a signal that the codebase itself has complexity debt.

— Emerging principle, AI Platform Engineering teams, 2026

This is Environment Engineering, and it's the next competitive frontier. Clean, well-documented, consistently structured systems need less harness scaffolding to produce reliable agent work. The relationship is consistent: harness complexity is a proxy for environment complexity.

Environment Engineering patterns

Consistent naming conventions enforced by linters so agents can infer structure without searching. Explicit API contracts maintained as machine-readable specs. Modular architectures that create natural subtask boundaries.

Test suites designed for agents — rapid, unambiguous feedback. Fast unit test cycles rather than slow integration tests as the primary verification signal.
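As one small example of a naming convention enforced by a linter, a check like the following could run in CI so that agents can rely on file names following a single pattern. The snake_case rule, the `naming_violations` helper, and the sample paths are assumptions chosen for illustration; a team would substitute its own convention.

```python
import re
from typing import Iterable, List

# Assumed convention: Python module file names are lowercase snake_case.
SNAKE_CASE_MODULE = re.compile(r"^[a-z][a-z0-9_]*\.py$")


def naming_violations(paths: Iterable[str]) -> List[str]:
    """Return the paths whose file name breaks the naming convention."""
    return [p for p in paths
            if not SNAKE_CASE_MODULE.match(p.rsplit("/", 1)[-1])]


bad = naming_violations([
    "billing/invoice_service.py",
    "billing/InvoiceHelper.py",
    "billing/tax-rules.py",
])
print(bad)
```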

CNCF's four pillars

The CNCF's four pillars of platform control — golden paths, guardrails, safety nets, and manual review — are emerging as the design principles that every production harness will implement.

The convergence of CI/CD tooling, agent harnesses, and platform engineering into a unified discipline is the trajectory of the field. Platform engineering becomes AI infrastructure engineering.

The long-term trajectory: organizations that invest in environment engineering today are building a compounding advantage. A cleaner environment requires less harness, which reduces engineering cost, which allows more agent capacity for productive work, which accelerates the flywheel. Harness quality today determines environment quality tomorrow.

The bottom line

The model is table stakes.
The harness is the moat.

Every team building AI agents in 2026 is making a strategic bet. Most are optimizing for the model. The teams that will win are optimizing for the harness.

  • 88% of enterprise agents never reach production
  • +25 benchmark rank positions, zero model changes
  • ~85% of performance variance is the harness, not the model

  • Verification loops: catch failures during execution, not after 40 interdependent changes have already compounded.
  • Deterministic guardrails: enforce standards through CI gates and linters, not probabilistic prompt guidance that agents can drift from.
  • Structured observability: turn every failure into a permanent harness fix. Each AGENTS.md rule is a past mistake that can't recur.

Humans steer. Agents execute. But the harness is what decides whether that steering produces reliable results — or just faster failures at scale.

— Haroon K M, May 2026

Primary sources