Agent Harnessing: From Prompting to Harness Engineering

TL;DR — The Five Claims of This Article

This is a technical deep-dive into harness engineering — the discipline formalized in February 2026 that treats the environment around a language model as the primary engineering variable for AI agents.

The model accounts for at most 15% of production performance variance. The harness accounts for the rest.
All major agent failure modes (victory bias, context anxiety, doom loops, grep-spree, goal drift) are harness-solvable problems — not model limitations.
A production harness has six layers: tool orchestration, verification loops, context & memory, guardrails & permissions, observability, and state persistence. Missing any one creates systematic failures.
LangChain's Terminal Bench 2.0 result (+25 rank positions, zero model changes) is the controlled proof. Other practitioner reports point the same way.
The next frontier is Environment Engineering: redesigning internal systems to be inherently AI-legible, reducing the harness burden rather than building around complexity debt.

Why this matters now

88% of enterprise AI agents never make it to production.

That number, cited across multiple 2026 AI engineering reports, is jarring. Not 20%. Not 40%. Nearly nine out of ten. And the failure isn't what you'd expect. It isn't the model. The model can write code, plan tasks, use tools. The model isn't the problem.

The problem is everything else. Context rot. Absent guardrails. No state persistence. Feedback loops that never close. Agents that declare victory without checking their work. Systems that run beautifully in a demo and catastrophically in production.

The dominant assumption from 2023 through most of 2025 was that the model itself was the determining variable. Teams optimized around the next model release. By late 2025, that assumption had quietly collapsed. The performance gap between frontier models from different labs had narrowed substantially. The gap between the same model with and without a well-designed harness had widened dramatically. The harness became the actual variable.

In February 2026, three converging publications crystallized what practitioners had been building without a shared vocabulary: Mitchell Hashimoto's personal blog post, Ryan Lopopolo's field report from OpenAI, and a cascade from LangChain, Anthropic, and Thoughtworks. A discipline was named: Harness Engineering.

What changed in 2025–2026

Frontier model capability gaps narrowed. GPT-4-class intelligence became commoditized. The marginal return on a model upgrade dropped; the marginal return on harness investment rose sharply.

Result: engineering effort shifted from prompt crafting to environment design — and teams that made that shift earlier outperformed those still optimizing prompts.

What "harness" means precisely

Everything in an AI agent except the model itself. The tools it can use, the permissions it's granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures.

Source: Birgitta Böckeler, Thoughtworks / Martin Fowler, April 2026.

The evolution

Three eras. One trajectory.

AI engineering maturity has moved through three distinct phases, each one revealing a new bottleneck that the previous phase couldn't solve.

💬

2022 – 2023 Bottleneck: Language

Prompt Engineering: Talking to the Model

The early discovery: phrasing matters enormously. "Write a function that..." produces different output than "You are a senior engineer. Implement a function that...". The whole discipline focused on language as the lever.

What it is

Crafting the instruction

Single-turn optimization. One question, one answer. Role-playing, few-shot examples, chain-of-thought cues. Engineers competed to find the magic phrase.

What it produced

Prompt libraries & templates

Teams maintained "golden prompts." Performance was fragile — a model update could break a carefully tuned prompt. The knowledge was linguistic, not architectural.

Where it worked

Single-turn tasks

Drafting, summarizing, simple Q&A. Tasks that fit in one exchange with stable, small context worked well. Multi-step, state-dependent, and domain-specific tasks broke it.

Where it broke down

Context became the ceiling

Once models got more capable, the bottleneck shifted from how you ask to what information the model has. Better prompts on the wrong context = better-worded wrong answers.

🧠

2024 – 2025 Bottleneck: Information

Context Engineering: Teaching the Model What to Know

Popularized by Andrej Karpathy in mid-2025, though practised much earlier. The focus shifted from wording to curation: what files, rules, schemas, and constraints go into the model's context window?

What it is

Curating what the model sees

RAG pipelines, MCP servers, structured documentation, codebase maps. Engineers stopped asking "how do I say this?" and started asking "what does the model need to know?"

Key techniques

Dynamic context loading

Semantic retrieval, structured system prompts, CLAUDE.md / AGENTS.md files, per-tool schema injection. Context became a curated artifact, not an afterthought.

Where it worked

Domain-aware reasoning

Coding assistants with codebase context, customer service bots with product knowledge, research agents with document retrieval. Quality improved dramatically with the right context.

Where it broke down

Autonomy without control

Once models had the right information, the challenge became autonomy and reliability. Agents knowing the right thing didn't mean they did the right thing — or stopped when they should.

🏗️

2026 → Bottleneck: Control

Harness Engineering: Designing the Environment

Formalized by Mitchell Hashimoto in February 2026. The model is taken as given. The question is: what is the full system of constraints, feedback loops, tools, and verification that turns raw capability into reliable, autonomous production work?

What it is

Engineering the environment

Every agent failure becomes a permanent engineering fix. AGENTS.md grows. CI gates harden. Linters enforce. Verification loops close. The harness gets smarter each time an agent breaks.

The formula

Agent = Model + Harness

Coined by LangChain after OpenAI's field report: the model supplies intelligence, the harness makes that intelligence useful and reliable. Swap the harness; transform the agent.

Primary outputs

Autonomous end-to-end tasks

Full PRs, multi-session refactors, cross-system migrations. Work that previously required sustained human attention, delegated to agents with structured accountability.

The meta-skill

Harness reasoning

When an agent fails, the question is never "which prompt fixes this?" but "which layer is absent? Which guardrail is missing? Which loop doesn't close?" This is the reasoning mode of harness engineering.

Phase	Period	Core Question	Bottleneck Solved	Primary Output
Prompt Engineering	2022 – 2023	How do I phrase this?	Language	Code snippets, boilerplate
Context Engineering	2024 – 2025	What should the model know?	Information	Context-aware feature code
Harness Engineering	2026 →	What can the model do — and not do?	Control	Autonomous end-to-end tasks

The core insight

The model is not the variable.

This is the most important reframe in AI engineering in 2026. Swap the model for a competitor and output quality shifts perhaps 10–15%. Change the harness and you change whether the system works at all.

Model + Harness = Agent

The equation that defined AI engineering in 2026 — coined by LangChain, formalized from OpenAI's field report

Anytime you find an agent makes a mistake, you take the time to engineer a solution so that the agent never makes that mistake again.

— Mitchell Hashimoto, February 2026

Days later, Ryan Lopopolo from OpenAI's technical staff published the most cited field report in AI engineering that year. His team had spent five months building a production product with a single hard constraint: zero manually-written lines of code. Every function, every test, every CI script, every piece of documentation — all generated by Codex agents.

1M+ lines of code, zero written by hand

1,500 merged pull requests in 5 months

3–10× engineer throughput multiplier

The tagline: "Humans steer. Agents execute." Engineers stopped writing code. They started designing the environment that made reliable code generation possible. When something failed, the fix was almost never "write a better prompt." It was: "What capability, context, or structure is missing from the harness?"

The harness engineering definition from Martin Fowler's site (Birgitta Böckeler, April 2026) captures it precisely: "Everything in an AI agent except the model itself. The tools it can use, the permissions it's granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures."

Deep technical anatomy

Inside the harness: components, patterns, requirements.

A harness is not a monolithic object. It is a layered system of components, each with a distinct responsibility. The following explorer details what each component does, what patterns it uses, and what metrics signal its health.

Harness Component Explorer

L1 Orchestration: Tool Registry
L2 Correctness: Verification
L3 Memory: Context Mgmt
L4 Safety: Guardrails
L5 Visibility: Observability
L6 Persistence: State Engine

Tool Registry & Orchestration

The tool registry is the complete manifest of capabilities available to the agent. It determines what the agent can do. Orchestration governs how those capabilities get chained into coherent workflows.

4–5

Tools per bounded task

Retry tiers before escalation

<200ms

Tool dispatch overhead

Required Atomic tool design: each tool does one thing. Composite tools hide failure modes.
Required Typed tool schemas with JSON Schema validation. Reject malformed calls before execution.
Recommended Tool selection rationale logging — record why each tool was called for trace analysis.
Recommended Exponential backoff on transient failures with structured error payloads returned to the agent.
Advanced Dynamic tool loading: inject domain-specific tools only for the task type detected, not all tools for all tasks.
Advanced Tool budgets: cap maximum calls per tool per session to prevent runaway loops.
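
As a rough illustration of the typed-schema and tool-budget items above, here is a minimal Python sketch of a registry that rejects malformed calls before execution and caps calls per tool. Everything in it (`ToolRegistry`, `Tool`, `ToolCallRejected`, the `word_count` tool) is hypothetical, and the plain `isinstance` check is only a stand-in for real JSON Schema validation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised before execution when a call is malformed or over budget."""


@dataclass
class Tool:
    name: str
    params: dict[str, type]      # parameter name -> required Python type
    run: Callable[..., Any]
    max_calls: int = 10          # per-session budget (illustrative default)


@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)
    calls: dict[str, int] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def dispatch(self, name: str, args: dict[str, Any]) -> Any:
        tool = self.tools.get(name)
        if tool is None:
            raise ToolCallRejected(f"unknown tool: {name}")
        # Stand-in for schema validation: exact parameter set, exact types.
        if set(args) != set(tool.params):
            raise ToolCallRejected(f"{name}: expected params {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(args[key], expected):
                raise ToolCallRejected(f"{name}.{key}: expected {expected.__name__}")
        # Budget check happens last, so rejected calls do not consume budget.
        used = self.calls.get(name, 0)
        if used >= tool.max_calls:
            raise ToolCallRejected(f"{name}: budget of {tool.max_calls} calls exhausted")
        self.calls[name] = used + 1
        return tool.run(**args)


registry = ToolRegistry()
registry.register(
    Tool("word_count", {"text": str}, lambda text: len(text.split()), max_calls=2)
)
```

A rejected call raises before the tool runs, so the harness can return the exception message to the agent as a structured error instead of executing a bad call.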

Verification Loops

Verification that fires during execution — not just at the end. Intermediate checkpoints with structured pass/fail signals are what make self-correction practical. End-of-task verification alone is nearly useless for multi-step work.

±40%

Pass rate improvement, intermediate vs. terminal

Verification tiers (syntax, semantic, integration)

Required Completion criterion check: the agent must verify the stated objective before declaring done.
Required Syntax-level verification (linters, type checkers) as blocking gates — not advisory warnings.
Recommended Semantic verification: test suite execution with structured output fed back as agent context.
Recommended Regression lock: before a task completes, verify all pre-existing tests still pass.
Advanced Differential verification: compare the agent's stated change against actual diff to detect undisclosed modifications.
Advanced Cross-agent verification: route high-stakes outputs to a separate evaluator agent before committing.

Context Management

Context is finite and expensive. A harness engineers context as a first-class resource. The two failure modes are symmetric: too little (agent lacks what it needs) and too much (attention diluted, context anxiety sets in).

~100 lines

AGENTS.md target size

75%

Context fill threshold for compaction

todo.md

Goal re-anchor pattern

Required AGENTS.md / CLAUDE.md: a structured entry-point (~100 lines), not an encyclopedia. Table of contents with deep-link pointers.
Required Proactive compaction: summarize context at 75% fill threshold. Don't wait for anxiety to set in.
Recommended todo.md pattern: the agent re-reads its plan at each checkpoint, preventing goal drift over long tasks.
Recommended Tiered context loading: minimal system context always-on; domain context loaded on demand; session state reconstructed from checkpoints.
Advanced Semantic deduplication: detect and collapse redundant context before injection using embedding similarity.
Advanced Conversational memory distillation: after each session, extract and persist a structured summary for multi-session continuity.

Guardrails & Permissions

The core architectural distinction: telling an agent "follow our coding standards" in a prompt relies on probabilistic compliance. Wiring a linter that blocks the PR when standards are violated enforces deterministic constraints. One is advisory. The other is structural.

2-stage

Classifier design (fast gate + CoT)

<7%

Target human approval rate for non-trivial actions

Required Least-privilege tool access: the harness defines what tools exist, not the agent. Agent cannot self-provision capabilities.
Required Feedforward constraints: rules files, type systems, architectural lint — reducing solution space before generation.
Required Feedback enforcement: post-generation violation detection that triggers structured error recovery loops.
Recommended Two-stage approval gates: fast single-token classifier first; chain-of-thought reasoning only on flagged actions. Avoids approval fatigue.
Recommended Reversibility scoring: rate each planned action on a reversibility scale; require higher approval thresholds for irreversible operations.
Advanced Dynamic permission tiers: escalate permission requirements based on task complexity, session history, and detected confidence signals.

Observability

Full session replay, per-turn token attribution, subagent execution trees, and complete cost tracking. Every tool call, every context load, every verification result, every guardrail trigger — in a structured, queryable trace. Without this, debugging is archaeology.

5 planes

Trace, eval, cost, guardrail, perf

P50/P95

Latency SLOs per tool type

Required Structured trace format: every turn emits a structured log with turn ID, tool calls, token counts, and outcome.
Required Failure attribution: each failure is tagged to a layer (orchestration, verification, context, guardrail) not just flagged as "agent error."
Recommended Session replay: ability to replay any session from trace data, substituting different harness parameters to test fixes.
Recommended Cost attribution per task type: understand token consumption patterns across your task distribution to optimize context loading.
Advanced Automated harness improvement: the observability layer feeds a harness-improvement agent that proposes AGENTS.md additions from failure patterns.
Advanced Cross-session aggregate analysis: identify patterns across sessions (not just within) to catch systemic harness gaps invisible in individual traces.
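
To make the two Required items concrete, the sketch below shows one possible shape for a per-turn trace record with mandatory failure attribution. The field names and the `TurnTrace` class are assumptions for illustration, not a standard trace format.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional

# Layer names taken from the failure-attribution requirement above.
LAYERS = ("orchestration", "verification", "context", "guardrail")


@dataclass
class TurnTrace:
    turn_id: int
    tool_calls: list[str]
    tokens_in: int
    tokens_out: int
    outcome: str                          # "ok" or "failed"
    failed_layer: Optional[str] = None    # required when outcome == "failed"

    def __post_init__(self) -> None:
        if self.outcome == "failed" and self.failed_layer not in LAYERS:
            raise ValueError("a failed turn must be attributed to a harness layer")

    def to_json(self) -> str:
        """One JSON object per turn keeps the trace queryable with ordinary tools."""
        return json.dumps(asdict(self), sort_keys=True)


def failures_by_layer(traces: list[TurnTrace]) -> Counter:
    """Aggregate failures per layer to show where the harness is weakest."""
    return Counter(t.failed_layer for t in traces if t.outcome == "failed")


traces = [
    TurnTrace(1, ["read_file"], 1200, 150, "ok"),
    TurnTrace(2, ["run_tests"], 900, 300, "failed", failed_layer="verification"),
    TurnTrace(3, ["run_tests"], 950, 280, "failed", failed_layer="verification"),
]
```

Refusing to construct an unattributed failure is the point of the design: a bare "agent error" cannot enter the trace in the first place.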

State Engine & Persistence

Context windows are ephemeral. A state engine persists what matters across turns and sessions — structured checkpoints, decision logs, task state, and environmental snapshots — making multi-session autonomous work tractable.

3 layers

Working / session / long-term memory

checkpoint

State snapshot at phase gates

Required Phase-gate checkpointing: persist structured state at each phase boundary, enabling resume after interruption.
Required Decision log: every significant architectural decision is persisted with rationale, preventing the agent from re-litigating settled choices.
Recommended Three-tier memory: working memory (current turn), session memory (current task), long-term memory (cross-session knowledge).
Recommended Environmental snapshot: capture file system state, dependency versions, and tool configurations at session start for reproducibility.
Advanced Anthropic Managed Agents pattern: virtualize session (durable event log), sandbox (disposable container), and harness loop as separate components with independent lifecycle management.

Architecture

A production harness has six layers.

A production-grade harness is a layered system. Missing any layer doesn't degrade performance gracefully — it creates systematic failure modes that look like model problems but are harness deficits.

Tool Orchestration

The central nervous system

Tool orchestration is what transforms the model from a passive text generator into an autonomous actor capable of multi-step workflows. It governs how the agent accesses environments (secure file systems, shells, internal APIs) and how intelligently it chains these utilities together.

Robust orchestration includes dynamic error handling — the ability to recognize when something went wrong, pivot strategy, and recover without human intervention. This is where most naive agent implementations fail: no error recovery, no retry logic, no graceful degradation. The agent hits a wall and either halts or spirals.

Pattern Error Recovery Hierarchy

Tier 1: retry with same tool and parameters (transient failures). Tier 2: retry with modified parameters or alternative tool. Tier 3: escalate to human with structured error context. Tier 4: checkpoint state and halt cleanly. Agents without this hierarchy either give up immediately or loop forever.
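
The four tiers can be sketched as a single wrapper around a tool call. This is a minimal illustration under stated assumptions: `run_with_recovery`, `TransientError`, and `Outcome` are hypothetical names, backoff delays are omitted for brevity, and the `escalate` callback is assumed to return True when a human has accepted the hand-off.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


class TransientError(Exception):
    """A failure worth retrying unchanged (for example, a timeout)."""


@dataclass
class Outcome:
    tier: int                      # tier that produced this outcome (1-4)
    status: str                    # "ok", "escalated", or "halted"
    value: Any = None
    errors: tuple[str, ...] = ()


def run_with_recovery(
    primary: Callable[[], Any],
    fallback: Optional[Callable[[], Any]] = None,
    escalate: Optional[Callable[[list[str]], bool]] = None,
    retries: int = 2,
) -> Outcome:
    errors: list[str] = []
    # Tier 1: retry the same call, but only for transient failures.
    for _ in range(retries + 1):
        try:
            return Outcome(1, "ok", primary())
        except TransientError as exc:
            errors.append(f"tier1: {exc}")
        except Exception as exc:
            errors.append(f"tier1: {exc}")
            break  # non-transient: retrying unchanged will not help
    # Tier 2: modified parameters or an alternative tool.
    if fallback is not None:
        try:
            return Outcome(2, "ok", fallback(), tuple(errors))
        except Exception as exc:
            errors.append(f"tier2: {exc}")
    # Tier 3: hand the structured error context to a human.
    if escalate is not None and escalate(errors):
        return Outcome(3, "escalated", errors=tuple(errors))
    # Tier 4: stop cleanly; the caller is expected to persist a checkpoint.
    return Outcome(4, "halted", errors=tuple(errors))
```

Because every path ends in a returned `Outcome`, the agent can neither give up silently nor retry indefinitely.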

The tool registry defines what the agent can do. Making it narrow and atomic by default — 4–5 tools per bounded task — is standard practice. Scope creep in tool access is as dangerous as scope creep in code. Tool proliferation creates combinatorial planning complexity that degrades agent decision quality.

Verification Loops

Catching failures before they become facts

Verification loops are automated quality-assurance steps that fire during execution, not just at the end. This distinction matters enormously. A final verification that fires after the agent has made 40 interdependent changes gives almost no useful signal and is nearly impossible to recover from. Intermediate verification at each meaningful checkpoint is what makes self-correction practical.

Anthropic's research identified "victory declaration bias" as one of the most consistent failure modes: agents frequently mark tasks complete without verifying the outcome. A verification loop that actually checks the stated completion criterion — and returns a structured error if it fails — closes this loop deterministically.

Implementation Three-Tier Verification

Tier 1 (Syntax): linters, type checkers, formatters — blocking gates, not warnings. Tier 2 (Semantic): test suite execution; structured pass/fail fed back as agent context. Tier 3 (Integration): end-to-end smoke tests verifying the change works in a real environment. Each tier gates progression to the next.
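
A minimal sketch of the gating behaviour described above, assuming each tier is wrapped as a callable that returns a structured result. In a real harness those callables would invoke a linter, a test runner, and a smoke test; here they are left abstract, and `CheckResult`, `run_verification`, and `feedback_for_agent` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    tier: str          # "syntax", "semantic", or "integration"
    passed: bool
    detail: str = ""


Check = Callable[[], CheckResult]


def run_verification(tiers: list[Check]) -> list[CheckResult]:
    """Run tiers in order and stop at the first failure, so later tiers never run."""
    results: list[CheckResult] = []
    for check in tiers:
        result = check()
        results.append(result)
        if not result.passed:
            break
    return results


def feedback_for_agent(results: list[CheckResult]) -> str:
    """Turn results into a structured message that is fed back as agent context."""
    failed = [r for r in results if not r.passed]
    if not failed:
        return "VERIFIED: all tiers passed"
    first = failed[0]
    return f"FAILED at {first.tier}: {first.detail}"
```

Stopping at the first failing tier keeps the feedback focused: the agent sees one actionable error rather than a cascade caused by it.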

Spotify's Honk system and Apiiro's 2025 analysis both converge on the same conclusion: without active verification loops, AI-generated code volume scales faster than quality can be maintained. The verification layer is what makes volume sustainable.

Context & Memory

Treating context as a curated, perishable resource

Context is finite and expensive. A harness engineers context as a first-class resource, not an afterthought. The two failure modes are symmetric: too little context (the agent lacks what it needs) and too much context (the agent's attention is diluted, or "context anxiety" sets in as the window fills).

Anthropic identified "context anxiety" as a distinct failure mode: as the context window approaches its limit, models begin rushing toward completion, cutting corners, and making lower-quality decisions to avoid running out of space. The harness fix is proactive context compaction and structured checkpointing — not just hoping the model stays rational.

Critical The 75% Rule

Trigger proactive context compaction at 75% fill, not 95%. At 95%, context anxiety is already affecting output quality. Compaction at 75% gives the model enough breathing room to reason clearly about summarization — and produces better summaries as a result.
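
The rule reduces to a threshold check owned by the harness plus a compaction step. The sketch below assumes messages are plain strings and that `summarize` is whatever summarization call the harness provides; both function names are hypothetical.

```python
from typing import Callable


def should_compact(used_tokens: int, window_tokens: int, threshold: float = 0.75) -> bool:
    """True once the context window is at or past the fill threshold."""
    return used_tokens >= threshold * window_tokens


def compact(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
) -> list[str]:
    """Replace everything except the most recent messages with one summary."""
    if len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent
```

For a 100,000-token window, `should_compact` first returns True at 75,000 tokens used, well before the model is under pressure.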

The canonical pattern is a short AGENTS.md or CLAUDE.md file (~100 lines) as the entry point — not an encyclopedia but a table of contents, with pointers to deeper docs that get pulled on demand. The todo.md pattern has the agent re-read its plan at each step, preventing goal drift across a long task. These aren't prompt tricks. They're harness conventions.

Guardrails & Permissions

Deterministic constraints over probabilistic compliance

The core architectural distinction of harness engineering: telling an agent "follow our coding standards" in a prompt is fundamentally different from wiring a linter that blocks the PR when standards are violated. The first relies on probabilistic compliance. The second enforces deterministic constraints.

Guardrails operate at two levels. Feedforward controls (constraint harnesses) reduce the agent's solution space before generation begins — rules files, type systems, architectural lint configurations, permission boundaries. Feedback controls (enforcement harnesses) detect violations after the fact and trigger structured error recovery loops.

Design Reversibility-Weighted Approval

Every planned action gets a reversibility score (1–5). Score 1 (writing a local file) requires no approval. Score 3 (modifying a shared service config) requires soft confirmation. Score 5 (deleting data, making external API calls with side effects) requires explicit human approval regardless of context. The harness enforces this scoring, not the model's self-assessment.
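
One way to encode this policy is a lookup table owned by the harness plus a mapping from score to approval level. The text only specifies scores 1, 3, and 5, so the handling of 2 and 4 below is an assumption, as are the action names and the rule that unknown actions default to the most restrictive score.

```python
from enum import Enum


class Approval(Enum):
    NONE = "none"
    SOFT_CONFIRM = "soft_confirm"
    HUMAN_REQUIRED = "human_required"


# Hypothetical scoring table owned by the harness, never by the model.
ACTION_SCORES = {
    "write_local_file": 1,
    "modify_shared_config": 3,
    "delete_data": 5,
    "call_external_api": 5,
}


def score_action(action: str) -> int:
    """Unknown actions get the most restrictive score (assumption)."""
    return ACTION_SCORES.get(action, 5)


def required_approval(score: int) -> Approval:
    if not 1 <= score <= 5:
        raise ValueError("reversibility score must be between 1 and 5")
    if score <= 2:                    # assumption: 2 is treated like 1
        return Approval.NONE
    if score == 3:
        return Approval.SOFT_CONFIRM
    return Approval.HUMAN_REQUIRED    # assumption: 4 is treated like 5
```

Keeping the table outside the prompt means the model cannot talk its way into a lower score.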

Anthropic's two-stage classifier design for Claude Code is instructive: a fast single-token gate first, chain-of-thought reasoning only on flagged actions. This avoids "approval fatigue" (users approving 93% of prompts reflexively makes approvals meaningless) while maintaining real oversight for genuinely high-risk actions.

Observability

You cannot improve what you cannot see

Observability in a harness context means full session replay, per-turn token attribution, subagent execution trees, and complete cost tracking. It spans all other layers — every tool call, every context load, every verification result, every guardrail trigger — in a structured, queryable trace.

Without this, debugging is archaeology. Teams know something went wrong; they don't know which layer failed, which tool call was the inflection point, or why the agent took the path it did. With it, failure analysis becomes systematic.

Emerging Standard The Unified Observability Plane

AgentOps, Future AGI, and Anthropic's managed agents all converge on the same architecture: a unified plane that collapses tracing, evals, cost tracking, and guardrail telemetry into a single feedback loop. The goal is to shrink the time between "agent failure" and "harness fix" from days to hours and, eventually, minutes.

LangChain's improvement from Top 30 to Top 5 on Terminal Bench 2.0 was driven significantly by using LangSmith tracing at scale to identify failure modes and iteratively optimize the harness — not by intuition, but by data from the observability layer.

State Engine & Persistence

Making multi-session work tractable

Context windows are ephemeral. A state engine persists what matters across turns and sessions — structured checkpoints, decision logs, task state, and environmental snapshots. Without it, every new session starts from scratch, making long-horizon autonomous work impossible.

The three-tier memory model is emerging as the standard: working memory (current turn context), session memory (structured state for the current task), and long-term memory (cross-session knowledge about the environment, conventions, and past decisions).

Pattern Decision Log Anti-Pattern

One of the most underappreciated failure modes: an agent makes an architectural decision in session one, and in session three revisits and reverses it — because the decision wasn't persisted. Decision logs that record not just what was decided but why prevent this regression and eliminate wasted cycles relitigating settled choices.
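
A decision log can be as simple as an append-only JSON Lines file that is read back at the start of each session. The sketch below is one possible shape, not a reference implementation; the function names and the file layout are assumptions.

```python
import json
from pathlib import Path


def record_decision(log_path: Path, topic: str, decision: str, rationale: str) -> None:
    """Append one decision, with its rationale, as a single JSON line."""
    entry = {"topic": topic, "decision": decision, "rationale": rationale}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def settled_decisions(log_path: Path) -> dict[str, dict]:
    """Return the latest decision per topic; later entries supersede earlier ones."""
    if not log_path.exists():
        return {}
    latest: dict[str, dict] = {}
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            entry = json.loads(line)
            latest[entry["topic"]] = entry
    return latest


def render_for_context(decisions: dict[str, dict]) -> str:
    """Format settled decisions for injection at the start of a new session."""
    return "\n".join(
        f"- {topic}: {entry['decision']} (why: {entry['rationale']})"
        for topic, entry in sorted(decisions.items())
    )
```

Because the log is append-only, a reversal is still possible, but it has to be recorded explicitly with a new rationale rather than happening by accident.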

Anthropic's Managed Agents (April 2026) virtualizes exactly these three components: the session (a durable event log outside the context window), the sandbox (a disposable container where code runs), and the harness loop itself. This pattern is quickly becoming the reference architecture for stateful agent systems.

What breaks agents

Named failure modes. Harness-level fixes.

Anthropic's research and OpenAI's field report both identified systematic failure modes that are inherent to how language models reason — but solvable at the harness level. Understanding them by name is the first step to fixing them architecturally.

FAILURE 01

● Model Bias

Victory Declaration Bias

Agents frequently mark tasks complete without verifying the outcome. Estimated to account for 30–40% of agent failures in production code tasks.

Harness Fix

Mandatory completion criterion check: a verification loop that inspects the actual artifact against the stated objective before the task is marked done. Blocked by CI if verification is skipped.

FAILURE 02

● Context Failure

Context Anxiety

As context fills toward the limit, models rush toward completion, cutting corners, making lower-quality decisions, and hallucinating resolutions.

Harness Fix

Proactive context compaction at the 75% fill threshold with structured checkpoint injection. The harness, not the model, decides when to compact.

FAILURE 03

● Planning Failure

One-Shotting Overreach

Agents try to solve entire problems in a single execution, producing undocumented tangles of interdependent changes that are impossible to review or roll back.

Harness Fix

Phase gates enforced by the harness: design phase → implementation phase → test phase → review phase. Phase transitions require explicit verification outputs.
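
A phase gate can be sketched as a tiny state machine that refuses to advance without verification output. The class and exception names are hypothetical, and what counts as acceptable evidence is left to the harness; this sketch only checks that something was supplied.

```python
PHASES = ("design", "implementation", "test", "review")


class GateError(Exception):
    """Raised when a phase transition is attempted without verification output."""


class PhaseGate:
    def __init__(self) -> None:
        self._index = 0
        self.evidence: dict[str, str] = {}

    @property
    def current(self) -> str:
        return PHASES[self._index]

    def advance(self, verification_output: str) -> str:
        """Record evidence for the current phase, then move to the next one."""
        if not verification_output.strip():
            raise GateError(f"cannot leave '{self.current}' without verification output")
        self.evidence[self.current] = verification_output
        if self._index < len(PHASES) - 1:
            self._index += 1
        return self.current
```

The recorded evidence doubles as a review trail: each phase boundary has an artifact a human can inspect.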

FAILURE 04

● Loop Failure

Doom Loops

Agents get stuck in repetitive cycles — trying the same broken approach over and over. Token budgets exhaust. No useful work is produced.

Harness Fix

Middleware loop-detection hooks that identify identical or near-identical tool call sequences across turns, inject a circuit breaker, and redirect strategy before budget is exhausted.
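
A minimal sketch of such a hook, assuming tool arguments are JSON-serializable. It only catches exact repeats within a sliding window; near-identical calls would need a similarity measure that is out of scope here. `LoopDetector` and its parameters are hypothetical.

```python
import json
from collections import deque
from typing import Any


class LoopDetector:
    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.recent: deque = deque(maxlen=window)   # sliding window of call signatures
        self.max_repeats = max_repeats

    def observe(self, tool: str, args: dict[str, Any]) -> bool:
        """Record a call; return True when the circuit breaker should trip."""
        signature = (tool, json.dumps(args, sort_keys=True))
        self.recent.append(signature)
        return self.recent.count(signature) >= self.max_repeats
```

When `observe` returns True, the harness would stop dispatching the call and inject a message telling the agent to change strategy.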

FAILURE 05

● Search Failure

Grep-Spree

Without a structured environment map, agents burn massive tokens blindly searching for files or data — often finding what they need on the 15th attempt.

Harness Fix

A well-structured AGENTS.md with explicit directory conventions and a codebase map. Agents can navigate purposefully instead of exploring exhaustively.

FAILURE 06

● Drift Failure

Goal Drift

Over long multi-session tasks, agents lose track of the original objective, gradually drifting toward sub-goals that feel locally important but diverge from the actual target.

Harness Fix

The todo.md pattern: the agent re-reads its stated plan at each checkpoint, re-anchoring to the primary objective before proceeding to the next phase.

FAILURE 07

● Decision Failure

Autonomous Irreversibility

Agents make irreversible architectural decisions without the context to evaluate long-term consequences — decisions that look locally correct but create large downstream technical debt.

Harness Fix

Reversibility scoring applied to every planned action. Hard-to-reverse actions (those with high scores) trigger a decision.md checkpoint and human review before execution.

FAILURE 08

● Scope Failure

Helpful Overreach

Agents, in attempting to be maximally helpful, "fix" things beyond the stated scope — introducing unreviewed changes that break other functionality.

Harness Fix

Explicit scope boundaries in AGENTS.md combined with diff-gating: any change outside declared scope files triggers a scope violation warning and human confirmation.
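
Diff-gating can be approximated by checking each changed path against the declared scope patterns. The sketch below uses shell-style patterns via the standard library's `fnmatchcase`, where `*` also matches path separators; the function names and the result shape are assumptions.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the declared scope patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


def gate_diff(changed_files: list[str], allowed_patterns: list[str]) -> dict:
    """Produce a structured verdict the harness can act on."""
    violations = out_of_scope(changed_files, allowed_patterns)
    status = "needs_human_confirmation" if violations else "ok"
    return {"status": status, "violations": violations}
```

In practice the list of changed files would come from the version control diff for the agent's working branch.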

These aren't model quality problems. They're architectural absences. Better prompting addresses none of them. Each has a deterministic harness-level fix that makes the failure mode structurally harder to repeat — not just less likely in any given run.

Control theory for AI

Decision architecture: feedforward and feedback control.

Harness engineering borrows a precise framing from control theory. A well-designed harness uses both feedforward and feedback control — two complementary mechanisms that together produce robust agent behavior.

⬆️

Feedforward Control

Constraint Harnesses — Before generation

Reduce the agent's solution space before any output is generated. The agent cannot produce certain classes of error because the harness prevents those paths from being accessible.

Rules files and architectural lint configurations (hard constraints on patterns the agent may use)
Type systems and schema validation (invalid types become unrepresentable)
Permission boundaries (tools not in the registry cannot be called)
AGENTS.md conventions (directory, naming, and scope rules)
Scope boundary declarations (explicit allowed-file manifests)

🔁

Feedback Control

Enforcement Harnesses — After generation

Detect violations in the agent's output and trigger structured correction loops. The agent can still make mistakes; the harness ensures those mistakes are caught and corrected before they become permanent.

Post-generation linting with structured error payloads fed back to the agent
Test suite execution with failure context as agent input
Scope violation detection with diff analysis
Loop detection middleware that identifies repetition patterns
Victory declaration interceptors that require actual verification

The optimal harness uses both. Feedforward controls are more efficient because they prevent errors outright, but they can't cover every case. Feedback controls are the safety net that catches what feedforward missed. Neither alone is sufficient. Together, they create defense in depth against the failure modes catalogued above.

The OWASP LLM06:2025 "Excessive Agency" risk framing provides the checklist for feedforward control: over-provisioned functions, unnecessary permissions, and missing approval mechanisms are the attack surface. The harness is the defense. Every entry in that checklist maps directly to a harness engineering decision.

Proof in production

Same model. Radically different agent.

The most cited empirical proof of harness engineering's impact is LangChain's Terminal Bench 2.0 results in early 2026. They ran a controlled experiment: improve the agent's performance without changing the underlying model. The result settled the debate.

Terminal Bench 2.0 — LangChain deepagents-cli

Model: gpt-5.2-codex (unchanged across both runs)

Controlled

Before harness work

52.8%

Rank: Top 30

After harness work

66.5%

Rank: Top 5 → +25 positions

Three changes (all harness, zero model)

System prompts emphasizing self-verification loops — agents check their own work before declaring completion (targets Victory Declaration Bias)
Enhanced context injection — structured environment maps so agents understand their working directory without blind searching (targets Grep-Spree)
Middleware loop-detection hooks — circuit breakers that fire when the agent enters a repetitive pattern and redirect strategy (targets Doom Loops)

Notable: running at maximum reasoning budget scored worse (53.9%, versus 66.5% for the final configuration) because of timeout failures. More model compute does not substitute for harness quality. The harness matters more than the reasoning budget.

The independent validation from Ewan Mak's team reinforces this: working with a financial services client's codebase, the same Claude Sonnet 4.6 model went from a 58% pass rate to 81% — purely through harness changes. No weight updates. No model swap. Two weeks of harness engineering: rewriting the system prompt for monorepo layout, adding subagent delegation, wiring linter output back as middleware observations.

Performance leverage: model swap vs. harness investment

Intervention	Indicative gain
Model swap (same harness)	~15%
Harness optimization only	~70%
Model + harness together	~92%
More reasoning budget (no harness)	~8%

Indicative ranges from 2026 practitioner reports. Exact gains depend on baseline harness quality and task domain.

The AGENTS.md pattern

Every line of this file is a past failure.

Mitchell Hashimoto's Ghostty terminal repository became the canonical reference implementation. Its AGENTS.md file is a living artifact: each rule corresponds to a specific past agent failure that's now prevented. The file grows incrementally. Each addition makes the failure mode structurally harder to repeat.

# AGENTS.md: Ghostty pattern (reconstructed with annotations)
# This file is version-controlled. Every rule maps to a specific past failure.
# Rule additions go through PR review. Deletions require explicit justification.

## Environment Setup
- Build with: `zig build` (not cmake, not make, not any other build tool)
  # ← Agent used cmake on first unguided run. Now physically impossible to mistake.
- Test with: `zig build test` before marking any task complete
  # ← Agent skipped tests on sessions 3 and 7. Now CI gate: PR cannot merge without test proof.
- Never modify files in `vendor/` or `deps/` directories
  # ← Agent modified vendored dep. Broke reproducibility for the whole team.
- If build fails, read the error fully before retrying. Do not retry without understanding.
  # ← Agent blind-retried 11 times on a config error that the first error message described clearly.

## Code Conventions
- All public functions must have doc comments, no exceptions
  # ← Agent shipped 14 undocumented public APIs in one PR. Team couldn't onboard to them.
- `catch unreachable` is only valid in test files. Never in production paths.
  # ← Agent used catch unreachable in a production error handler. Panic in prod.
- Do not use `@import("std").log` in library code. Use the logger abstraction.
  # ← Added after agent bypassed the structured logging layer. Lost structured fields.

## Scope Control (Failure: One-Shotting Overreach)
- Maximum 300 lines changed per PR. Split larger changes.
  # ← Agent one-shotted a 2,100-line change. Physically impossible to review meaningfully.
- Structure every feature as: design PR → implementation PR → test PR
  # ← Agent mixed design decisions and implementation. Created unmaintainable coupling.
- If a change touches more than 5 files, create an architecture note first.
  # ← Agent made broad structural changes across 23 files without a design doc.

## Verification (Failure: Victory Declaration Bias)
- Run `zig build test` AND `zig build lint` before marking any task done.
  # ← Victory declaration bias. Now a blocking CI requirement, not a request.
- If a test fails, diagnose before fixing. Write the diagnosis as a comment.
  # ← Agent pattern-matched fixes without understanding failures. Masked bugs.
- If unsure about a decision, create `docs/decisions/DECISION-YYYYMMDD.md` and stop.
  # ← Agent made irreversible architectural choices. Decision log now prevents this.

## Context (Failure: Goal Drift)
- Re-read `todo.md` at the start of each session and before each phase transition.
  # ← Agent drifted from primary objective across 3 sessions. todo.md re-anchor prevents this.
- If context is > 75% full, summarize state to `checkpoints/` before continuing.
  # ← Context anxiety caused agent to hallucinate a resolution at 92% context fill.

This isn't a system prompt. It's a harness configuration document. The difference matters: a system prompt is probabilistic guidance. An AGENTS.md wired to CI gates is a deterministic constraint. One is advisory. The other is structural.
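To make the advisory/structural distinction concrete, here is a minimal sketch of one AGENTS.md rule ("Maximum 300 lines changed per PR") enforced as a CI gate. It assumes the CI job pipes in the output of `git diff --numstat` for the PR range; the function names and the exit-code convention are illustrative.

```python
def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-\t-\tpath"; they carry no line counts.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def scope_gate(numstat: str, max_lines: int = 300) -> int:
    """Return a process exit code: 0 passes the gate, 1 blocks the merge."""
    total = changed_lines(numstat)
    if total > max_lines:
        print(f"FAIL: {total} lines changed (limit {max_lines}). Split the PR.")
        return 1
    print(f"OK: {total} lines changed.")
    return 0
```

In CI this would run as something like `sys.exit(scope_gate(sys.stdin.read()))`. The agent can ignore a sentence in a prompt, but it cannot merge past a non-zero exit code.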

The pattern scales. OpenAI's team enforced "taste invariants" — a small set of rules encoding team engineering standards — as hard CI failures, not warnings. The harness grows each time an agent makes a novel mistake. The team gets smarter without the model changing.

An AGENTS.md that hasn't changed in two weeks isn't mature — it's either perfect (unlikely) or the team isn't learning from agent failures (more likely). A healthy AGENTS.md grows with every novel failure. The rate of new rule additions is a leading indicator of how actively the team is practicing harness engineering.
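The two-week staleness signal is easy to compute from version control history. This is a rough sketch, assuming the commit dates for the file have already been collected (for example from `git log --date=short --format=%ad -- AGENTS.md`); the `rule_velocity` helper is hypothetical. It counts commits touching the file, not individual rule additions, so treat it as an approximation.

```python
from datetime import date, timedelta


def rule_velocity(commit_dates: list[date], today: date, window_days: int = 14) -> dict:
    """Summarize how actively a rules file is changing."""
    recent = [d for d in commit_dates if today - d <= timedelta(days=window_days)]
    last = max(commit_dates) if commit_dates else None
    return {
        "changes_in_window": len(recent),
        "days_since_last_change": (today - last).days if last else None,
        "stale": len(recent) == 0,  # no change inside the window
    }
```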

Practical guide

Building a production harness: the implementation sequence.

Harness engineering is not a monolithic project. It's an incremental discipline. The following sequence reflects what teams building production agent systems in 2026 found actually works — ordered by leverage, not complexity.

WEEK 1–2 (start here)

Foundation: AGENTS.md + Basic Verification. Write the initial AGENTS.md from your first agent session. Every observed failure becomes a rule. Wire basic linting as a blocking CI gate. These two steps alone eliminate 40–50% of recurring failures.

WEEK 3–4

Control: Guardrails + Permission Tiers — Define the tool registry explicitly. Implement reversibility scoring. Set up the two-stage approval classifier for high-stakes actions. Map your compliance requirements to feedforward constraints.
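Reversibility scoring and staged approval can be sketched as below. This section does not define the scoring scale or the classifier, so everything here is an assumption: the scores, the `review_at` threshold, and the reading of "two-stage" as a cheap deterministic triage followed by an approver callback.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "review"  # escalate to stage two (a human or a stricter check)
    DENY = "deny"


# Hypothetical scores: 0.0 = trivially undone, 1.0 = cannot be undone.
REVERSIBILITY = {
    "read_file": 0.0,
    "write_file": 0.3,  # recoverable through version control
    "git_push": 0.6,
    "drop_table": 1.0,
}


def classify(tool: str, review_at: float = 0.5) -> Decision:
    """Stage one: deterministic triage by reversibility score."""
    if tool not in REVERSIBILITY:
        return Decision.DENY  # tools outside the explicit registry are never callable
    if REVERSIBILITY[tool] >= review_at:
        return Decision.REVIEW
    return Decision.ALLOW


def authorize(tool: str, approver: Callable[[str], bool]) -> bool:
    """Stage two runs only for REVIEW decisions; `approver` is any callable."""
    decision = classify(tool)
    if decision is Decision.REVIEW:
        return bool(approver(tool))
    return decision is Decision.ALLOW
```

Keeping stage one as a plain lookup means most calls never reach the slower approval path, and an unregistered tool fails closed.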

WEEK 5–8 (L3 + L6)

Memory: Context Management + State Engine. Implement the 75% compaction trigger. Set up todo.md re-anchor pattern. Build phase-gate checkpointing. Add decision log infrastructure. Begin tiered context loading.
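The 75% compaction trigger and phase-gate checkpointing can be sketched together. This assumes the harness can report token usage for the current context and that the summary text is produced elsewhere (for example, by asking the model to summarize). The function names and the JSON layout are hypothetical; the `checkpoints/` directory follows the AGENTS.md example earlier in the article.

```python
import json
from pathlib import Path


def should_compact(used_tokens: int, window_tokens: int, threshold: float = 0.75) -> bool:
    """Fire proactively once context usage reaches the threshold."""
    return used_tokens / window_tokens >= threshold


def write_checkpoint(directory: Path, phase: str, summary: str,
                     open_items: list[str]) -> Path:
    """Persist a state summary so a fresh context can resume from it."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{phase}.json"
    path.write_text(json.dumps(
        {"phase": phase, "summary": summary, "open_items": open_items}, indent=2))
    return path
```

The harness checks `should_compact` before every model call. When it fires, the harness writes the checkpoint first and only then trims the context, so nothing is lost if compaction goes wrong.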

WEEK 9–12

Visibility: Observability Plane — Instrument structured traces across all layers. Build session replay capability. Set up cost attribution per task type. Connect the failure-to-harness-improvement feedback loop.
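A structured trace entry with layer attribution, plus cost attribution per task type, might look like the sketch below. The schema is an assumption: the field names, the layer labels, and the `cost_by_task_type` helper are illustrative and not tied to any particular tracing product.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEntry:
    session_id: str
    layer: str       # hypothetical labels, e.g. "L1-orchestration"
    action: str
    task_type: str
    tokens: int
    cost_usd: float
    ok: bool
    ts: float = field(default_factory=time.time)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize to one JSON line for an append-only trace file."""
        return json.dumps(asdict(self), sort_keys=True)


def cost_by_task_type(entries: list[TraceEntry]) -> dict[str, float]:
    """Aggregate cost per task type, rounded to avoid float noise."""
    totals: dict[str, float] = {}
    for e in entries:
        totals[e.task_type] = round(totals.get(e.task_type, 0.0) + e.cost_usd, 6)
    return totals
```

Because every entry carries `session_id` and `ts`, filtering by session and sorting by timestamp gives a basic form of session replay.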

MONTH 4+ (advanced)

Scale: Multi-Agent + Domain Evals. Build orchestrator patterns for parallel work. Develop custom evaluation datasets from your actual task distribution. Begin automated harness improvement from observability signals.

The most common mistake: trying to build all six layers simultaneously before any layer works well. A deeply engineered L1–L2 (orchestration + verification) produces more reliable agents than a thin implementation of all six. Depth on the first two layers is the highest-leverage starting point.

A practical harness audit checklist — the questions to ask before declaring a harness production-ready:

✓ Can the agent fail gracefully without human intervention? Is there a retry hierarchy with circuit breakers?
✓ Does every task have an explicit completion criterion that the agent verifies before marking done?
✓ Is there a proactive context compaction trigger that fires before context anxiety sets in?
✓ Are coding standards enforced as deterministic CI gates, not probabilistic prompt guidance?
✓ Does every significant agent action produce a structured trace entry with layer attribution?
! Is there a loop-detection mechanism that fires before token budgets are exhausted?
! Does the agent re-anchor to the primary objective at each phase transition?
! Are irreversible actions gated by reversibility-weighted approval requirements?
✗ Do you have custom evaluation datasets built from your actual task distribution? (Most teams don't yet.)
✗ Is the observability layer feeding an automated harness improvement loop? (Frontier practice, not yet standard.)

Advanced patterns

Multi-agent harnesses: orchestrating agent fleets.

Single-agent harnesses are the foundation. The emerging frontier is harnesses that coordinate multiple specialized agents working in parallel on shared tasks — and the harness complexity grows non-linearly with agent count.

Orchestrator Pattern

One agent manages many

An orchestrator agent breaks down complex tasks, delegates subtasks to specialized workers, and aggregates results. The harness must manage inter-agent communication, state sharing, and conflict resolution.

Anthropic's three-agent harness study — Planner, Generator, Evaluator — on a 2D retro game engine task produced significantly better results than a solo unconstrained agent on the same task and model. The structured hand-offs and explicit evaluator role removed the victory declaration bias from the generation agent.

Planner receives the goal and produces a structured subtask manifest
Generator executes individual subtasks against the manifest
Evaluator verifies each output before the Planner advances
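The three roles above reduce to a small control loop. This is a minimal sketch, assuming each role is exposed to the harness as a plain callable; in practice each would wrap a model call. The `run_pipeline` function and its signatures are hypothetical and are not Anthropic's implementation.

```python
from typing import Callable


def run_pipeline(goal: str,
                 plan: Callable[[str], list[str]],
                 generate: Callable[[str, str], str],
                 evaluate: Callable[[str, str], tuple[bool, str]],
                 max_attempts: int = 3) -> dict[str, str]:
    """Advance through the manifest only when the evaluator accepts each output."""
    results: dict[str, str] = {}
    for subtask in plan(goal):
        feedback = ""
        for _ in range(max_attempts):
            output = generate(subtask, feedback)
            accepted, feedback = evaluate(subtask, output)
            if accepted:
                results[subtask] = output
                break
        else:
            # The generator never gets to declare its own work finished.
            raise RuntimeError(
                f"evaluator rejected {subtask!r} after {max_attempts} attempts")
    return results
```

The key design choice is that completion is decided by `evaluate`, not by `generate`. That is the structural reason the generator cannot declare victory on its own.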

Parallel Agent Pattern

Many agents, shared workspace

Multiple agents work simultaneously on different parts of a shared codebase. The harness must enforce workspace isolation, coordinate writes, and prevent conflicts, which is essentially a distributed-systems problem applied to AI agents.

File-level locking prevents write conflicts
Dependency graph analysis routes agents to non-conflicting work
Merge coordination reviews parallel outputs before integration

Open research problems include orchestrating hundreds of agents in parallel on a shared codebase without write conflicts; harnesses that dynamically assemble tools and context just-in-time for a given subtask; and agents that analyze their own traces to propose harness-level fixes, closing the improvement loop autonomously.
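File-level locking for parallel agents can be illustrated with an in-memory claim table. This is a sketch under simplifying assumptions: all agents run inside one harness process, and a claim covers whole files. The `FileLockTable` class is hypothetical; a multi-process harness would need a shared store instead.

```python
import threading


class FileLockTable:
    """In-memory claim table: one agent may hold a given path at a time."""

    def __init__(self):
        self._owners: dict[str, str] = {}
        self._mutex = threading.Lock()

    def claim(self, agent_id: str, paths: list[str]) -> bool:
        """Atomically claim all paths, or none if any is held by another agent."""
        with self._mutex:
            if any(self._owners.get(p, agent_id) != agent_id for p in paths):
                return False
            for p in paths:
                self._owners[p] = agent_id
            return True

    def release(self, agent_id: str) -> None:
        """Drop every claim held by the agent, e.g. after its work is merged."""
        with self._mutex:
            self._owners = {p: a for p, a in self._owners.items() if a != agent_id}
```

All-or-nothing claims avoid the case where two agents each hold half of the files the other needs.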

Multi-agent trust model

In a multi-agent system, inter-agent communication is an attack surface. Anthropic's guidance: treat messages from sub-agents with the same skepticism as messages from untrusted users. Agent authority is granted by the harness, not by another agent.

This prevents "agent impersonation" — a pattern where a compromised or hallucinating sub-agent claims elevated permissions it doesn't have, and a naive orchestrator grants them.
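The principle that authority comes from the harness can be shown in a few lines. This sketch assumes the harness itself establishes the `sender` identity (for example, from the channel a message arrives on) and keeps a grant table that agents cannot write to. The `AgentMessage` shape and the `HARNESS_GRANTS` contents are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentMessage:
    sender: str             # set by the harness transport, not by the agent
    action: str
    claimed_role: str = ""  # self-reported; never used for authorization


# Grants are written by the harness at spawn time; agents cannot modify them.
HARNESS_GRANTS: dict[str, frozenset[str]] = {
    "planner": frozenset({"read_file", "delegate"}),
    "worker-1": frozenset({"read_file", "write_file"}),
}


def is_authorized(msg: AgentMessage) -> bool:
    """Authorize from the harness grant table only; ignore what the message claims."""
    return msg.action in HARNESS_GRANTS.get(msg.sender, frozenset())
```

A sub-agent that sends `claimed_role="admin"` gets exactly the permissions it was spawned with, because `is_authorized` never reads that field.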

Managed Agents pattern (April 2026)

Anthropic's Managed Agents virtualizes three components: the session (durable event log outside the context window), the sandbox (disposable container where code runs), and the harness loop itself.
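The idea of a session as a durable event log that lives outside the context window can be illustrated independently of any product. The sketch below is not the Managed Agents API. It is a hypothetical append-only JSONL log, with `SessionLog`, the file layout, and the event shape all assumed for illustration.

```python
import json
from pathlib import Path
from typing import Optional


class SessionLog:
    """Append-only JSONL event log stored outside the model's context window."""

    def __init__(self, path: Path):
        self.path = path
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.touch(exist_ok=True)

    def append(self, kind: str, payload: dict) -> None:
        """Write one event as a single JSON line."""
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"kind": kind, "payload": payload}) + "\n")

    def replay(self, last_n: Optional[int] = None) -> list[dict]:
        """Read events back so a fresh context or sandbox can rebuild state."""
        lines = self.path.read_text(encoding="utf-8").splitlines()
        events = [json.loads(line) for line in lines if line]
        if last_n is None:
            return events
        return events[max(0, len(events) - last_n):]
```

Because the log survives both context compaction and sandbox disposal, the sandbox can be thrown away at any time and the session can resume from `replay()`.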

This covers L1, L6, and part of L5. The remaining layers — guardrails tuned to your domain and observability integrated with your systems — remain the custom work every team must do.

Build vs. buy

Prebuilt harnesses vs. custom harnesses.

Most AI coding agents ship with a default harness already built in. Claude Code is the clearest example: file read/write tools, terminal command execution, a multi-step execution loop, permission controls that prompt for human approval before risky actions. That default harness is what makes it an agent rather than a chatbot.

Prebuilt Harness

Claude Code, Cursor, Copilot Workspace

General-purpose execution capability out of the box. Tools, basic orchestration, a default permission model, standard verification. This is the commodity layer.

Strengths:
Production-ready for general tasks immediately
MCP extension points for custom tools
Maintained and updated by the platform
Integrated observability (LangSmith, Anthropic traces)
Battle-tested across millions of agent sessions

Limitations:
No knowledge of your domain constraints or conventions
No enforcement of your compliance requirements
No integration with your internal audit systems
Generic verification that doesn't know your test suite structure

Custom Harness Layer

Your organizational accountability layer

Domain-specific tools, custom evals, permission models tuned to your risk posture. Built on top of the prebuilt platform, not instead of it. This is the moat.

Strengths:
Proprietary compliance linters enforced as CI gates
Audit logging to internal regulatory systems
Domain-specific tool registry and permission tiers
Custom evals benchmarked on your actual task distribution
AGENTS.md encoding your team's accumulated failure-learning

Limitations:
Requires 4–12 weeks to build a production-grade layer
Ongoing maintenance as the prebuilt platform evolves
Requires dedicated harness engineering investment

Decision	Buy (Prebuilt)	Build (Custom)
Tool orchestration runtime	✓ Buy
Basic permission controls	✓ Buy
Managed agent sandboxing	✓ Buy
General observability telemetry	✓ Buy
Domain-specific tool registry		✓ Build
Compliance-specific guardrails		✓ Build
Internal audit log integration		✓ Build
AGENTS.md (living document)		✓ Build
Custom evaluation datasets		✓ Build

The strategic recommendation from the 2026 landscape: buy the commodity plumbing (managed runtimes, basic telemetry, control planes) and build the proprietary integrations — domain-specific tools, custom evaluation datasets, internal environment maps, compliance-specific guardrails. The prebuilt harness provides general-purpose reliability. The custom layer provides organizational accountability. Both are necessary. Neither replaces the other.

The next frontier

Environment Engineering: inverting the problem.

The strategic horizon emerging in mid-2026 inverts the entire framing. Instead of building smarter harnesses to navigate messy legacy systems, forward-looking organizations are re-architecting their internal APIs, codebases, and databases to be inherently legible to AI agents.

The harness cost reveals the environment cost. If your codebase requires a sophisticated harness to navigate, that's a signal that the codebase itself has complexity debt.

— Emerging principle, AI Platform Engineering teams, 2026

This is Environment Engineering, and it's the next competitive frontier. Clean, well-documented, consistently structured systems need less harness scaffolding to produce reliable agent work. The relationship holds in both directions: the cleaner the environment, the thinner the harness it needs, so harness complexity is a proxy for environment complexity.

Environment Engineering patterns

Consistent naming conventions enforced by linters so agents can infer structure without searching. Explicit API contracts maintained as machine-readable specs. Modular architectures that create natural subtask boundaries.

Test suites designed for agents — rapid, unambiguous feedback. Fast unit test cycles rather than slow integration tests as the primary verification signal.

CNCF's four pillars

The CNCF's four pillars of platform control — golden paths, guardrails, safety nets, and manual review — are emerging as the design principles that every production harness will implement.

The convergence of CI/CD tooling, agent harnesses, and platform engineering into a unified discipline is the trajectory of the field. Platform engineering becomes AI infrastructure engineering.

The long-term trajectory: organizations that invest in environment engineering today are building a compounding advantage. A cleaner environment requires less harness, which reduces engineering cost, which allows more agent capacity for productive work, which accelerates the flywheel. Harness quality today determines environment quality tomorrow.
