The model is a commodity. The harness is the moat. 88% of enterprise AI agents fail in production, not because the model is wrong, but because the environment around it is broken.
That number, cited across multiple 2026 AI engineering reports, is jarring. Not 20%. Not 40%. Nearly nine out of ten. And the failure isn't what you'd expect. It isn't the model. The model can write code, plan tasks, use tools. The model isn't the problem.
The problem is everything else. Context rot. Absent guardrails. No state persistence. Feedback loops that never close. Agents that declare victory without checking their work. Systems that run beautifully in a demo and catastrophically in production.
The dominant assumption from 2023 through most of 2025 was that the model itself was the determining variable. Teams optimized around the next model release. By late 2025, that assumption had quietly collapsed. The performance gap between frontier models from different labs had narrowed substantially. The gap between the same model with and without a well-designed harness had widened dramatically. The harness became the actual variable.
In February 2026, three converging publications crystallized what practitioners had been building without a shared vocabulary: Mitchell Hashimoto's personal blog post, Ryan Lopopolo's field report from OpenAI, and a cascade from LangChain, Anthropic, and Thoughtworks. A discipline was named: Harness Engineering.
Frontier model capability gaps narrowed. GPT-4-class intelligence became commoditized. The marginal return on a model upgrade dropped; the marginal return on harness investment rose sharply.
Result: engineering effort shifted from prompt crafting to environment design — and teams that made that shift earlier outperformed those still optimizing prompts.
Everything in an AI agent except the model itself. The tools it can use, the permissions it's granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures.
Source: Birgitta Böckeler, Thoughtworks / Martin Fowler, April 2026.
AI engineering maturity has moved through three distinct phases, each one revealing a new bottleneck that the previous phase couldn't solve.
The early discovery: phrasing matters enormously. "Write a function that..." produces different output than "You are a senior engineer. Implement a function that...". The whole discipline focused on language as the lever.
Single-turn optimization. One question, one answer. Role-playing, few-shot examples, chain-of-thought cues. Engineers competed to find the magic phrase.
Teams maintained "golden prompts." Performance was fragile — a model update could break a carefully tuned prompt. The knowledge was linguistic, not architectural.
Drafting, summarizing, simple Q&A. Tasks that fit in one exchange with stable, small context worked well. Multi-step, state-dependent, and domain-specific tasks broke it.
Once models got more capable, the bottleneck shifted from how you ask to what information the model has. Better prompts on the wrong context = better-worded wrong answers.
Popularized by Andrej Karpathy in mid-2025, though practised much earlier. The focus shifted from wording to curation: what files, rules, schemas, and constraints go into the model's context window?
RAG pipelines, MCP servers, structured documentation, codebase maps. Engineers stopped asking "how do I say this?" and started asking "what does the model need to know?"
Semantic retrieval, structured system prompts, CLAUDE.md / AGENTS.md files, per-tool schema injection. Context became a curated artifact, not an afterthought.
Coding assistants with codebase context, customer service bots with product knowledge, research agents with document retrieval. Quality improved dramatically with the right context.
Once models had the right information, the challenge became autonomy and reliability. Agents knowing the right thing didn't mean they did the right thing — or stopped when they should.
Formalized by Mitchell Hashimoto in February 2026. The model is taken as given. The question is: what is the full system of constraints, feedback loops, tools, and verification that turns raw capability into reliable, autonomous production work?
Every agent failure becomes a permanent engineering fix. AGENTS.md grows. CI gates harden. Linters enforce. Verification loops close. The harness gets smarter each time an agent breaks.
Coined by LangChain after OpenAI's field report: the model supplies intelligence, the harness makes that intelligence useful and reliable. Swap the harness; transform the agent.
Full PRs, multi-session refactors, cross-system migrations. Work that previously required sustained human attention, delegated to agents with structured accountability.
When an agent fails, the question is never "which prompt fixes this?" but "which layer is absent? Which guardrail is missing? Which loop doesn't close?" This is the reasoning mode of harness engineering.
| Phase | Period | Core Question | Bottleneck Solved | Primary Output |
|---|---|---|---|---|
| Prompt Engineering | 2022 – 2023 | How do I phrase this? | Language | Code snippets, boilerplate |
| Context Engineering | 2024 – 2025 | What should the model know? | Information | Context-aware feature code |
| Harness Engineering | 2026 → | What can the model do — and not do? | Control | Autonomous end-to-end tasks |
This is the most important reframe in AI engineering in 2026. Swap the model for a competitor and output quality shifts perhaps 10–15%. Change the harness and you change whether the system works at all.
Anytime you find an agent makes a mistake, you take the time to engineer a solution so that the agent never makes that mistake again.
— Mitchell Hashimoto, February 2026
Days later, Ryan Lopopolo from OpenAI's technical staff published the most cited field report in AI engineering that year. His team had spent five months building a production product with a single hard constraint: zero manually-written lines of code. Every function, every test, every CI script, every piece of documentation: all generated by Codex agents.
The tagline: "Humans steer. Agents execute." Engineers stopped writing code. They started designing the environment that made reliable code generation possible. When something failed, the fix was almost never "write a better prompt." It was: "What capability, context, or structure is missing from the harness?"
The harness engineering definition from Martin Fowler's site (Birgitta Böckeler, April 2026) captures it precisely: "Everything in an AI agent except the model itself. The tools it can use, the permissions it's granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures."
A harness is not a monolithic object. It is a layered system of components, each with a distinct responsibility. The descriptions below cover what each component does and which patterns it relies on.
The tool registry is the complete manifest of capabilities available to the agent. It determines what the agent can do. Orchestration governs how those capabilities get chained into coherent workflows.
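A minimal sketch of that split, assuming a registry that is just a name-to-function map and an orchestrator that runs steps in a fixed order. `ToolRegistry`, `run_workflow`, and the two example tools are hypothetical names for illustration, not any framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToolRegistry:
    """Manifest of the capabilities an agent is allowed to call (illustrative)."""
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self.tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        # A tool that is not in the manifest is unreachable by construction.
        if name not in self.tools:
            raise PermissionError(f"tool not registered: {name}")
        return self.tools[name](**kwargs)


def run_workflow(registry: ToolRegistry, steps: List[Tuple[str, dict]]) -> List[str]:
    """Orchestration: chain registered tools in the order given."""
    return [registry.call(name, **args) for name, args in steps]


registry = ToolRegistry()
# Stand-in tools; real ones would touch the filesystem or a test runner.
registry.register("read_file", lambda path: f"contents of {path}")
registry.register("run_tests", lambda target: f"tests passed for {target}")

results = run_workflow(registry, [
    ("read_file", {"path": "src/app.py"}),
    ("run_tests", {"target": "src"}),
])
```

The registry answers "what can the agent do"; `run_workflow` answers "in what order". Keeping them separate means a capability can be removed without touching any workflow logic.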
Verification that fires during execution — not just at the end. Intermediate checkpoints with structured pass/fail signals are what make self-correction practical. End-of-task verification alone is nearly useless for multi-step work.
Context is finite and expensive. A harness engineers context as a first-class resource. The two failure modes are symmetric: too little (agent lacks what it needs) and too much (attention diluted, context anxiety sets in).
todo.md pattern: the agent re-reads its plan at each checkpoint, preventing goal drift over long tasks.
The core architectural distinction: telling an agent "follow our coding standards" in a prompt relies on probabilistic compliance. Wiring a linter that blocks the PR when standards are violated enforces deterministic constraints. One is advisory. The other is structural.
Full session replay, per-turn token attribution, subagent execution trees, and complete cost tracking. Every tool call, every context load, every verification result, every guardrail trigger — in a structured, queryable trace. Without this, debugging is archaeology.
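As a rough illustration of what "structured, queryable" can mean, here is a sketch of an in-memory trace with per-turn token attribution. The `TraceEvent` fields and the `Trace` class are assumptions made for this example; a real harness would ship these events to a tracing backend.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TraceEvent:
    turn: int
    kind: str      # e.g. "tool_call", "verification", "guardrail" (illustrative)
    name: str
    tokens: int
    ok: bool = True


@dataclass
class Trace:
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, **kwargs) -> None:
        self.events.append(TraceEvent(**kwargs))

    def tokens_by_turn(self) -> Dict[int, int]:
        # Per-turn token attribution: sum the tokens of every event in a turn.
        totals: Dict[int, int] = {}
        for event in self.events:
            totals[event.turn] = totals.get(event.turn, 0) + event.tokens
        return totals

    def failures(self) -> List[TraceEvent]:
        return [event for event in self.events if not event.ok]

    def to_jsonl(self) -> str:
        # One JSON object per line, so the trace can be filtered with standard tools.
        return "\n".join(json.dumps(asdict(event)) for event in self.events)


trace = Trace()
trace.record(turn=1, kind="tool_call", name="read_file", tokens=120)
trace.record(turn=1, kind="verification", name="pytest", tokens=30, ok=False)
trace.record(turn=2, kind="tool_call", name="edit_file", tokens=200)
```

With events in this shape, "which turn burned the budget" and "which check failed first" become queries rather than guesswork.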
Context windows are ephemeral. A state engine persists what matters across turns and sessions — structured checkpoints, decision logs, task state, and environmental snapshots — making multi-session autonomous work tractable.
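A minimal sketch of such a state engine, assuming a single JSON file is enough to hold task state and a decision log. The `StateEngine` class, the file name, and the example values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class StateEngine:
    """Persists task state and a decision log outside the context window."""

    def __init__(self, path: Path):
        self.path = path

    def load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"task": {}, "decisions": []}

    def save(self, task_state: dict, decision: str) -> None:
        state = self.load()
        state["task"] = task_state            # latest checkpoint replaces the old one
        state["decisions"].append(decision)   # the decision log only grows
        self.path.write_text(json.dumps(state, indent=2))


store = Path(tempfile.mkdtemp()) / "agent_state.json"

# Session 1: the agent checkpoints before its context is discarded.
StateEngine(store).save(
    {"phase": "implementation", "done": ["design"]},
    "chose SQLite over Postgres for the prototype",
)

# Session 2: a fresh process resumes from disk, not from memory.
resumed = StateEngine(store).load()
```

The second session never sees the first session's context window; everything it knows about prior work comes from the persisted file.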
A production-grade harness is a layered system. Missing any layer doesn't degrade performance gracefully; it creates systematic failure modes that look like model problems but are harness deficits.
Anthropic's research and OpenAI's field report both identified systematic failure modes that are inherent to how language models reason — but solvable at the harness level. Understanding them by name is the first step to fixing them architecturally.
Agents frequently mark tasks complete without verifying the outcome. Estimated to account for 30–40% of agent failures in production code tasks.
Mandatory completion criterion check: a verification loop that inspects the actual artifact against the stated objective before the task is marked done. Blocked by CI if verification is skipped.
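One way to picture the completion check is a gate that refuses to close a task while any criterion is unmet. This is a sketch under the assumption that criteria can be expressed as predicates over the artifact; `unmet_criteria`, `complete_task`, and the example criteria are hypothetical.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def unmet_criteria(artifact: str, criteria: Dict[str, Check]) -> List[str]:
    """Return the names of criteria the artifact fails; empty means it may close."""
    return [name for name, check in criteria.items() if not check(artifact)]


def complete_task(artifact: str, criteria: Dict[str, Check]) -> str:
    # The agent's own claim of success is not an input here; only the artifact is.
    unmet = unmet_criteria(artifact, criteria)
    if unmet:
        raise RuntimeError(f"task not complete, unmet criteria: {unmet}")
    return "done"


criteria = {
    "defines_function": lambda src: "def slugify(" in src,
    "has_docstring": lambda src: '"""' in src,
}

draft = "def slugify(text):\n    return text.lower().replace(' ', '-')\n"
unmet = unmet_criteria(draft, criteria)
```

String matching stands in for real verification such as running tests or a type checker. The design point is that `complete_task` inspects the artifact itself and has no code path that accepts "I'm finished" on trust.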
As context fills toward the limit, models rush toward completion, cutting corners, making lower-quality decisions, and hallucinating resolutions.
Proactive context compaction at the 75% fill threshold with structured checkpoint injection. The harness, not the model, decides when to compact.
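A sketch of the threshold logic, assuming messages are plain strings and the caller supplies a token counter. The `maybe_compact` function is hypothetical, and the one-line checkpoint string is a placeholder for a real structured summary.

```python
from typing import Callable, List


def maybe_compact(
    messages: List[str],
    token_count: Callable[[str], int],
    limit: int,
    threshold: float = 0.75,
    keep_recent: int = 2,   # must be at least 1
) -> List[str]:
    """Compact older messages once usage reaches threshold * limit."""
    used = sum(token_count(m) for m in messages)
    if used < threshold * limit:
        return messages  # below the threshold: leave the context untouched
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Placeholder checkpoint; a real harness would inject a structured summary.
    checkpoint = f"[checkpoint] {len(old)} earlier messages summarized"
    return [checkpoint] + recent


def count_words(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


history = ["a b c d"] * 10   # 40 "tokens" in total
compacted = maybe_compact(history, count_words, limit=50)
untouched = maybe_compact(history, count_words, limit=100)
```

The model is never asked whether it feels short on space. The harness measures usage and compacts on its own schedule.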
Agents try to solve entire problems in a single execution, producing undocumented tangles of interdependent changes that are impossible to review or roll back.
Phase gates enforced by the harness: design phase → implementation phase → test phase → review phase. Phase transitions require explicit verification outputs.
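The gate sequence can be sketched as a small state machine that refuses to advance without evidence. `PhaseGate` is a hypothetical class, and treating any non-empty string as valid verification output is a simplifying assumption.

```python
PHASES = ["design", "implementation", "test", "review"]


class PhaseGate:
    """Tracks the current phase and the evidence recorded when leaving each one."""

    def __init__(self) -> None:
        self.index = 0
        self.evidence = {}

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def advance(self, verification_output: str) -> str:
        # No evidence, no transition: the gate is enforced here, not in a prompt.
        if not verification_output.strip():
            raise ValueError(f"cannot leave {self.phase!r} without verification output")
        self.evidence[self.phase] = verification_output
        if self.index < len(PHASES) - 1:
            self.index += 1
        return self.phase


gate = PhaseGate()
after_design = gate.advance("design doc reviewed: 3 modules, 1 migration")
```

Because each transition stores its evidence, a reviewer can later see what justified moving from one phase to the next.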
Agents get stuck in repetitive cycles — trying the same broken approach over and over. Token budgets exhaust. No useful work is produced.
Middleware loop-detection hooks that identify identical or near-identical tool call sequences across turns, inject a circuit breaker, and redirect strategy before budget is exhausted.
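A minimal sketch of such a hook, assuming tool calls can be fingerprinted by name plus JSON-serializable arguments. `LoopDetector` is hypothetical, and it only catches exact repeats; near-identical detection would need argument normalization on top of this.

```python
import hashlib
import json
from collections import deque


class LoopDetector:
    """Trips when the same tool call recurs too often within a sliding window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, tool: str, args: dict) -> bool:
        """Record a call; return True when the circuit breaker should trip."""
        payload = json.dumps([tool, args], sort_keys=True).encode()
        fingerprint = hashlib.sha256(payload).hexdigest()
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.max_repeats


detector = LoopDetector()
signals = [
    detector.observe("run_tests", {"target": "src"}),
    detector.observe("edit_file", {"path": "a.py"}),
    detector.observe("run_tests", {"target": "src"}),
    detector.observe("run_tests", {"target": "src"}),  # third identical call
]
```

When `observe` returns True, the harness can stop the turn and inject a message asking for a different strategy instead of letting the token budget drain.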
Without a structured environment map, agents burn massive tokens blindly searching for files or data — often finding what they need on the 15th attempt.
A well-structured AGENTS.md with explicit directory conventions and a codebase map. Agents can navigate purposefully instead of exploring exhaustively.
Over long multi-session tasks, agents lose track of the original objective, gradually drifting toward sub-goals that feel locally important but diverge from the actual target.
The todo.md pattern: the agent re-reads its stated plan at each checkpoint, re-anchoring to the primary objective before proceeding to the next phase.
Agents make irreversible architectural decisions without the context to evaluate long-term consequences — decisions that look locally correct but create large downstream technical debt.
Reversibility scoring applied to every planned action. Actions that score as hard to reverse trigger a decision.md checkpoint and human review before execution.
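As a rough sketch, reversibility scoring can start as a lookup table with a review threshold. The action names, the scores, and the 0.5 threshold below are all invented for illustration; a real harness would derive them from its own tool registry and risk posture.

```python
from typing import List

# Hypothetical scores: 0.0 = cannot be undone, 1.0 = trivially undone.
REVERSIBILITY = {
    "edit_file": 0.9,
    "add_dependency": 0.6,
    "change_public_api": 0.2,
    "drop_table": 0.0,
}


def requires_review(action: str, threshold: float = 0.5) -> bool:
    # Unknown actions are scored as irreversible, so they always need review.
    return REVERSIBILITY.get(action, 0.0) < threshold


def actions_needing_review(plan: List[str]) -> List[str]:
    """Return the planned actions that must pause for a human decision."""
    return [action for action in plan if requires_review(action)]


flagged = actions_needing_review(["edit_file", "drop_table", "rename_service"])
```

Defaulting unknown actions to a score of 0.0 is the conservative choice: a new capability has to earn its way out of review instead of slipping past it.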
Agents, in attempting to be maximally helpful, "fix" things beyond the stated scope — introducing unreviewed changes that break other functionality.
Explicit scope boundaries in AGENTS.md combined with diff-gating: any change outside declared scope files triggers a scope violation warning and human confirmation.
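Diff-gating can be sketched as a comparison between the files a change touches and the glob patterns declared in scope. The `out_of_scope` helper and the example paths are hypothetical; the pattern matching uses the standard library's `fnmatch`, whose `*` also matches path separators.

```python
from fnmatch import fnmatch
from typing import List


def out_of_scope(changed_files: List[str], allowed_patterns: List[str]) -> List[str]:
    """Return changed files that match none of the declared scope patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


declared_scope = ["src/billing/*", "tests/billing/*"]
changed = [
    "src/billing/invoice.py",
    "src/auth/login.py",          # a helpful "fix" outside the task
    "tests/billing/test_invoice.py",
]

violations = out_of_scope(changed, declared_scope)
```

A non-empty `violations` list is the signal to pause and ask for human confirmation before the change proceeds.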
These aren't model quality problems. They're architectural absences. Better prompting addresses none of them. Each has a deterministic harness-level fix that makes the failure mode structurally harder to repeat — not just less likely in any given run.
Harness engineering borrows a precise framing from control theory. A well-designed harness uses both feedforward and feedback control — two complementary mechanisms that together produce robust agent behavior.
Reduce the agent's solution space before any output is generated. The agent cannot produce certain classes of error because the harness prevents those paths from being accessible.
Detect violations in the agent's output and trigger structured correction loops. The agent can still make mistakes; the harness ensures those mistakes are caught and corrected before they become permanent.
The optimal harness uses both. Feedforward controls are more efficient — they prevent errors entirely. But they can't cover every case. Feedback controls are your safety net — they catch what feedforward missed. Neither alone is sufficient. Together, they create defense in depth against every known agent failure mode.
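The two mechanisms can be put side by side in a short sketch. The allowlist is the feedforward control; the check-and-retry loop is the feedback control. `ALLOWED_COMMANDS`, `feedback_loop`, `fake_agent`, and `lint` are hypothetical stand-ins, with `fake_agent` simulating a model that fixes its output once it receives errors.

```python
from typing import Callable, List, Optional, Tuple

# Feedforward: the only commands the agent can reach at all.
ALLOWED_COMMANDS = {"pytest", "ruff", "git diff"}


def feedforward_allows(command: str) -> bool:
    return command in ALLOWED_COMMANDS


def feedback_loop(
    generate: Callable[[Optional[List[str]]], str],
    check: Callable[[str], List[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Feedback: regenerate with the checker's errors until the output is clean."""
    feedback: Optional[List[str]] = None
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        errors = check(output)
        if not errors:
            return output, attempt
        feedback = errors  # wire the violations back in as the next observation
    raise RuntimeError(f"still failing after {max_attempts} attempts: {errors}")


def fake_agent(feedback: Optional[List[str]]) -> str:
    # Simulated model: sloppy on the first try, clean once it sees errors.
    return "x = 1" if feedback else "x = 1   "


def lint(output: str) -> List[str]:
    return ["trailing whitespace"] if output != output.rstrip() else []


result, attempts = feedback_loop(fake_agent, lint)
```

The allowlist never needs to detect anything, because disallowed paths do not exist. The loop accepts that mistakes will happen and bounds how long they survive.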
The OWASP LLM06:2025 "Excessive Agency" risk framing provides the checklist for feedforward control: over-provisioned functions, unnecessary permissions, and missing approval mechanisms are the attack surface. The harness is the defense. Every entry in that checklist maps directly to a harness engineering decision.
The most cited empirical proof of harness engineering's impact is LangChain's Terminal Bench 2.0 results in early 2026. They ran a controlled experiment: improve the agent's performance without changing the underlying model. The result settled the debate.
The independent validation from Ewan Mak's team reinforces this: working with a financial services client's codebase, the same Claude Sonnet 4.6 model went from a 58% pass rate to 81% — purely through harness changes. No weight updates. No model swap. Two weeks of harness engineering: rewriting the system prompt for monorepo layout, adding subagent delegation, wiring linter output back as middleware observations.
Indicative ranges from 2026 practitioner reports. Exact gains depend on baseline harness quality and task domain.
Mitchell Hashimoto's Ghostty terminal repository became the canonical reference implementation. Its AGENTS.md file is a living artifact: each rule corresponds to a specific past agent failure that's now prevented. The file grows incrementally. Each addition makes the failure mode structurally harder to repeat.
This isn't a system prompt. It's a harness configuration document. The difference matters: a system prompt is probabilistic guidance. An AGENTS.md wired to CI gates is a deterministic constraint. One is advisory. The other is structural.
The pattern scales. OpenAI's team enforced "taste invariants" — a small set of rules encoding team engineering standards — as hard CI failures, not warnings. The harness grows each time an agent makes a novel mistake. The team gets smarter without the model changing.
An AGENTS.md that hasn't changed in two weeks isn't mature — it's either perfect (unlikely) or the team isn't learning from agent failures (more likely). A healthy AGENTS.md grows with every novel failure. The rate of new rule additions is a leading indicator of how actively the team is practicing harness engineering.
Harness engineering is not a monolithic project. It's an incremental discipline. The following sequence reflects what teams building production agent systems in 2026 found actually works — ordered by leverage, not complexity.
todo.md re-anchor pattern. Build phase-gate checkpointing. Add decision log infrastructure. Begin tiered context loading.
The most common mistake: trying to build all six layers simultaneously before any layer works well. A deeply engineered L1–L2 (orchestration + verification) produces more reliable agents than a thin implementation of all six. Depth on the first two layers is the highest-leverage starting point.
A practical harness audit checklist — the questions to ask before declaring a harness production-ready:
Single-agent harnesses are the foundation. The emerging frontier is harnesses that coordinate multiple specialized agents working in parallel on shared tasks — and the harness complexity grows non-linearly with agent count.
An orchestrator agent breaks down complex tasks, delegates subtasks to specialized workers, and aggregates results. The harness must manage inter-agent communication, state sharing, and conflict resolution.
Anthropic's three-agent harness study — Planner, Generator, Evaluator — on a 2D retro game engine task produced significantly better results than a solo unconstrained agent on the same task and model. The structured hand-offs and explicit evaluator role removed the victory declaration bias from the generation agent.
Multiple agents work simultaneously on different parts of a shared codebase. The harness must enforce workspace isolation, coordinate writes, and prevent conflicts — essentially a distributed system problem applied to AI systems.
Open research problems: orchestrating hundreds of agents in parallel on a shared codebase without write conflicts. Harnesses that dynamically assemble tools and context just-in-time for a given subtask. Agents that analyze their own traces to propose harness-level fixes — closing the improvement loop autonomously.
In a multi-agent system, inter-agent communication is an attack surface. Anthropic's guidance: treat messages from sub-agents with the same skepticism as messages from untrusted users. Agent authority is granted by the harness, not by another agent.
This prevents "agent impersonation" — a pattern where a compromised or hallucinating sub-agent claims elevated permissions it doesn't have, and a naive orchestrator grants them.
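A minimal sketch of harness-granted authority, assuming the harness knows each message's true sender from its own routing. The `GRANTS` table and `authorize` function are hypothetical; the point is that nothing inside the message body can widen permissions.

```python
from typing import Dict, Set

# Owned by the harness. Agents cannot modify this table.
GRANTS: Dict[str, Set[str]] = {
    "planner": {"read"},
    "generator": {"read", "write"},
}


def authorize(sender: str, action: str, message: dict) -> bool:
    """Decide from the harness's grant table only.

    The message is accepted as a parameter but deliberately never consulted:
    role claims inside it are treated as untrusted input.
    """
    return action in GRANTS.get(sender, set())


# A sub-agent (compromised or hallucinating) claims elevated permissions.
message = {"claimed_role": "admin", "action": "deploy"}

deploy_allowed = authorize("generator", "deploy", message)
write_allowed = authorize("generator", "write", message)
```

An orchestrator built this way cannot be talked into granting access, because the grant decision has no dependency on anything an agent says.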
Anthropic's Managed Agents virtualizes three components: the session (durable event log outside the context window), the sandbox (disposable container where code runs), and the harness loop itself.
This covers L1, L6, and part of L5. The remaining layers — guardrails tuned to your domain and observability integrated with your systems — remain the custom work every team must do.
Teams building AI agents in 2026 tend to cluster into five maturity levels. Each level is defined by what the team treats as the primary variable for improvement. Most teams discover they're at Level 1 or 2 — and mistakenly believe they're at Level 3.
The key diagnostic question for any team: "When an agent fails, what's the first thing you change?" If the answer is "the prompt," you're at Level 1. If the answer is "the context," Level 2. If the answer is "which layer of the harness is missing or broken," you're practicing harness engineering. The question itself is the signal.
Most AI coding agents ship with a default harness already built in. Claude Code is the clearest example: file read/write tools, terminal command execution, a multi-step execution loop, permission controls that prompt for human approval before risky actions. That default harness is what makes it an agent rather than a chatbot.
General-purpose execution capability out of the box. Tools, basic orchestration, a default permission model, standard verification. This is the commodity layer.
Domain-specific tools, custom evals, permission models tuned to your risk posture. Built on top of the prebuilt platform, not instead of it. This is the moat.
| Decision | Buy (Prebuilt) | Build (Custom) |
|---|---|---|
| Tool orchestration runtime | ✓ Buy | |
| Basic permission controls | ✓ Buy | |
| Managed agent sandboxing | ✓ Buy | |
| General observability telemetry | ✓ Buy | |
| Domain-specific tool registry | | ✓ Build |
| Compliance-specific guardrails | | ✓ Build |
| Internal audit log integration | | ✓ Build |
| AGENTS.md (living document) | | ✓ Build |
| Custom evaluation datasets | | ✓ Build |
The strategic recommendation from the 2026 landscape: buy the commodity plumbing (managed runtimes, basic telemetry, control planes) and build the proprietary integrations — domain-specific tools, custom evaluation datasets, internal environment maps, compliance-specific guardrails. The prebuilt harness provides general-purpose reliability. The custom layer provides organizational accountability. Both are necessary. Neither replaces the other.
The strategic horizon emerging in mid-2026 inverts the entire framing. Instead of building smarter harnesses to navigate messy legacy systems, forward-looking organizations are re-architecting their internal APIs, codebases, and databases to be inherently legible to AI agents.
The harness cost reveals the environment cost. If your codebase requires a sophisticated harness to navigate, that's a signal that the codebase itself has complexity debt.
— Emerging principle, AI Platform Engineering teams, 2026
This is Environment Engineering, and it's the next competitive frontier. Clean, well-documented, consistently structured systems need less harness scaffolding to produce reliable agent work. The inverse relationship is reliable: harness complexity is a proxy for environment complexity.
Consistent naming conventions enforced by linters so agents can infer structure without searching. Explicit API contracts maintained as machine-readable specs. Modular architectures that create natural subtask boundaries.
Test suites designed for agents — rapid, unambiguous feedback. Fast unit test cycles rather than slow integration tests as the primary verification signal.
The CNCF's four pillars of platform control — golden paths, guardrails, safety nets, and manual review — are emerging as the design principles that every production harness will implement.
The convergence of CI/CD tooling, agent harnesses, and platform engineering into a unified discipline is the trajectory of the field. Platform engineering becomes AI infrastructure engineering.
The long-term trajectory: organizations that invest in environment engineering today are building a compounding advantage. A cleaner environment requires less harness, which reduces engineering cost, which allows more agent capacity for productive work, which accelerates the flywheel. Harness quality today determines environment quality tomorrow.
Every team building AI agents in 2026 is making a strategic bet. Most are optimizing for the model. The teams that will win are optimizing for the harness.
Humans steer. Agents execute. But the harness is what decides whether that steering produces reliable results — or just faster failures at scale.
— Haroon K M, May 2026