Harness Wars — OpenAI vs Anthropic, and What We Need to Prepare For

0. 30-second summary

In March–April 2026, Anthropic and OpenAI raised "harness" as an official topic of discussion about three weeks apart.

Anthropic — 2026.3.24, "Designing harnesses for long-running applications" (Prithvi Rajasekaran)
OpenAI — 2026.4.15, Agents SDK 2.0 — long-horizon harness, sandbox, subagents

MCP, skills, and CLIs are being pushed as standardized and open — but for the harness layer, both vendors are signaling "design it inside our SDK."

At the same time, infra costs for both PCs and local inference are exploding. Even though Google's TurboQuant (2026.3.25) cuts memory by ~6×, the Jevons Paradox means total demand goes up, not down.

We have to pick one of two paths — play on top of a vendor-locked harness, or design a hybrid native harness ourselves.

1. Why is everyone suddenly talking about "the harness"?

A "harness" is originally horse tack — gear that converts a horse's raw power into controllable direction. In AI, the term now refers to the entire outer-layer system that turns the LLM (the horse) into a controllable worker.

"Claude Code serves as the agentic harness around Claude: it provides the tools, context management, and execution environment that turn a language model into a capable coding agent."

— Anthropic official docs (Memorizer: How Claude Code Works — Harness, AI Agent, Agentic Loop)

The six core components are:


flowchart LR
    Model["LLM<br/>(the horse = inference engine)"]
    subgraph Harness["Harness (the tack)"]
        Tools["Tools<br/>Bash / Read / Write"]
        Perm["Permissions &<br/>approval gates"]
        Sandbox["Sandbox"]
        Session["Session &<br/>memory"]
        Context["Context window<br/>management"]
        Ext["Extensions:<br/>MCP / Skills / Hooks"]
    end
    Model --> Harness
    Harness --> Agent["AI Agent<br/>(judges & acts on its own)"]

Looking at the last two years, the standardization wave moved from model → tools → MCP → skills → CLI. Then in spring 2026, both vendors simultaneously tried to grab the reins back on one specific layer: the harness.

2. Anthropic's harness — "Start simple, add complexity only when needed"

Primary source: <https://www.anthropic.com/engineering/harness-design-long-running-apps> (2026.3.24)

Anthropic published its design principles for harnesses targeting long-running apps (jobs that span hours or days).

2.1 Design philosophy

"Find the simplest solution possible, and only increase complexity when needed."

— Prithvi Rajasekaran, Anthropic Labs

"Every component in a harness encodes an assumption about what the model can't do on its own."

In other words, the harness is code that compensates for the model's weaknesses, and as the model gets stronger, parts of the harness become candidates for deletion. Discarding is part of the design.

2.2 Core pattern — Generator / Evaluator / Planner (GAN-inspired)


flowchart LR
    User["User<br/>(one-line prompt)"] --> Planner["Planner<br/>(turns prompt into detailed PRD)"]
    Planner --> Generator["Generator<br/>(actual builder)"]
    Generator -->|"Sprint Contract<br/>(agreed definition of done)"| Evaluator["Evaluator<br/>(QA via Playwright MCP)"]
    Evaluator -->|"FAIL"| Generator
    Evaluator -->|"PASS"| Done["Done"]

"The generator and evaluator negotiated a sprint contract: agreeing on what 'done' looked like for that chunk of work before any code was written."

2.3 Two findings Anthropic emphasized

Finding	Implication
Models tend to evaluate their own work with "confident praise"	Self-evaluation is unreliable → split out a separate Evaluator agent
Context "full reset (new session)" beats "compression"	Context-window compression tricks have hit a wall; session splitting is more stable

2.4 Cost data (real numbers)

Prompt	Single agent	Full harness
"Create a 2D retro game maker"	20 min / $9	6 hours / $200 (20×, with significantly higher quality)
Building a DAW (Opus 4.6, improved harness)	—	3 h 50 min / $124.70

3. OpenAI's harness — "long-horizon harness × sandbox × 100+ LLMs"

Primary sources:

<https://openai.com/index/the-next-evolution-of-the-agents-sdk/> (2026.4.15)

TechCrunch, "OpenAI updates its Agents SDK..." (2026.4.15)

In Agents SDK 2.0, OpenAI shipped:

3.1 Key concepts

Element	Description
long-horizon harness	Orchestration layer for hour-to-day multi-step jobs. You bring your own compute and storage; the harness provides the coordination layer.
in-distribution harness	A harness "shaped the way the frontier model was trained" — tools, approvals, and tracing that match the model's training distribution to guarantee performance.
sandboxing	Run agents inside controlled compute environments. Contains the risk of unsupervised execution.
subagents / code mode	Codex-style subagents plus a code execution mode.
100+ LLM support	Non-OpenAI models also run on the same harness.
Python first	Python ships first; TypeScript comes later.

"The harness refers to the other components of an agent besides the model it's running on, and an in-distribution harness allows companies to both deploy and test agents running on frontier models."

— TechCrunch (2026.4.15)

3.2 OpenAI's message

OpenAI is bundling AgentKit + Agent Builder + Apps SDK + Agents SDK 2.0 to run the full design → build → deploy → monetize stack inside their own platform (Memorizer: OpenAI DevDay 2025 — ChatGPT's Super-App Strategy).

The pitch:

MCP is "open" — but the harness is meant to be assembled visually inside OpenAI's workbench.

In return you get ChatKit / Widget Builder / Connector Registry as one bundle.

4. OpenAI vs Anthropic — harness comparison

Axis	Anthropic Harness	OpenAI Harness (Agents SDK 2.0)
Announced	2026.3.24	2026.4.15
Core concept	"Start simple → add complexity only as needed"	"long-horizon × in-distribution × sandbox"
Signature pattern	Planner / Generator / Evaluator (3 agents) + Sprint Contract	Sandbox + Subagents + Code Mode + Tracing/Approvals
Model lock-in	Effectively assumes Claude Opus / Sonnet	"100+ LLM compatible" — but in-distribution guarantee only on OpenAI models
Core SDK	Claude Agent SDK + Claude Code	Agents SDK + AgentKit + ChatKit
Visual tooling	None (code-first, Skills/Hooks)	Agent Builder (drag-and-drop canvas)
Openness	"Open-source the harness inspection" — you can look inside	Some open-source + visual builder is cloud-only
Target workload	Full-stack coding, experimental long-running apps	Enterprise workflows, multi-model mixes
Pricing signal	Comfortably bills $124–$200 per task	BYOC (Bring Your Own Compute) — customer pays for compute


flowchart TB
    subgraph Anthropic["Anthropic Harness — code-first minimalism"]
        A1["Claude Agent SDK"]
        A2["Skills + Hooks"]
        A3["Sub-agent pattern<br/>Planner → Generator → Evaluator"]
        A1 --> A2 --> A3
    end
    subgraph OpenAI["OpenAI Harness — full-stack workbench"]
        O1["Agents SDK 2.0<br/>(long-horizon harness)"]
        O2["AgentKit / Agent Builder<br/>(visual canvas)"]
        O3["Sandbox + Subagents<br/>+ Code Mode"]
        O4["ChatKit / Widget Builder<br/>(distribution channels)"]
        O1 --> O2 --> O3 --> O4
    end

5. Why the harness layer specifically is becoming vendor-locked

Skills, MCP, and CLIs have all been standardized or opened up to some degree. The harness is different. There are two structural reasons.

5.1 The "enterprises will pay for this eventually" learned behavior


flowchart LR
    GH["GitHub<br/>(Enterprise)"] -->|"ACL/SAML/Audit"| ENT1["Enterprise ISMS requirement"]
    Slack["Slack<br/>(Enterprise Grid)"] -->|"DLP/eDiscovery"| ENT1
    Notion["Notion<br/>(Enterprise)"] -->|"SCIM/audit logs"| ENT1
    Claude["Claude Team/Enterprise"] -->|"RBAC/SSO"| ENT1
    OpenAI["ChatGPT Business/Enterprise"] -->|"audit logs/usage analytics"| ENT1

Pattern repeated across SaaS history — basic ACL, audit, and SSO always end up gated behind the most expensive plan. The harness is going down the same road:

Free / MAX = personal coding assistant

Team / Enterprise = harness governance (audit, tracing, multi-seat, policy gates)

The 90% ChatGPT Pro discount in Korea (Memorizer: ChatGPT Pro 90% Discount Promotion in South Korea (대한민국)) shows this clearly. Even though Sam Altman openly admitted that the $200/mo Pro plan runs at a loss, OpenAI subsidized it 90% in Korea solely to buy a single metric — conversion rate — ahead of an IPO. Call it a prelude to the harness license war.

5.2 OpenAI's "harness = part of the model's training distribution" claim

OpenAI used the phrase in-distribution harness. The implication: the model's training distribution (behavior cloning, RLHF, RLAIF) already encodes the shape of their harness, and you can't extract 100% of the model's performance without it.

If true, this means no matter how standardized MCP becomes, the harness is the layer the model provider can build most efficiently — i.e., the strongest lock-in point.

6. Infra costs are exploding — both cloud and local get more expensive

6.1 LLMs are getting cheaper like air, but "everything around them" is getting more expensive

"AI's enemy is time. Today's Pro-tier inference becomes free-tier in 6 months."

— Memorizer: ChatGPT Pro 90% Discount Promotion in South Korea (대한민국)

Token prices are dropping. But with the rise of agents, the surrounding infra is getting expensive:

GPU: for LLM inference

Memory / SSD / NAND: local PC prices spiking (as users have noted)

Compute (BYOC): OpenAI explicitly says bring your own compute for the long-horizon harness

6.2 Google's counter — TurboQuant + the Jevons Paradox

Primary source: <https://v.daum.net/v/20260329050205578> (2026.3.29)

On 2026.3.25, Google announced TurboQuant (KV-cache compression + quantized inference).

Metric	Change
Memory usage	as low as 1/6 of baseline
Throughput	up to 8×
Reference: 70B-param LLM serving 512 concurrent users	512 GB → under 100 GB with TurboQuant

Naively this looks bad for memory makers — demand should fall. Right after the announcement, foreign net selling of Samsung Electronics hit ₩2.94T and SK Hynix took a market shock too.

But KB Securities analyst Kim Dong-won and Sungkyunkwan University professor Kwon Seok-jun cited the Jevons Paradox for a different read:

"If inference truly gets cheaper, the freed memory gets used for applications that were previously priced out — total memory demand could explode rather than shrink."


flowchart LR
    A["Efficiency 6×↑<br/>(TurboQuant)"] --> B["Unit cost ↓"]
    B --> C["Agent workloads<br/>explode"]
    C --> D["Total memory · GPU · power<br/>demand ↑↑"]
    D --> E["Local PC parts<br/>+ cloud bills<br/>both rise"]

In short: as the harness era kicks in, LLMs get cheaper, but the agent full-stack bill gets bigger.

6.3 Cloud vs local — both get more expensive at the same time

Category	2024	2026 (now)
Production workloads	Mostly backend servers	• agent compute (BYOC)
Dev machine	Code editor + browser	• local agent + Pencil/Playwright + context cache → 64 GB RAM no longer enough; 96–128 GB becoming standard
SSD / NAND	1 TB sufficed	2 TB+ NVMe (vector DB, model weights, session logs)
GPU	Optional	24 GB+ VRAM is effectively the entry bar (for local sLM / on-device inference)

7. So what should we prepare?

7.1 Three scenarios


flowchart TD
    Choice["Our choice"]
    Choice --> S1["Scenario A<br/>Play on top of vendor-locked harness"]
    Choice --> S2["Scenario B<br/>Design a native harness ourselves"]
    Choice --> S3["Scenario C<br/>Hybrid — partly on-device, partly frontier model"]

    S1 -->|"Pros"| S1P["Fast start, full-stack tooling provided"]
    S1 -->|"Cons"| S1C["Hostage to license & pricing tier, model lock-in"]

    S2 -->|"Pros"| S2P["Full control, multi-vendor possible"]
    S2 -->|"Cons"| S2C["Engineering cost, learning curve"]

    S3 -->|"Pros"| S3P["Cost-optimal, risk diversified"]
    S3 -->|"Cons"| S3C["Vendor harness may not allow this kind of bridging"]

7.2 The trap — "forced upgrade to Team/Enterprise"

Claude's MAX plan is for personal use. Blocking auth above N devices is the exact same playbook Netflix used to kill household sharing — tolerated at first, blocked once IPO and revenue pressure mounts. If you start using it seriously inside a company, you'll likely be forced to migrate to Claude Team / OpenAI Business.

7.3 Recommended internal actions

Action	Why
1. Build our own harness notation standard	Translate Anthropic's Planner-Generator-Evaluator into our domain vocabulary as an internal harness guide (one page). No matter how vendor SDKs change, the layer naming stays under our control.
2. Self-host the MCPs we depend on	Self-host memorizer / Atlassian / Notion MCPs — even if a vendor pushes lock-in, the tool layer survives.
3. Define a BYOC infra catalog	The OpenAI BYOC model may become standard — pre-define internal GPU / VRAM / NVMe baseline specs.
4. Always separate the Evaluator	Anthropic's key finding — never trust model self-eval. Bake an Evaluator step into every writing / code / planning workflow.
5. Treat harness tracing as a contract	What does the vendor send back from our data, and what do they train on? Tracing standards will be the next ISMS clause.

8. Closing — Which side are we betting on?

"Every component in a harness encodes an assumption about what the model can't do on its own."

That one sentence is the whole point. The harness doesn't go away. As the model gets stronger, only its shape changes. Yesterday's required guardrail disappears tomorrow; yesterday's missing governance becomes core tomorrow.

Two questions remain.

"Should we become native harness engineers?" — Yes. But not at every layer — only at the domain-specific layer (e.g., our domain's Evaluator, our 5-axis evaluation, our governance policies).

"Is it more rational to just play on top of the vendor harness?" — Partly yes. But keep three escape hatches reserved in advance: MCP, harness tracing, and data governance.

Vendors will eventually nudge us toward more expensive tiers. To have leverage when that day comes, we need to keep one option always alive: "we can throw the harness away."