TECH

Context and harness engineering as the new foundation for reliable AI agents

Strategic Overview

01. Anthropic published an engineering reference describing an effective harness for long-running coding agents built from an initializer agent and a coding agent that communicate via a claude-progress.txt artifact and git history.
02. Tencent Cloud open-sourced TencentDB Agent Memory under MIT in May 2026 as a fully local four-tier (L0-L3) memory pipeline for AI agents with zero external API dependencies.
03. A new arXiv paper, Natural-Language Agent Harnesses (Pan et al., March 2026), proposes representing agent harnesses as editable natural-language documents executed by an Intelligent Harness Runtime so policy is inspectable rather than buried in code.
04. LangChain codified context engineering for agents as four strategies (write, select, compress, isolate), and Cognition AI called context engineering the number-one job for engineers building AI agents.

Under the Hood: The Four Moves That Replaced Prompt Engineering

The cleanest way to read this whole moment is through LangChain's four-verb framework. Context, they argue, can only be written (saved outside the window), selected (pulled back in when relevant), compressed (kept only as the tokens you actually need), or isolated (split across sub-agents) ^[1]. That tiny taxonomy is doing a lot of work, because every harness pattern shipping in 2025 and 2026 is some specific combination of those four moves.
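To make the four verbs concrete, here is a minimal Python sketch. Everything in it (the ContextStore class, the word-overlap matching, truncation as a stand-in for summarization) is a hypothetical illustration, not LangChain's API or any shipping harness.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Toy scratchpad that lives outside the model's context window (hypothetical)."""
    notes: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, text: str) -> None:
        # WRITE: persist information outside the window.
        self.notes[key] = text

    def select(self, query: str) -> list[str]:
        # SELECT: pull back only notes that share at least one word with the query.
        words = set(query.lower().split())
        return [t for t in self.notes.values() if words & set(t.lower().split())]


def compress(text: str, max_words: int) -> str:
    # COMPRESS: crude truncation standing in for a real summarizer (assumption).
    return " ".join(text.split()[:max_words])


def isolate(tasks: list[str], store: ContextStore) -> dict[str, list[str]]:
    # ISOLATE: each sub-task receives its own small context, not the whole store.
    return {task: store.select(task) for task in tasks}


store = ContextStore()
store.write("bug", "the parser fails on nested lists")
store.write("ops", "deploy uses docker compose")
per_task = isolate(["parser bug", "docker deploy"], store)
```

In this toy run each sub-task ends up with exactly one note, which is the point of isolation: neither sub-agent carries the other's context.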

TencentDB Agent Memory is the most literal implementation. Its four tiers (L0 conversation, L1 atom, L2 scenario, L3 persona) correspond to raw dialogue, atomic facts, scene blocks, and a user profile ^[2]. Full tool logs are written out to refs/*.md files, and live task state is encoded in Mermaid task canvases the agent re-reads instead of carrying a giant transcript. That is write plus compress, made operational. Anthropic's long-running coding harness does the same trick with cruder tools: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress, communicating through a claude-progress.txt artifact and git history because, in their words, context windows are limited and agents need a way to bridge the gap between coding sessions ^[3].
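The handoff idea can be sketched in a few lines of Python. Only the file name claude-progress.txt comes from Anthropic's post; the checkbox format, the function names, and the omission of git are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional

PROGRESS_FILE = "claude-progress.txt"  # name from the source; the format below is assumed


def initialize(workdir: Path, features: list[str]) -> None:
    """First session: record the to-do list so later sessions can pick it up."""
    lines = [f"[ ] {name}" for name in features]
    (workdir / PROGRESS_FILE).write_text("\n".join(lines) + "\n")


def coding_session(workdir: Path) -> Optional[str]:
    """Later session: read state from disk, finish one item, write state back."""
    path = workdir / PROGRESS_FILE
    lines = path.read_text().splitlines()
    for i, line in enumerate(lines):
        if line.startswith("[ ] "):
            lines[i] = "[x] " + line[4:]  # placeholder for the real coding work
            path.write_text("\n".join(lines) + "\n")
            return line[4:]
    return None  # nothing left to do


# Each call below stands in for a fresh context window with no shared memory.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    initialize(workdir, ["login", "search"])
    first = coding_session(workdir)
    second = coding_session(workdir)
    third = coding_session(workdir)
    final_state = (workdir / PROGRESS_FILE).read_text()
```

Nothing survives between calls except the file, so every session can start cold and still know where the previous one stopped.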

The Natural-Language Agent Harnesses paper from Pan et al. pushes the same logic up a level. Instead of letting these write/select/compress/isolate choices live inside opaque code, NLAHs are editable documents that describe run-level harness policy, and an Intelligent Harness Runtime interprets those documents into agent calls, handoffs, state updates, validation gates, and artifact contracts ^[4]. The harness itself becomes a piece of context, written in language, that any reviewer can read.

The Benchmark That Breaks the Cost-Quality Tradeoff

The hardest claim to dismiss in this whole stack comes from Tencent's benchmark table. Adding TencentDB Agent Memory to the OpenClaw agent cuts token usage on WideSearch by up to 61.38 percent, and at the same time lifts the WideSearch task pass rate from 33 percent to 50 percent, a 51.52 percent relative improvement ^[5]. PersonaMem accuracy jumps from 48 percent to 76 percent, a gain of 28 percentage points. SWE-bench success climbs from 58.4 percent to 64.2 percent while consuming a reported 33.09 percent fewer tokens, moving from 3,474.1 million to 2,375.4 million ^[5]. The usual cost-quality curve has the two trading off; here the agent pays less and scores more on the same model.
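The arithmetic behind these figures is easy to reproduce. The sketch below uses only the numbers quoted above; the helper name is an illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Percent change relative to the starting value."""
    return (after - before) / before * 100


pass_rate_gain = relative_change(33, 50)        # WideSearch pass rate, relative
persona_points = 76 - 48                        # PersonaMem, percentage points
swe_points = round(64.2 - 58.4, 1)              # SWE-bench, percentage points
token_change = relative_change(3474.1, 2375.4)  # SWE-bench tokens, in millions

print(f"WideSearch relative gain: {pass_rate_gain:.2f}%")
print(f"PersonaMem gain: {persona_points} points")
print(f"SWE-bench gain: {swe_points} points")
print(f"Token change from quoted totals: {token_change:.1f}%")
```

The pass-rate and PersonaMem figures reproduce exactly. The two token totals work out to a reduction of roughly 31.6 percent, not the quoted 33.09 percent. The text does not say how 33.09 was calculated, so it may rest on a different basis, such as a per-task average.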

The explanation, once you read the architecture, is almost boring: the agent stops dragging an enormous transcript through every step and instead retrieves a compressed, structured slice (atoms, scenes, persona) that actually fits the task. Less junk in the window means both fewer prompt tokens and better attention to what matters. The implication for buyers is sharper than the numbers: a context-and-memory layer is now a cost-reduction project that also happens to raise the ceiling, which is a very different procurement conversation than swapping in a bigger model.
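A toy Python version of that retrieval step is below. The tier names follow the L0-L3 description above; the class, the word-overlap scoring, and the prompt layout are assumptions for illustration and do not reflect TencentDB Agent Memory's actual interfaces.

```python
from dataclasses import dataclass, field


def _overlap(task: str, text: str) -> int:
    """Count words shared by the task and a memory entry (toy relevance score)."""
    return len(set(task.lower().split()) & set(text.lower().split()))


@dataclass
class TieredMemory:
    conversation: list[str] = field(default_factory=list)  # L0: raw dialogue
    atoms: list[str] = field(default_factory=list)         # L1: atomic facts
    scenarios: list[str] = field(default_factory=list)     # L2: scene blocks
    persona: str = ""                                      # L3: user profile

    def build_context(self, task: str, max_atoms: int = 3) -> str:
        """Assemble a prompt slice from L1-L3; the L0 transcript stays out of the prompt."""
        ranked = sorted(self.atoms, key=lambda a: _overlap(task, a), reverse=True)
        atoms = [a for a in ranked if _overlap(task, a) > 0][:max_atoms]
        scenes = [s for s in self.scenarios if _overlap(task, s) > 0]
        return "\n".join([f"Persona: {self.persona}"] + scenes + atoms)


memory = TieredMemory(
    conversation=["user: hi, long chat about lunch and weather"],
    atoms=["user prefers pytest", "repo uses poetry", "deadline is friday"],
    scenarios=["debugging the repo build last week"],
    persona="backend developer",
)
context = memory.build_context("run pytest in the repo")
```

The resulting slice is four short lines, and the raw transcript never enters it. The token savings in the benchmark come from doing this at scale.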

The broader market is pricing this in. The 2026 State of Context Management Report finds that 82 percent of IT and data leaders say prompt engineering alone is no longer sufficient and that 95 percent of data teams plan to invest in context engineering training in 2026 ^[6]. When that many buyers are budgeting for the discipline, the open-source memory engines and harness templates land in prepared soil.

Why the AI-Engineering Job Just Split in Two

Addy Osmani's essay on agent harness engineering crystallizes a shift the practitioner community has been circling for a year: a decent model with a great harness beats a great model with a bad harness ^[7]. Inside that line is a labor-market thesis. Model work (pretraining, RLHF, distillation) stays inside a small number of frontier labs. Harness work (context routing, memory, tool definitions, evals, sub-agent orchestration) is where everyone else now has leverage. The slogan circulating with the essay, "Agent = Model + Harness," with its corollary that if you are not the model, you are the harness, names the new division of labor for AI engineers ^[7].

The pattern shows up in the structure of the systems themselves. Anthropic's three-agent harness for long-running application development splits planning, generation, and evaluation across distinct agents to maintain coherence and improve output quality over multi-hour AI sessions ^[8]. Cognition AI is even blunter, telling builders that context engineering is effectively the number-one job of engineers building AI agents ^[1]. Andrej Karpathy framed the same point from the modeling side when he argued for context engineering over prompt engineering, calling it the delicate art and science of filling the context window with just the right information for the next step ^[9].
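The control flow of a planner/generator/evaluator split can be sketched without any model at all. In the Python below the three roles are plain functions standing in for separate agents; the names, the retry limit, and the stub behavior are all assumptions, not Anthropic's implementation.

```python
from typing import Callable

Planner = Callable[[str], list[str]]    # goal -> ordered steps
Generator = Callable[[str, int], str]   # (step, attempt number) -> draft
Evaluator = Callable[[str, str], bool]  # (step, draft) -> accept?


def run_harness(goal: str, plan: Planner, generate: Generator,
                evaluate: Evaluator, max_attempts: int = 3) -> dict[str, str]:
    """Plan once, then loop generate/evaluate per step until a draft is accepted."""
    results: dict[str, str] = {}
    for step in plan(goal):
        for attempt in range(1, max_attempts + 1):
            draft = generate(step, attempt)
            if evaluate(step, draft):  # the judge is separate from the worker
                results[step] = draft
                break
        else:
            raise RuntimeError(f"step failed after {max_attempts} attempts: {step}")
    return results


# Stubs standing in for three distinct agents.
def stub_plan(goal: str) -> list[str]:
    return goal.split(" and ")


def stub_generate(step: str, attempt: int) -> str:
    return f"{step} v{attempt}"


def stub_evaluate(step: str, draft: str) -> bool:
    return draft.endswith("v2")  # rejects every first draft


outcome = run_harness("build api and write tests", stub_plan, stub_generate, stub_evaluate)
```

The design choice worth noticing is that the generator never decides when it is done; only the evaluator can release a step.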

The community reaction tracks the split very visibly. Spec-first development, intentional compaction triggered by a fixed token threshold, and sub-agents-exposed-as-tools dominate the most-watched practitioner talks on developer YouTube, while the loudest takes on X read more like job descriptions than research commentary, coining harness engineering as an emerging discipline that applies context-engineering principles to how you operate an existing agent. The center of gravity, in other words, has moved out of the prompt and into the surrounding system.
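Threshold-triggered compaction, one of the patterns named above, reduces to a small guard in front of each model call. This Python sketch assumes whitespace word counts as a stand-in for a tokenizer and a placeholder string where a model-written summary would go.

```python
def count_tokens(text: str) -> int:
    # Whitespace word count as a stand-in for a real tokenizer (assumption).
    return len(text.split())


def maybe_compact(history: list[str], threshold: int, keep_recent: int = 2) -> list[str]:
    """Replace older messages with a summary once the history exceeds the threshold."""
    total = sum(count_tokens(message) for message in history)
    if total <= threshold or len(history) <= keep_recent:
        return history  # under budget: leave the window untouched
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = f"[summary of {len(old)} earlier messages]"  # placeholder summary
    return [summary] + recent


history = ["a b c", "d e f", "g h", "i"]  # 9 "tokens" in total
compacted = maybe_compact(history, threshold=5)
untouched = maybe_compact(history, threshold=20)
```

Because the trigger is a fixed number rather than the model's own judgment, compaction happens at a predictable point instead of whenever the window happens to overflow.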

The Harness Half-Life: Why Every Workaround Has a Deprecation Date

There is a contrarian read buried in Anthropic's own post that almost nobody quotes: long-running coding agents need scaffolding because, as Justin Young puts it, getting agents to make consistent progress across multiple context windows remains an open problem ^[3]. The uncomfortable corollary is that every piece of harness machinery (every artifact file, every compaction policy, every sub-agent tool) is a workaround for a current model limitation. When the limitation moves, the workaround becomes dead weight.

This is the load-bearing-for-nothing risk. A planner/generator/evaluator split makes a lot of sense when a single agent loses the plot over a four-hour run; it makes less sense when the next model holds plan coherence natively. A four-tier memory engine is a competitive moat right up until in-context retrieval gets cheap enough that you can just stream the relevant slice. Harness components, in short, have a deprecation half-life that is shorter than most teams want to admit, and the discipline of harness engineering has to include knowing what to delete when the model catches up.

Reddit's coding-agent communities are already running ahead of the labs on this point. A loud strand of pushback in the LLM and AI-agent subreddits argues that piling on skills docs, architecture notes, and sub-agents actively poisons the context window, that you should let the LLM find and manage its own context, and that sub-agents are great only if you have tokens to burn. Others are converging on tiered trust patterns (architectural decision records as trusted, reviewed agent output as semi-trusted, scratch as untrusted) and typed query interfaces over REST or grep for small-model coding agents. The signal underneath all of it is the same: context and harness engineering are now real disciplines, but the artifacts they produce are perishable. Builders who treat their harness like infrastructure rather than scaffolding will spend 2026 carrying a lot of cruft into 2027.
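The tiered trust pattern described in those threads can be expressed as a simple admission filter. The three levels mirror the ones named above; the source labels, the mapping, and the function are hypothetical.

```python
from enum import IntEnum


class Trust(IntEnum):
    UNTRUSTED = 0
    SEMI_TRUSTED = 1
    TRUSTED = 2


# Hypothetical mapping from document kind to trust level.
SOURCE_TRUST = {
    "adr": Trust.TRUSTED,                   # architectural decision records
    "reviewed_output": Trust.SEMI_TRUSTED,  # agent output a human has reviewed
    "scratch": Trust.UNTRUSTED,             # working notes
}


def admit(documents: list[tuple[str, str]], minimum: Trust) -> list[str]:
    """Keep only documents whose source meets the minimum trust level."""
    return [
        text
        for kind, text in documents
        if SOURCE_TRUST.get(kind, Trust.UNTRUSTED) >= minimum  # unknown kinds are untrusted
    ]


docs = [
    ("adr", "use postgres for persistence"),
    ("reviewed_output", "refactor plan approved in review"),
    ("scratch", "maybe try sqlite?"),
]
strict = admit(docs, Trust.TRUSTED)
relaxed = admit(docs, Trust.SEMI_TRUSTED)
```

A filter like this is also cheap to delete, which matters if the perishability argument holds.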

Historical Context

2025-06-25

Andrej Karpathy argues on X for context engineering over prompt engineering, kicking off the rebrand that the rest of the field adopts.

2025-07-02

LangChain publishes Context Engineering for Agents, formalizing the write, select, compress, and isolate taxonomy that becomes the field's default vocabulary.

2025-11-26

Anthropic releases Effective harnesses for long-running agents, establishing the initializer-plus-coder pattern as a reference architecture.

2026-03-26

Pan et al. submit Natural-Language Agent Harnesses to arXiv, proposing executable natural-language documents and a shared runtime to make harness policy inspectable.

2026-04-04

InfoQ reports Anthropic's three-agent harness that splits planning, generation, and evaluation to keep multi-hour sessions coherent.

2026-04-19

Addy Osmani publishes Agent Harness Engineering, synthesizing the field and crystallizing the Agent = Model + Harness framing for a wider engineering audience.

2026-05-14

Tencent Cloud open-sources TencentDB Agent Memory under MIT, shipping a four-tier local memory pipeline with benchmarks on the OpenClaw agent across WideSearch, PersonaMem, and SWE-bench.

Power Map

Key Players

Anthropic

Publishes the canonical harness-engineering references and ships the Claude Agent SDK, effectively setting the industry vocabulary for initializer/coder splits and three-agent planner-generator-evaluator harnesses.

Tencent Cloud

Released the first widely visible MIT-licensed long-term memory engine for AI agents, putting a four-tier memory architecture and concrete benchmarks into the open-source commons.

Andrej Karpathy

Popularized the term context engineering over prompt engineering, anchoring the new vocabulary that the rest of the field has rallied around.

Tobi Lutke

Co-popularized context engineering from the operator side, framing it as the discipline of supplying everything a task needs to be plausibly solvable by the model.

LangChain

Codified the four-strategy framework (write, select, compress, isolate) that practitioners now reach for when describing what context engineering actually does.

Pan, Zou, Guo, Ni, Zheng

Authored the Natural-Language Agent Harnesses paper that pushes harness policy out of opaque code and into inspectable executable documents.

THE SIGNAL.

Analysts

"Calls context engineering the delicate art and science of filling the context window with just the right information for the next step, and argues the framing is more accurate than prompt engineering for industrial LLM apps."

Andrej Karpathy

AI researcher, formerly Tesla and OpenAI

"Frames the core difficulty bluntly: getting agents to make consistent progress across multiple context windows remains an open problem, which is why harness scaffolding rather than a longer context is the real fix."

Justin Young

Engineer, Anthropic

"Argues that separating the agent doing the work from the agent judging it proves to be a strong lever for keeping long-running runs coherent."

Prithvi Rajasekaran

Engineering lead, Anthropic Labs

"Compresses the new doctrine to a single line: a decent model with a great harness beats a great model with a bad harness."

Addy Osmani

Engineering leader, Google Chrome

"Splits the agent stack into Agent = Model + Harness, and tells builders that if you are not the model, you are the harness, naming the new division of labor in AI engineering."

Viv Trivedy

Practitioner cited in Osmani's essay on harness engineering

"Defines context engineering as the art of providing all the context for the task to be plausibly solvable by the LLM, reframing prompting as a supply problem rather than a wording problem."

Tobi Lutke

CEO, Shopify

"States that context engineering is effectively the number-one job of engineers building AI agents, putting it above model choice or prompt wording on the priority list."

Cognition AI

Team behind the Devin coding agent

The Crowd

"+1 for "context engineering" over "prompt engineering". People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window"

@karpathy

"New on the Anthropic Engineering Blog: Long-running AI agents still face challenges working across many context windows. We looked to human engineers for inspiration in creating a more effective agent harness."

@AnthropicAI

"there's a new concept I'm seeing emerging in AI Agents (especially coding agents), which I'll call "harness engineering" - applying context engineering principles to how you use an existing agent Context engineering -> how context (long or short, agentic or not) is passed to an"

@dexhorthy

"Are agent context engines actually becoming a thing?"

@u/regular-tech-guy10

Broadcast

Advanced Context Engineering for Agents

Context Engineering for Agents

Context Engineering for AI Agents with LangChain and Manus