Governing one agent works. You write a constitution, define controls, and enforce compliance through an external judge. The previous articles in this series describe that architecture: a governance catalog of 60 controls across 9 families, loaded into context and checked by a Council of 3. For a single agent, that's sufficient.

But MÆI doesn't operate as a single agent. When I ask it to fix a flaky test, it doesn't fix the test itself — it constructs a purpose-built agent for that specific task, delegates the work, and evaluates the result. When the task requires subtasks, those agents can construct their own children. The governance surface expands from one model to many, each with a different scope, different tools, and a different job.

You can't load all 60 controls into every sub-agent. The cost isn't primarily tokens — it's attention dilution. An agent drowning in communication preference controls while trying to debug a race condition performs worse than one carrying only the controls relevant to root cause analysis. The question becomes: how do you select which controls matter for which task, and how do you verify compliance when the governor isn't the same entity as the executor?

The Governor Who Delegates

The architectural pattern that emerged is a governor that delegates but never executes. MÆI sits at depth 0 — it reasons about what needs to happen, constructs delegations, and evaluates what comes back. It never writes code, never modifies files, never runs commands. Its capabilities are judgment, decomposition, and quality evaluation. This isn't a limitation; it's the design. Separating governance from execution means the entity evaluating output isn't the same entity that produced it. The governor has no ego investment in the work.

At depth 1, purpose-built agents execute bounded tasks. A bugfix agent. A documentation agent. A research agent. Each is constructed for its specific task, given the tools it needs, injected with the governance controls relevant to its work, and terminated when the task is complete. These aren't persistent agents with accumulated state — they're ephemeral, single-purpose, and disposable. The governance DNA they carry is selected for their task, not inherited from a previous run.

Depth 1 agents can create child agents when they need to decompose their work further.

At depth 2, terminal agents execute subtasks and cannot spawn further agents under any circumstances. This is the hard cap. No exceptions, no "just one more level." The constraint exists because delegation chains beyond depth 2 lose governance fidelity and become unauditable. If a depth 1 agent injects controls into a depth 2 child, the governor can still trace what happened. Add a depth 3, and you have agents constructing agents constructing agents with progressively degraded governance DNA — the constitutional equivalent of a game of telephone.
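The hard cap is easiest to keep honest when it is enforced mechanically rather than by convention. A minimal sketch in Python; the class and method names are hypothetical illustrations, not MÆI's actual implementation:

```python
MAX_DEPTH = 2  # hard cap: depth-2 agents are terminal

class Agent:
    def __init__(self, depth: int):
        if depth > MAX_DEPTH:
            raise ValueError(f"depth {depth} exceeds the hard cap of {MAX_DEPTH}")
        self.depth = depth

    def spawn_child(self) -> "Agent":
        # Depth-2 agents are terminal: no exceptions, no "just one more level".
        if self.depth >= MAX_DEPTH:
            raise PermissionError("terminal agent may not spawn children")
        return Agent(self.depth + 1)

governor = Agent(depth=0)        # governor: judges, never executes
worker = governor.spawn_child()  # depth 1: purpose-built agent
terminal = worker.spawn_child()  # depth 2: terminal, cannot spawn
```

Putting the check in the constructor as well as the spawn method means even a buggy caller cannot construct an agent past the cap.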

I've been tempted to relax the limit. Complex tasks sometimes feel like they want deeper decomposition. But every time I've traced that impulse, the real problem was that the depth 1 TaskSpec was too broad. The answer is always to restructure the task breakdown, not the depth limit. Tight decomposition at the top eliminates the need for deep chains at the bottom.

TaskSpecs

The unit of delegation is the TaskSpec — a structured specification that tells a purpose-built agent exactly what it needs to do and how "done" is defined. A TaskSpec has five fields.

Intent captures why the delegation exists, not what to do mechanically. Not "write a test" but "verify the brief compiler handles empty input correctly." Deliverable defines what concrete output the agent must produce — specific enough to verify. Quality bar defines what "done well" looks like, and it must be testable. "Good documentation" is not testable. "A new developer can use the module from docstrings alone" is testable.

Constraints set hard boundaries the agent must not cross. Context provides the background the agent needs — project state, prior decisions, relevant file paths — enough that it doesn't need to ask questions.
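The five fields map directly onto a small data structure. A sketch in Python, with field names inferred from the descriptions above and example values invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    intent: str        # why the delegation exists, not mechanics
    deliverable: str   # concrete output, specific enough to verify
    quality_bar: str   # testable definition of "done well"
    constraints: list[str] = field(default_factory=list)  # hard boundaries
    context: list[str] = field(default_factory=list)      # project state, decisions, paths

# Hypothetical example values:
spec = TaskSpec(
    intent="Verify the brief compiler handles empty input correctly",
    deliverable="A test that feeds empty input and asserts the documented fallback",
    quality_bar="Test fails if the fallback behavior regresses; passes on current main",
    constraints=["Do not modify the compiler module itself"],
)
```

Keeping constraints and context as lists of short items, rather than free paragraphs, makes the "more than a short paragraph means decompose" heuristic visible at a glance.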

The design principle is that the best delegation is one an agent cannot misunderstand. If any field in a TaskSpec requires more than a short paragraph, the scope is too broad. Decompose further. A TaskSpec that needs three paragraphs of constraints is a symptom of a decomposition problem, not a specification problem. Break it into smaller pieces until each one is obvious.

In practice, this means I spend more time on decomposition than on any other part of the delegation cycle. A well-formed TaskSpec produces a good result on the first attempt. A vague one produces something that technically satisfies the literal request while missing the point entirely. The agent isn't being obtuse — it's doing exactly what was specified. The failure is always in the spec.

DNA Injection

DNA injection is the mechanism that makes per-task governance possible. The DNA Injector reads the full governance catalog — 60 controls across 9 families: ANALYZE (6), JUDGE (7), REASON (6), PROC (13), PREF (6), RECOVER (5), COLLAB (7), META (6), and CONTENT (4) — and filters by relevance to the specific TaskSpec.

A bugfix agent receives cg:ANALYZE-6 (structural root cause analysis) and cg:PROC-12 (best practice verification), because those are the controls that matter for diagnosing and fixing defects. A documentation agent receives CONTENT and PREF family controls, because its job is producing clear, well-structured prose. A research agent gets JUDGE controls for evidence evaluation and REASON controls for logical inference.

The full catalog stays intact. Each agent sees only what applies to its task.

The filtering starts with keyword matching against the TaskSpec fields, but it's augmented by learned weights derived from delegation traces. Controls that historically correlate with successful outcomes for a given task category get higher selection priority. The 22 NON_NEGOTIABLE controls get priority over the 38 RECOMMENDED ones when relevance scores are close, because violating a non-negotiable control is categorically worse than missing a recommendation.
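The selection logic described above (keyword relevance, learned weights, and tier priority as a near-tie breaker) can be sketched as a scoring function. The control entries, keywords, and the 0.05 tier bonus are illustrative assumptions, not MÆI's actual values:

```python
def score(control, taskspec_text, weights):
    # Keyword relevance: fraction of the control's keywords present in the TaskSpec.
    text = taskspec_text.lower()
    hits = sum(1 for kw in control["keywords"] if kw in text)
    relevance = hits / max(len(control["keywords"]), 1)
    if relevance == 0:
        return 0.0  # never inject a control with no relevance to the task
    learned = weights.get(control["id"], 1.0)  # learned weight, 1.0 before traces accumulate
    # Small bonus breaks near-ties in favor of NON_NEGOTIABLE controls.
    tier_bonus = 0.05 if control["tier"] == "NON_NEGOTIABLE" else 0.0
    return relevance * learned + tier_bonus

def inject_dna(catalog, taskspec_text, weights, k=5):
    scored = [(score(c, taskspec_text, weights), c["id"]) for c in catalog]
    scored = [(s, cid) for s, cid in scored if s > 0]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

catalog = [
    {"id": "cg:ANALYZE-6", "keywords": ["root cause", "defect", "diagnose"], "tier": "NON_NEGOTIABLE"},
    {"id": "cg:PREF-5",    "keywords": ["emoji", "tone"],                    "tier": "RECOMMENDED"},
]
dna = inject_dna(catalog, "Diagnose and fix the defect behind the flaky test", {})
```

With this bugfix-flavored TaskSpec, `cg:ANALYZE-6` scores on two of three keywords while `cg:PREF-5` scores zero relevance and is dropped entirely, which is the point: irrelevant controls are excluded, not merely ranked lower.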

This is what makes DNA injection different from profiles. Profiles (described in the governance model article) adjust parameters on controls — the same control behaves differently in deep-analysis mode versus quick-response mode. DNA injection adjusts which controls are present at all. A purpose-built agent for a test-writing task doesn't need PREF-5 (emoji restraint) or CONTENT-3 (conciseness in prose). Carrying those controls wouldn't cause a violation, but it would consume attention that should be spent on ANALYZE-1 (problem decomposition) and PROC-12 (best practice verification). In a context window, every irrelevant instruction dilutes every relevant one.

Quality Judgment

After every delegation, the governor evaluates the agent's output. This is non-delegable — quality judgment is a core governor responsibility, not something you can outsource to another agent without creating a recursive accountability problem.

Evaluation runs across four dimensions. Deliverable quality: did the output meet the stated deliverable? Is it complete, correct, and usable? DNA compliance: did the agent respect the governance controls it was given? Scope adherence: did the agent stay within the boundaries of its TaskSpec? Quality bar achievement: did the output meet the specific quality bar defined in the TaskSpec?

Scope adherence deserves its own emphasis. An agent that exceeds its scope — modifying things it wasn't asked to modify, making decisions it wasn't authorized to make — produces unreliable output regardless of how good the deliverable looks. If I ask an agent to fix a test and it also refactors the module under test, the refactor might be an improvement. It might also introduce regressions that no one asked for and no one verified. Scope discipline is what makes delegation auditable.

Four verdicts follow from evaluation. Accept when all dimensions score well — report the outcome if it matters, stay silent if it doesn't. Reject and re-delegate when the deliverable is incomplete but the TaskSpec was well-formed — modify the spec structurally and try again, rather than repeating the same instruction louder. Reject and decompose when the failure suggests the TaskSpec itself was flawed, too broad or too ambiguous — break it down differently.

Escalate to the human when the failure reveals a value judgment or architectural question the governor can't resolve. Present what was delegated, what came back, and why it's insufficient — synthesized, not raw agent output. The human should never have to read an agent's full response to understand why a delegation failed.

The structural distinction between re-delegate and decompose matters. When an agent produces incomplete output, the instinct is to add more detail to the same TaskSpec — make the constraints more explicit, add more context. That's the governance equivalent of speaking louder in the same language. If the TaskSpec was well-formed and the agent still failed, the fix is a structural change: different tools, different scope, different decomposition. If the TaskSpec was ambiguous, no amount of re-delegation will fix it. Break it down until the pieces are unambiguous.
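The four verdicts follow mechanically from the dimension scores plus two judgments the governor makes: whether the spec was sound, and whether the failure is a value question. A simplified sketch; the 0.8 threshold and the boolean inputs are invented for illustration:

```python
def verdict(scores, spec_was_wellformed, needs_value_judgment):
    # scores: deliverable, dna_compliance, scope_adherence, quality_bar, each in [0, 1]
    if needs_value_judgment:
        return "escalate"                 # synthesize for the human, don't dump raw output
    if all(s >= 0.8 for s in scores.values()):
        return "accept"
    if spec_was_wellformed:
        return "reject_and_redelegate"    # structural change: tools, scope, decomposition
    return "reject_and_decompose"         # the spec itself was too broad or ambiguous
```

The ordering encodes the section's argument: escalation is checked first because no score can resolve a value question, and the re-delegate/decompose split hinges entirely on whether the spec, not the agent, was at fault.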

The Optimization Loop

Every delegation verdict produces a trace: the TaskSpec, the verdict, the quality scores, and which controls were injected. These traces are the raw material for learning.

The pipeline has three stages. The TraceAnalyzer computes per-control acceptance rates — how often delegations succeed when a given control is present. The WeightLearner adjusts DNA selection weights based on those rates. The updated weights feed back into the DNA Injector, so controls that correlate with successful delegations for similar task categories get selected more often.

The loop is continuous: delegate, evaluate, trace, learn, delegate better.
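A minimal version of the TraceAnalyzer and WeightLearner stages, assuming a trace is a dict of injected control IDs plus an accept flag; the function names and the mean-relative update rule are illustrative, not the actual pipeline:

```python
from collections import defaultdict

def acceptance_rates(traces):
    # TraceAnalyzer: per-control acceptance rate over the traces it appears in.
    seen, accepted = defaultdict(int), defaultdict(int)
    for t in traces:
        for cid in t["controls"]:
            seen[cid] += 1
            accepted[cid] += 1 if t["accepted"] else 0
    return {cid: accepted[cid] / seen[cid] for cid in seen}

def learn_weights(traces):
    # WeightLearner: weight is each control's rate relative to the mean rate.
    rates = acceptance_rates(traces)
    if not rates:
        return {}
    mean = sum(rates.values()) / len(rates)
    if mean == 0:
        return {cid: 1.0 for cid in rates}
    # If every delegation is accepted, all rates equal the mean and every
    # weight stays at 1.0: no gradient without recorded failures.
    return {cid: r / mean for cid, r in rates.items()}
```

Note the degenerate case in the comment: a corpus of pure acceptances yields uniform weights of 1.0, which is exactly why rejection traces carry the learning signal.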

This is where the system earns its complexity. Without the optimization loop, DNA injection is just static filtering — a rule-based selector that someone wrote once and never updated. With it, the selection improves from use. A control that gets injected into research tasks and correlates with high acceptance rates will be selected more aggressively for future research tasks. A control that's consistently present in rejected delegations gets downweighted. The governance catalog doesn't change. The selection intelligence does.

I'll be direct about the current state. The optimizer has produced initial weights from roughly 20 traces, and they're currently uniform — 1.0 across all observed controls — because trace diversity is insufficient for differentiation. The system needs varied task categories and honest rejections, not just acceptances, to develop real gradient. If every delegation is accepted, the optimizer has no signal about which controls actually contributed versus which were just present.

The architecture is sound. The data to tune it is still accumulating. This is a system designed to improve over months, not one that shipped pre-optimized. The honest version of the optimization story right now is: the loop runs, the traces are collected, the weights are computed, but the weights haven't yet diverged from their starting values. That will change as the trace corpus grows and includes more failures. The system needs to be wrong — and to record being wrong — before it can learn to be right.

Beyond Personal AI

The delegation pattern is not specific to MÆI or to personal AI systems. Any organization deploying AI agents faces the same structural problem: as agents construct other agents, governance must travel with the delegation, not stay at the top. A centralized governance team that reviews every agent prompt doesn't scale. A system where every agent carries the full policy manual doesn't work either — attention dilution degrades performance faster than missing controls do.

The principles are the same ones governance has always required. Separation of powers: the governed entity is not the evaluator. Composability: controls are selected per context, not applied uniformly. Learning: the system improves from its own performance data, not from quarterly policy reviews.

The difference is that in AI agent systems, these principles need to be encoded as mechanisms — DNA injection, quality judgment, trace analysis — because the agents don't internalize governance culture the way human teams do. A human team can absorb a policy document, discuss it, and apply judgment about which parts matter in context. An agent needs the right controls in its context window, every time, selected by a system that gets better at selection through use. The delegation pattern is that system.