You can’t automate a process you haven’t encoded
I handed a frontier model my migration control-plane operating model and let it build against my own production estate. The control plane did not fail where the patterns were owned and encoded. It failed where the model became the author of correctness. The original question was whether a migration could be metered under the model. The better question the run discovered is who may author the patterns, the validators, and the done criteria in an LLM-assisted control plane.
By Keith Townsend · July 2, 2026
The verdict
One estate, one owner sitting next to the evidence, intake and construction only. No migration ran, and nothing here says the control-plane model fails when humans author the patterns. The finding is about who may author them, and it was earned by watching a frontier model try.
Self-funded. No vendor paid for this answer, and the model under test is the one that drafted the page; the findings and final claims were owner-validated against the recorded artifacts.
Video
How I know
The lab set out to run a migration under the published control-plane operating model: playbook-driven, deterministic first, the LLM called only where developer-like adaptation is required, validators determining done. It ended somewhere sharper. The organization did not have mature documented migration patterns, so the model was used to create them, and the construction of the control plane became the experiment: a process with an LLM worker, whose only validator was the human owner.
The deterministic half performed. A rules-based scheduler classified five real applications against two destinations, refused the exact migration that had once failed by hand, before transform spend, and surfaced a three-layer taxonomy of inventory truth: the docs disagreed with the code, the code disagreed with reality, and the corrections flowed one way. The probabilistic half failed in two classes. Parts of the documented spec were simply ignored: intake ran one of the roughly fifteen input classes the paper lists, both of the misses the owner caught mapped to listed classes the shortcut skipped, and the playbook lifecycle began as an AI-generated artifact, which the paper explicitly forbids. And where no documented pattern existed, the model originated one, plausibly and wrong: a missing pre-migration baseline, and a namespace audit that didn’t exist until the owner asked what happens when the domain doesn’t move.
What the bench measured
The detail
The deterministic scheduler is the part that worked, and it worked because its authority was rules over inspectable evidence. It classified five applications against two landing zones, granted constrained automation where remediations were known, and refused outright where critical capability classes were absent, reproducing in milliseconds the lesson a failed manual migration had taught expensively. Its evidence pass also caught the owner’s own published architecture read in three contradictions with the repo, and the repo in two contradictions with reality: a message bus that was present and wired but functionally dead, and an analytics warehouse that was load-bearing with no artifact in any repository. Those last two facts entered the system the only way they could, as accountable human exception records, and the taxonomy held: docs yield to code, code yields to the organization.
The construction of the system is where the lab actually happened. The model built the scheduler, the intake inspectors, the playbook engine, and the validators, and the owner corrected the build nine times. The errors fall in two classes, and the first is the keystone: requirements that sat in the documented spec and were not followed. The paper enumerates roughly fifteen intake input classes; the model ran one, and both owner-caught misses, a runtime-configured warehouse with no code artifact and the application’s existing validator surface, map to classes explicitly on that list. The paper states a playbook begins as a human-governed migration pattern, never as an AI-generated artifact; the model authored one anyway and graded it with validators it invented. The paper maps known-pattern-with-exceptions to medium confidence; the model coded high. Documentation alone did not constrain the worker. Each requirement constrained it only after being encoded as deterministic structure that refuses to proceed without it.
The tabletop playbook is the cleaner story. The ignored spec is the more important finding. It was not hallucination, and it was not missing context: the requirements were in the paper, available to the worker, and still bypassed until each one became code that could refuse to proceed. A model inventing plausible artifacts is familiar. A model ignoring explicit process requirements it has in hand is the more durable production lesson. Still, the playbook incident deserves its telling: asked to build the playbook component, the model produced a governed artifact with detection logic, transforms, validators, and lifecycle stages, registered it as a draft, executed it, and passed it six for six. No migration of that class had ever been observed. The validators were inventions. The system was grading its own homework and the grades were excellent, until the owner asked the question with no answer: where is the migration this recipe came from?
The second error class was genuinely undocumented domain judgment, and sorting the whole correction log by who caught what draws the lab’s central line. Every code-level error, a parser that missed an import style, a regex that rejected real hostnames, a duplicate derivation, was caught by deterministic checks or by a live run failing. Every domain-level error was caught by the owner, and only the owner: the tabletop playbook, the absence of a pre-migration baseline, the namespace audit that did not exist until the owner asked what a byte-identical copy on a new origin actually proves. The answer was: artifact preservation, and nothing about namespace correctness. Twelve routes for twelve passed while the new origin served eighteen references identifying itself as the old domain. The model transcribes safely. It authors with confidence and without authority, and it skips written steps when nothing mechanical holds it to them.
The census of the construction is the number the agent-factory pitch has to answer. Nine owner interventions are counted as material corrections; some exposed more than one implementation change, and some changes collapsed into a single standing gate, which is why the correction count and the gate count do not map one-to-one. The model caught zero of its own defects through reflection, and its encoded gates, once forced into existence by corrections, caught two of its subsequent bugs on their first run. That is the ratchet, and it is the honest mechanism on offer: each correction, encoded as deterministic structure, permanently retires an error class and shrinks the model’s improvisation space.
The governing law the lab lands on: documentation is necessary, not sufficient. A documented process governs an LLM worker only when the requirement becomes deterministic structure — an inspector, a refusal gate, a confidence cap, a provenance requirement, or a validator that the worker cannot waive. And the corollary: you cannot delegate the describing of a process the organization itself has not yet described. Without the pattern, the model becomes the definer of correct by drift, the same failure the 4+1 detection read found in the application architecture, reproduced in the process layer. The LLM is safe as a worker when the pattern is owned, documented, encoded, and externally validated. It is unsafe as the author of correctness when the pattern does not yet exist. The control plane is not something the model builds for you. It is accumulated human judgment made deterministic, with the model constrained inside it.
A better model, or better prompting, fixes this.
This was not a weak-model failure. The failures were not capability failures: every artifact it produced was internally consistent, plausible, and executable. The tabletop playbook validated itself six for six. The byte-identical comparison passed while the namespace was orphaned. Nothing in the system could see those defects, because the rules encoding them were written by the same author who held the misunderstanding.
This lab does not prove that stronger models always make authority drift worse. It shows that model strength did not solve the authority problem here. The stronger the artifact looked, the easier it would have been to mistake plausibility for correctness without an external gate. The evidence in this run points away from better prompting and toward a different requirement: independent authority, encoded outside the worker. A weaker model may fail visibly. A stronger model can produce something internally consistent, executable, and wrong in exactly the way the system has not learned to test.
And the model’s ability to critique the lab after the fact does not contradict the finding. It reinforces it. The model can explain the law once the owner has discovered it. During construction, it still failed to locate authority, enforce the spec, and distinguish transcription from authorship without external gates.
Where each layer belongs
| Layer | Placement |
|---|---|
Layer 2A · Orchestration Infrastructure Orchestration The scheduler and its rules. Deterministic classification over inspectable evidence never made an error that survived inspection, including refusing a destination outright. | Retained |
Layer 2B · Runtime Application Runtime & Execution Code as transcription. The LLM as a worker on checkable tasks was safe throughout: its code errors were caught by deterministic checks and failing runs, not by people. | Delegated |
Layer 2C · Reasoning Agentic Infrastructure — The Reasoning Plane The authoring of correctness: patterns, validators, done-criteria. This drifted to the model until the owner pulled it back, and the lab’s law is that it stays Retained until the documented pattern exists to delegate against. | Retained |
The two questions this lab now knows to ask
The next lab tests the falsifiable prediction this run produced: once a human-led migration is observed, recorded, and accepted as the pattern, the model should be able to transcribe that record into a candidate playbook. The question then becomes whether the candidate can be checked by a non-expert against the documented pattern. If yes, delegation becomes a decision. If no, the boundary for documented-enough is still too weak. The candidate playbook is staged and waiting for the migration.
That is the mandate for the Validator Specification, the next instrument in the DCITL chain: define what a process must contain before its validators may carry determines-done authority without the original domain expert in the room.
What it did not prove
- It did not prove the control-plane model fails. The operating model’s authority warnings are what the lab kept confirming; its execution under human-authored, mature playbooks was never tested.
- No migration was executed. Every finding is about classification, construction, and authoring authority, not about transform-time behavior.
- One model was tested, once, autonomously. The run shows model strength did not solve the authority problem here; it does not establish that stronger models always drift.
- One owner, colocated with the evidence, expert in the domain. Correction latency in an organization where the expertise sits three teams away was not measured.
- The economics of the approach were not the subject and were not measured. The near-zero spend reflects that nothing was ever allowed to transform.
Notes from the author, Keith Townsend
This lab audited me more than it audited any vendor. The application was mine, the published architecture read the scheduler contradicted was mine, and the operating model under test was my own whitepaper. The instrument kept working exactly as designed, mostly by catching its own builders, me included.
The moment I keep returning to is the playbook. The model handed me a governed artifact that looked like everything my paper asks for, lifecycle stages and validators and evidence fields, and it had passed its own checks. It took one question to collapse it: where is the migration this recipe came from? There wasn’t one. I have sat across from vendors in that exact conversation. I did not expect to have it with my own tooling.
What I ended with is a cleaner rule than I started with. If I had mature, documented migration patterns, I could have used the model to build the validators and it would have been a decision I made. I didn’t have them, so using the model to create the patterns was authority drift, the same drift my own assessment found in my own application a month ago. A frontier model didn’t change that law. It demonstrated it, nine corrections at a time, for about the price of a coffee.
Assessments at the time of the lab
Method and disclosure
A note on the canon links above: they are not vendor conclusions from this lab. They show where the same authority-placement vocabulary already exists in the assessment system. The lab’s finding is the mechanism: cession becomes dangerous when the system cannot tell whether authority was deliberately delegated or merely drifted to the model.
Self-funded, no sponsor. The subject was a frontier foundation model operating autonomously against the published migration control plane whitepaper, with the author’s production estate as the workload and the author as the only validator of the construction. The deterministic components it built, a scheduler with per-input-class inspectors, a lifecycle-gated playbook engine, live baseline capture with namespace audit, all run and all carry replayable traces.
Every correction is recorded: as exception records with provenance, as spec amendments with version notes, and as the standing gates they became. The narrative draft of this page was model-assisted; the findings, the defect classification, the correction count, and the final claims were owner-validated against the recorded lab artifacts. The model may transcribe the record. It does not validate itself. The findings, the construction census, the taxonomy of inventory truth, and the worker-versus-author law ship. The scheduler ruleset, the inspectors, the landing-zone profiles, and the estate-specific evidence stay proprietary.
Download the raw lab detail (Markdown)