The validator determines done, not the loop
Everyone is building agentic loops that let the model decide when it is done and retry until it is satisfied. I built a three-tier escalation on a DGX Spark, gated by a deterministic test harness, and metered every attempt. For the local model, the repair loop added nothing. Escalation gated by the evaluator took the pass rate from 25% to 75%. The authority that determines done is the test, not the model, and not the loop.
By Keith Townsend · July 1, 2026
The verdict
Measured on deterministic coding tasks, where a test harness gives an unfalsifiable pass/fail. The rulings are about where the "determines done" authority sits in an agentic loop. The open edge is domains without an executable evaluator.
Self-funded. No vendor paid for this answer, and the whole run cost about $4.60.
Video
How I know
The experiment reframes loop control away from who controls the loop and toward who determines done. The analogy is a support desk: an L1 analyst attempts the ticket, but a deterministic escalation rule, not the analyst’s self-assessment, decides whether it goes to L2. Here the L1 is a local model on a DGX Spark, the evaluator is a test harness, and the frontier model is the L2 it escalates to.
The evaluator is not a model. It is a deterministic test harness that runs the same tests against every tier’s output and returns the same verdict regardless of which model wrote the code. It cannot be persuaded, anchored, or biased. That property is what makes the escalation chain reliable, and it is the finding the rest of the numbers support.
What the bench measured
The detail
The experiment ran two phases on the same DGX Spark: a subjective RSS-triage phase that exposed failure patterns, and a coding phase with deterministic evaluation. The coding phase carries the findings, because the test harness gives an unfalsifiable pass/fail signal that a subjective gold key cannot. Bug-fix tasks were mined from real open-source repositories by git archaeology: roll the source back to before a fix, and use the fix’s own tests as the gate.
Repair loops are model-dependent, and that is the finding that reorders the rest. For a deterministic local model at temperature 0, the same input produces the same output, so error feedback changes the prompt text but not the answer. On near-miss tasks where six of seven tests passed, all four repair attempts produced byte-identical output: four times the latency, zero improvement. For a non-deterministic reasoning model the story inverts. The evaluator’s test output steers the model toward a genuinely different attempt, and one round recovered 67% of near-misses. The variable that decides whether a repair loop is worth running is not the quality of the feedback. It is whether the model can produce a different output given the same feedback.
Reviewing everything makes it worse. Every configuration that applied a review pass to all outputs degraded accuracy, whether the reviewer was the local model or a frontier governor. The failure was consistent: asked to review a decision, the model promotes correct low-confidence calls to confident wrong ones, because it asks what if this matters more often than what if it does not. A frontier reviewer shown the local model’s answer also anchored to it, confirming a call it would have made differently on a fresh look. Review earns its place only when a deterministic trigger says an output is worth challenging.
Tiered escalation is the architecture that pays. Local-only cleared 25% of the tasks. Local plus a repair loop cleared the same 25%. Three tiers, local then frontier then most-capable, gated by the evaluator at each step, cleared 75%. The gain did not come from iterating with the model that already failed. It came from handing the task to a more capable one the moment the evaluator rejected, from a clean workspace, with no prior-tier context to anchor it.
The cost model is the appeal. Each tier starts free on the local box and pays for a frontier call only when the evaluator rejects, so cost scales with the local model’s failure rate: about three cents a task at a 75% local pass rate, sixteen cents at the 25% this task set produced, and never more than frontier-only. The whole five-day run cost about $4.60 across 156 API calls. Two hardware notes carried the local tier: a mixture-of-experts model in FP8 ran at 48 tokens a second on the GB10 against 1 for a dense model, which is what makes a free local first attempt practical; and published token rates overstated the real frontier bill by four to five times, so paper cost projections are not the real economics.
Two things set the ceiling, and neither is the loop. Model capability came first: a mixture-of-experts model passed tasks that a larger dense model and a coding-specialized model failed across every mode, so architecture amplifies capability but cannot substitute for it. The evaluator came second: the test harness is the system’s judgment, and a task with a bad test was unsolvable by every tier regardless of capability. The lever is not prompt engineering. It is the quality of the deterministic check.
So the control point in an agentic loop is not the loop. It is the deterministic code around it: the evaluator that determines done and the policy that decides when to escalate. That is where the reasoning-plane authority sits, and it is a thing you own rather than an autonomy you grant the model. Deterministic Code In The Loop is not a constraint on the agent. It is the part of the system you can actually trust.
But self-correction loops are the whole point of agents.
They are the part that mostly does not pay. The band where a repair loop adds value, the model understands the fix but makes an implementation error a round of feedback can correct, was about 5% of tasks. Difficulty is bimodal: 42% of tasks were trivially easy and passed on the first attempt, 54% were too hard for a single pass and needed a more capable model, and only the thin band between them is where iteration helps. That is too narrow to build a production strategy around.
And for a deterministic model the loop pays nothing at all. On near-miss tasks where six of seven tests passed, all four repair attempts produced byte-identical output. The win is not the loop. It is the deterministic evaluator gating escalation to a more capable model, and that is a policy you own, not an autonomy you grant the model.
Where each layer belongs
| Layer | Placement |
|---|---|
Layer 0 · Compute Compute & Network Fabric Local first-attempt tier on the DGX Spark. A mixture-of-experts model in FP8 (48 tok/s vs 1 for a dense model) makes a free local attempt practical. | Retained |
Layer 2B · Runtime Application Runtime & Execution The test harness and the escalation orchestrator are deterministic code you run, not a model. | Retained |
Layer 2C · Reasoning Agentic Infrastructure — The Reasoning Plane The determines-done authority. It stays with the deterministic evaluator and the escalation policy. Frontier model capability is ceded and rented on demand, but the decision to escalate and the accept/reject verdict are never the model’s. | Retained |
The two questions this lab now knows to ask
At temperature 0 the local repair loop added nothing, and tiered escalation made higher temperatures unnecessary to test. Whether sampling breaks the convergence and reopens the local repair loop as a cheaper alternative to escalation is an unexplored axis.
The finding rests on a deterministic test harness. Coding has one. Triage, content generation, and planning do not: the evaluator signal becomes a weaker proxy, and the subjective phase of this experiment showed proxy-gated review is fragile. Whether determines-done authority can be made deterministic enough outside code is the open question, and it is where DCITL either generalizes or stops.
What it did not prove
- The task set is small. The culminating escalation ran on 8 tasks (26 calibrated). Good for the architecture verdict, not a benchmark of pass rates.
- Two of the failures were not capability findings. One task had a bad test that passes on buggy code; another was an edge-case blind spot shared across every model family.
- The third tier did not earn its place. The most-capable model solved nothing the middle tier failed, in this set. It is an architectural proof point, not a demonstrated advantage.
- Scale economics were not measured. A better-calibrated local model with a 50 to 70% pass rate would change the cost curve; this set produced 25%.
- The evaluator only checks what the tests cover. A correct fix that fails a bad test, and a wrong fix that passes thin tests, are both invisible to it.
Notes from the author, Keith Townsend
I started by asking where judgment should live in a local-first agentic system. The data made me change the question. It is not where judgment lives. It is who determines done, and the honest answer is that it should not be a model at all.
The throughline across these labs surprised me again. Lab one found that managed retrieval hides the chunking. Lab two found that managed fine-tune keeps the weights. This lab found that the agentic loop hands you an autonomy you should not want: the model deciding its own work is finished. Every layer offers to make a decision for you that you are better off keeping. The reasoning plane is no different, and the thing you keep is the deterministic check.
The part I keep coming back to is the temperature-0 result. Four repair attempts, byte-identical output, four times the latency, nothing gained. We have built an industry reflex around agents that loop until they are satisfied. On the hardware most people will actually run, for a model deciding its own done, that loop is theater. The evaluator is the only part that was ever doing the work.
Assessments at the time of the lab
Method and disclosure
Self-funded, no sponsor, run over five days on a single NVIDIA DGX Spark (GB10) with real open-source bugs and about $4.60 of frontier API spend. Local models ran on Ollama and vLLM; the frontier tiers were o3 and gpt-5.5.
Tasks were mined from real repositories (more-itertools, httpx, PyJWT, h2, and others) by git archaeology: roll the source back to before a fix, and use the fix’s own test suite as an unfalsifiable pass/fail gate. Each tier ran from a clean workspace with no prior-tier context, and the deterministic harness was the sole authority on accept, reject, and escalation.
The experiment code, the mode definitions, the per-run traces, and the failure-mode catalog are captured in the raw lab detail. The specific task set and the tuning particulars stay in the working notes.
Download the raw lab detail (Markdown)