Editorial lab

The validator determines done, not the loop

Everyone is building agentic loops that let the model decide when it is done and retry until it is satisfied. I built a three-tier escalation on a DGX Spark, gated by a deterministic test harness, and metered every attempt. For the local model, the repair loop added nothing. Escalation gated by the evaluator took the pass rate from 25% to 75%. The authority that determines done is the test, not the model, and not the loop.

By Keith Townsend · July 1, 2026

The call

The verdict

Measured on deterministic coding tasks, where a test harness gives an unfalsifiable pass/fail. The rulings are about where the "determines done" authority sits in an agentic loop. The open edge is domains without an executable evaluator.

Don’tlet the model decide when it is done. At temperature 0 the local model is deterministic: the same input gives the same output, so a repair loop retries its way to the identical wrong answer. Self-repair added 4x latency and zero accuracy.

Don’treview every output. Universal review degraded accuracy in every configuration tested. The reviewer promotes correct low-confidence calls to confident wrong ones. Gate review on a deterministic trigger, or skip it.

Domake a deterministic evaluator the authority. A test harness decides accept or reject, drives escalation, and supplies the error feedback. No model judges its own work or another model’s. That one change took the pass rate from 25% local-only to 75% with tiered escalation.

Doescalate instead of iterate, and match the attempt budget to the model. One attempt for a deterministic local model; two for a reasoning model, where the evaluator’s test output steers a genuinely different retry. The frontier tier recovered 67% of near-misses on the second attempt. Pay for the capable model only when the evaluator rejects.

Self-funded. No vendor paid for this answer, and the whole run cost about $4.60.

The walkthrough

Video

Video slot — sponsored labs fill this with a series.

The bench

How I know

The experiment reframes loop control away from who controls the loop and toward who determines done. The analogy is a support desk: an L1 analyst attempts the ticket, but a deterministic escalation rule, not the analyst’s self-assessment, decides whether it goes to L2. Here the L1 is a local model on a DGX Spark, the evaluator is a test harness, and the frontier model is the L2 it escalates to.

The evaluator is not a model. It is a deterministic test harness that runs the same tests against every tier’s output and returns the same verdict regardless of which model wrote the code. It cannot be persuaded, anchored, or biased. That property is what makes the escalation chain reliable, and it is the finding the rest of the numbers support.

What the bench measured

Pass rate, local model only

Gemma 4 26B-A4B on the DGX Spark, one attempt

25% (2/8)

Pass rate, local + repair loop

temperature 0: same input, same output, same failure

25% (no gain)

Pass rate, tiered escalation

local, then frontier only when the evaluator rejected

75% (6/8)

Repair-loop sweet spot

bimodal difficulty: 42% trivial, 54% too hard for one pass

~5% of tasks

Frontier second-attempt recovery

evaluator test output steers a non-deterministic reasoning model

67% of near-misses

Total run cost

~$0.16/task at a 25% local pass rate; pay for capability only on rejection

~$4.60 · 156 API calls

Local throughput, MoE FP8 vs dense

FP8 native on the GB10; makes the free tier practical

48 vs 1 tok/s

The detail

The experiment ran two phases on the same DGX Spark: a subjective RSS-triage phase that exposed failure patterns, and a coding phase with deterministic evaluation. The coding phase carries the findings, because the test harness gives an unfalsifiable pass/fail signal that a subjective gold key cannot. Bug-fix tasks were mined from real open-source repositories by git archaeology: roll the source back to before a fix, and use the fix’s own tests as the gate.

Repair loops are model-dependent, and that is the finding that reorders the rest. For a deterministic local model at temperature 0, the same input produces the same output, so error feedback changes the prompt text but not the answer. On near-miss tasks where six of seven tests passed, all four repair attempts produced byte-identical output: four times the latency, zero improvement. For a non-deterministic reasoning model the story inverts. The evaluator’s test output steers the model toward a genuinely different attempt, and one round recovered 67% of near-misses. The variable that decides whether a repair loop is worth running is not the quality of the feedback. It is whether the model can produce a different output given the same feedback.

Reviewing everything makes it worse. Every configuration that applied a review pass to all outputs degraded accuracy, whether the reviewer was the local model or a frontier governor. The failure was consistent: asked to review a decision, the model promotes correct low-confidence calls to confident wrong ones, because it asks what if this matters more often than what if it does not. A frontier reviewer shown the local model’s answer also anchored to it, confirming a call it would have made differently on a fresh look. Review earns its place only when a deterministic trigger says an output is worth challenging.

Tiered escalation is the architecture that pays. Local-only cleared 25% of the tasks. Local plus a repair loop cleared the same 25%. Three tiers, local then frontier then most-capable, gated by the evaluator at each step, cleared 75%. The gain did not come from iterating with the model that already failed. It came from handing the task to a more capable one the moment the evaluator rejected, from a clean workspace, with no prior-tier context to anchor it.

The cost model is the appeal. Each tier starts free on the local box and pays for a frontier call only when the evaluator rejects, so cost scales with the local model’s failure rate: about three cents a task at a 75% local pass rate, sixteen cents at the 25% this task set produced, and never more than frontier-only. The whole five-day run cost about $4.60 across 156 API calls. Two hardware notes carried the local tier: a mixture-of-experts model in FP8 ran at 48 tokens a second on the GB10 against 1 for a dense model, which is what makes a free local first attempt practical; and published token rates overstated the real frontier bill by four to five times, so paper cost projections are not the real economics.

Two things set the ceiling, and neither is the loop. Model capability came first: a mixture-of-experts model passed tasks that a larger dense model and a coding-specialized model failed across every mode, so architecture amplifies capability but cannot substitute for it. The evaluator came second: the test harness is the system’s judgment, and a task with a bad test was unsolvable by every tier regardless of capability. The lever is not prompt engineering. It is the quality of the deterministic check.

So the control point in an agentic loop is not the loop. It is the deterministic code around it: the evaluator that determines done and the policy that decides when to escalate. That is where the reasoning-plane authority sits, and it is a thing you own rather than an autonomy you grant the model. Deterministic Code In The Loop is not a constraint on the agent. It is the part of the system you can actually trust.

The obvious objection

But self-correction loops are the whole point of agents.

They are the part that mostly does not pay. The band where a repair loop adds value, the model understands the fix but makes an implementation error a round of feedback can correct, was about 5% of tasks. Difficulty is bimodal: 42% of tasks were trivially easy and passed on the first attempt, 54% were too hard for a single pass and needed a more capable model, and only the thin band between them is where iteration helps. That is too narrow to build a production strategy around.

And for a deterministic model the loop pays nothing at all. On near-miss tasks where six of seven tests passed, all four repair attempts produced byte-identical output. The win is not the loop. It is the deterministic evaluator gating escalation to a more capable model, and that is a policy you own, not an autonomy you grant the model.

Where it belongs

Where each layer belongs

Layer	Placement
Layer 0 · Compute Compute & Network Fabric Local first-attempt tier on the DGX Spark. A mixture-of-experts model in FP8 (48 tok/s vs 1 for a dense model) makes a free local attempt practical.	Retained
Layer 2B · Runtime Application Runtime & Execution The test harness and the escalation orchestrator are deterministic code you run, not a model.	Retained
Layer 2C · Reasoning Agentic Infrastructure — The Reasoning Plane The determines-done authority. It stays with the deterministic evaluator and the escalation policy. Frontier model capability is ceded and rented on demand, but the decision to escalate and the accept/reject verdict are never the model’s.	Retained

What it opened

The two questions this lab now knows to ask

Would a non-deterministic local model make the repair loop worth running?

At temperature 0 the local repair loop added nothing, and tiered escalation made higher temperatures unnecessary to test. Whether sampling breaks the convergence and reopens the local repair loop as a cheaper alternative to escalation is an unexplored axis.

Does this hold in domains without an executable evaluator?

The finding rests on a deterministic test harness. Coding has one. Triage, content generation, and planning do not: the evaluator signal becomes a weaker proxy, and the subjective phase of this experiment showed proxy-gated review is fragile. Whether determines-done authority can be made deterministic enough outside code is the open question, and it is where DCITL either generalizes or stops.

The bound

What it did not prove

The task set is small. The culminating escalation ran on 8 tasks (26 calibrated). Good for the architecture verdict, not a benchmark of pass rates.
Two of the failures were not capability findings. One task had a bad test that passes on buggy code; another was an edge-case blind spot shared across every model family.
The third tier did not earn its place. The most-capable model solved nothing the middle tier failed, in this set. It is an architectural proof point, not a demonstrated advantage.
Scale economics were not measured. A better-calibrated local model with a 50 to 70% pass rate would change the cost curve; this set produced 25%.
The evaluator only checks what the tests cover. A correct fix that fails a bad test, and a wrong fix that passes thin tests, are both invisible to it.

In the author’s words

Notes from the author, Keith Townsend

I started by asking where judgment should live in a local-first agentic system. The data made me change the question. It is not where judgment lives. It is who determines done, and the honest answer is that it should not be a model at all.

The throughline across these labs surprised me again. Lab one found that managed retrieval hides the chunking. Lab two found that managed fine-tune keeps the weights. This lab found that the agentic loop hands you an autonomy you should not want: the model deciding its own work is finished. Every layer offers to make a decision for you that you are better off keeping. The reasoning plane is no different, and the thing you keep is the deterministic check.

The part I keep coming back to is the temperature-0 result. Four repair attempts, byte-identical output, four times the latency, nothing gained. We have built an industry reflex around agents that loop until they are satisfied. On the hardware most people will actually run, for a model deciding its own done, that loop is theater. The evaluator is the only part that was ever doing the work.

Tied to the canon

Assessments at the time of the lab

NVIDIA AI PlatformLayer 0 · Compute

NVIDIA Strength — Silicon Authority · as assessed May 22, 2026 · current

NVIDIA AI PlatformLayer 2C · Reasoning

Runtime Governance Only — Not a Reasoning Plane · as assessed May 22, 2026 · current

How it was built

Method and disclosure

Self-funded, no sponsor, run over five days on a single NVIDIA DGX Spark (GB10) with real open-source bugs and about $4.60 of frontier API spend. Local models ran on Ollama and vLLM; the frontier tiers were o3 and gpt-5.5.

Tasks were mined from real repositories (more-itertools, httpx, PyJWT, h2, and others) by git archaeology: roll the source back to before a fix, and use the fix’s own test suite as an unfalsifiable pass/fail gate. Each tier ran from a clean workspace with no prior-tier context, and the deterministic harness was the sole authority on accept, reject, and escalation.

The experiment code, the mode definitions, the per-run traces, and the failure-mode catalog are captured in the raw lab detail. The specific task set and the tuning particulars stay in the working notes.

Download the raw lab detail (Markdown)