# The validator determines done, not the loop

> Editorial lab · Status: published  
> Published by The CTO Advisor LLC · Layer2C Labs

**Question:** Everyone is building agentic loops that let the model decide when it is done and retry until it is satisfied. I built a three-tier escalation on a DGX Spark, gated by a deterministic test harness, and metered every attempt. For the local model, the repair loop added nothing. Escalation gated by the evaluator took the pass rate from 25% to 75%. The authority that determines done is the test, not the model, and not the loop.

**Load:** Real bug-fix tasks mined from open-source repositories by git archaeology, each with the fix’s own test suite as an unfalsifiable pass/fail gate. Local generation on a DGX Spark, escalation to frontier models only when the evaluator rejected.

## Executive Summary

The experiment reframes loop control away from who controls the loop and toward who determines done. The analogy is a support desk: an L1 analyst attempts the ticket, but a deterministic escalation rule, not the analyst’s self-assessment, decides whether it goes to L2. Here the L1 is a local model on a DGX Spark, the evaluator is a test harness, and the frontier model is the L2 it escalates to.

The evaluator is not a model. It is a deterministic test harness that runs the same tests against every tier’s output and returns the same verdict regardless of which model wrote the code. It cannot be persuaded, anchored, or biased. That property is what makes the escalation chain reliable, and it is the finding the rest of the numbers support.

## DAPM Table — Authority Verdict

| Layer | Placement |
| --- | --- |
| layer0 | Retained |
| layer2b | Retained |
| layer2c | Retained |

## Detailed Writeup

The experiment ran two phases on the same DGX Spark: a subjective RSS-triage phase that exposed failure patterns, and a coding phase with deterministic evaluation. The coding phase carries the findings, because the test harness gives an unfalsifiable pass/fail signal that a subjective gold key cannot. Bug-fix tasks were mined from real open-source repositories by git archaeology: roll the source back to before a fix, and use the fix’s own tests as the gate.

Repair loops are model-dependent, and that is the finding that reorders the rest. For a deterministic local model at temperature 0, the same input produces the same output, so error feedback changes the prompt text but not the answer. On near-miss tasks where six of seven tests passed, all four repair attempts produced byte-identical output: four times the latency, zero improvement. For a non-deterministic reasoning model the story inverts. The evaluator’s test output steers the model toward a genuinely different attempt, and one round recovered 67% of near-misses. The variable that decides whether a repair loop is worth running is not the quality of the feedback. It is whether the model can produce a different output given the same feedback.

Reviewing everything makes it worse. Every configuration that applied a review pass to all outputs degraded accuracy, whether the reviewer was the local model or a frontier governor. The failure was consistent: asked to review a decision, the model promotes correct low-confidence calls to confident wrong ones, because it asks what if this matters more often than what if it does not. A frontier reviewer shown the local model’s answer also anchored to it, confirming a call it would have made differently on a fresh look. Review earns its place only when a deterministic trigger says an output is worth challenging.

Tiered escalation is the architecture that pays. Local-only cleared 25% of the tasks. Local plus a repair loop cleared the same 25%. Three tiers, local then frontier then most-capable, gated by the evaluator at each step, cleared 75%. The gain did not come from iterating with the model that already failed. It came from handing the task to a more capable one the moment the evaluator rejected, from a clean workspace, with no prior-tier context to anchor it.

The cost model is the appeal. Each tier starts free on the local box and pays for a frontier call only when the evaluator rejects, so cost scales with the local model’s failure rate: about three cents a task at a 75% local pass rate, sixteen cents at the 25% this task set produced, and never more than frontier-only. The whole five-day run cost about $4.60 across 156 API calls. Two hardware notes carried the local tier: a mixture-of-experts model in FP8 ran at 48 tokens a second on the GB10 against 1 for a dense model, which is what makes a free local first attempt practical; and published token rates overstated the real frontier bill by four to five times, so paper cost projections are not the real economics.

Two things set the ceiling, and neither is the loop. Model capability came first: a mixture-of-experts model passed tasks that a larger dense model and a coding-specialized model failed across every mode, so architecture amplifies capability but cannot substitute for it. The evaluator came second: the test harness is the system’s judgment, and a task with a bad test was unsolvable by every tier regardless of capability. The lever is not prompt engineering. It is the quality of the deterministic check.

So the control point in an agentic loop is not the loop. It is the deterministic code around it: the evaluator that determines done and the policy that decides when to escalate. That is where the reasoning-plane authority sits, and it is a thing you own rather than an autonomy you grant the model. Deterministic Code In The Loop is not a constraint on the agent. It is the part of the system you can actually trust.

## Assessments at the Time of the Lab

| Vendor | Layer | Grade | As assessed |
| --- | --- | --- | --- |
| NVIDIA AI Platform | Layer 0 · Compute | NVIDIA Strength — Silicon Authority | May 22, 2026 |
| NVIDIA AI Platform | Layer 2C · Reasoning | Runtime Governance Only — Not a Reasoning Plane | May 22, 2026 |

## Method and Disclosure

Self-funded, no sponsor, run over five days on a single NVIDIA DGX Spark (GB10) with real open-source bugs and about $4.60 of frontier API spend. Local models ran on Ollama and vLLM; the frontier tiers were o3 and gpt-5.5.

Tasks were mined from real repositories (more-itertools, httpx, PyJWT, h2, and others) by git archaeology: roll the source back to before a fix, and use the fix’s own test suite as an unfalsifiable pass/fail gate. Each tier ran from a clean workspace with no prior-tier context, and the deterministic harness was the sole authority on accept, reject, and escalation.

The experiment code, the mode definitions, the per-run traces, and the failure-mode catalog are captured in the raw lab detail. The specific task set and the tuning particulars stay in the working notes.

---
*Layer2C Labs · The CTO Advisor LLC · labs.layer2c.com*
