# Lab detail — Loop Control (who determines done in an agentic loop)

Raw lab detail for labs.layer2c.com/labs/loop-control. A five-day experiment (2026-05-13 to
2026-05-17) on a single NVIDIA DGX Spark, testing whether tiered model escalation gated by a
deterministic evaluator can match frontier-only accuracy at lower cost.

**What ships:** the architecture, the measured results, the cost model, the hardware findings, the
failure-mode catalog, the design recommendations. **What stays proprietary:** the specific task
set, the extracted tests, and the tuning particulars. Returns, not the algorithm.

## The question

Original: where should judgment live in a local-first agentic AI system? Refined after the data:
**who determines done, and can a deterministic evaluator gate model escalation without human
intervention?**

The reframe: move "loop control" away from *who controls the loop* and toward *who determines done*.
The support-desk analogy — an L1 analyst attempts a ticket, but a deterministic escalation rule (not
the analyst's self-assessment) routes it to L2 — applied to model inference.

## Setup

- **Hardware:** one NVIDIA DGX Spark (Grace Blackwell GB10, 119 GiB unified memory).
- **Local models tested:** nemotron-3-super (123B dense, ~1.8 tok/s, failed repo tasks on format
  violations); Gemma 4 31B dense (~3–4 tok/s, passed); Qwen 2.5 Coder 32B AWQ (~1 tok/s, failed);
  LLaMA 3.1 70B AWQ (~0.5 tok/s, timed out — hardware ceiling); Qwen3.6-35B-A3B FP8 MoE (~48 tok/s,
  fastest); **Gemma 4 26B-A4B MoE (~22 tok/s, primary local worker).**
- **Frontier tiers (OpenAI):** o3 (tier 2, ~$0.03/call billed), gpt-5.5 (tier 3, ~$0.40/call).
- **Tasks:** bug-fix tasks mined from real repos (more-itertools, httpx, PyJWT, h2, and others) by
  git archaeology — roll source back to pre-fix, use the fix's own tests as an unfalsifiable gate.

## The architecture (three-tier escalation, modeE)

```
Tier 1: Local model (free, 1 attempt) -> run tests -> pass? done
                                                       fail? v
Tier 2: Frontier model (2 attempts, error feedback) -> run tests -> pass? done
                                                                     fail? v
Tier 3: Most-capable model (1 attempt) -> run tests -> done
```

Each tier starts from a clean workspace (original buggy code), no prior-tier context. The evaluator
(test harness) is the sole decision authority between tiers. Attempt budget matches the model: 1 for
a deterministic local model (feedback can't change a temperature-0 output), 2 for a reasoning model
(error output steers a genuinely different retry).

## Measured results

| Metric | Value |
|---|---|
| Local-only pass rate | 25% (2/8) |
| Local + repair loop | 25% (no gain) |
| Three-tier escalation | **75% (6/8)** |
| Repair-loop sweet spot | ~5% of tasks |
| Task difficulty (26 calibrated) | 42% trivial · ~4% medium · 54% too hard for one pass |
| Frontier (o3) second-attempt recovery | 67% of near-misses |
| Total run cost | ~$4.60 across 156 API calls |
| modeE cost at 25% local pass | ~$0.16/task |
| Local throughput, MoE FP8 vs dense | 48 vs 1 tok/s |

modeE cost scales with local failure rate: 100% local pass → $0.00; 75% → ~$0.03; 25% (observed) →
~$0.16; 0% → ~$0.14 (same as frontier-only).

## Key findings

1. **Repair loops are model-dependent.** Deterministic model (temp 0): same input, same output, zero
   value — near-miss tasks produced byte-identical output across 4 attempts. Reasoning model (o3):
   one round of the evaluator's test output recovers 67% of near-misses. The variable is whether the
   model can produce a *different* output given the same feedback.
2. **Tiered escalation is the dominant architecture.** 25% (local) → 25% (local + repair) → 75%
   (three-tier). The gain is adding capability on rejection, not iterating with a failed model.
3. **The evaluator is the authority.** The deterministic test harness provides the escalation signal,
   the accept/reject verdict, and the error feedback. No model judges its own or another's work.
4. **Universal review degrades accuracy.** Every all-outputs review pass made things worse (over-
   saving bias; governor anchoring — a frontier reviewer confirmed a local call it would have made
   differently on a fresh look). Gate review on a deterministic trigger, or skip it.
5. **Cost scales with difficulty; capability > architecture.** Pay for a frontier call only on
   rejection. A MoE model passed tasks a larger dense model and a coding-specialized model failed.
6. **Token-rate estimates overstate real o3 spend 4–5×** (~$0.14/call estimated vs ~$0.03 billed).

## Failure-mode catalog

| Failure mode | Cause | Fix / workaround |
|---|---|---|
| Over-saving bias | Review promotes ignores to saves | Gate review with deterministic triggers |
| Governor anchoring | Reviewer biased by seeing the local output | Present the task fresh, not as a review |
| Format violations (nemotron) | Model puts explanation in the files dict | Switch models (Gemma 4) |
| Deterministic convergence | Same input = same output at temp 0 | Drop repair loops; escalate instead |
| Frontier empty-files | Model returns valid JSON, no code (~25% of calls) | No reliable fix found |
| Wrong Python interpreter | System python lacks pytest | Use the project venv |
| Stale bytecode across tiers | `__pycache__` from a failed tier survives | Clear it at escalation |
| Bad test in extraction | Feature test passes on buggy code | Manual review of extracted tests |

## Design recommendations

- **Two-tier escalation with deterministic gating.** Tier 3 didn't add value in this set.
- **One attempt for deterministic local models; two for reasoning-model tiers.**
- **MoE models for the local tier on edge hardware** (FP8 on Blackwell is the throughput driver).
- **Invest in evaluator quality, not prompt engineering.** The test harness is the system's
  judgment; a bad test makes a task unsolvable regardless of model capability. Evaluator reliability
  is the ceiling.

## What did NOT get settled (honesty)

- Small task set (8 in the culminating run). Architecture verdict, not a pass-rate benchmark.
- Two failures were a bad test and a shared edge-case blind spot, not capability findings.
- Tier 3 added nothing here (proof point, not demonstrated advantage).
- Temperature >0 in the local loop untested; scale economics (a 50–70% local model) not measured.
- Cross-domain: the finding rests on an executable evaluator. Domains without one (triage, content,
  planning) have a weaker signal, not demonstrated here.
