# Lab detail — Fine-tune economics (Spark vs Bedrock custom models)

Raw lab detail for labs.layer2c.com/labs/fine-tune-economics. This is the substrate engineering
and the measured economics.

**What ships:** the cost structure, the decision table, the access-friction facts, the
architecture pattern, and the measured numbers. **What stays proprietary:** the corpus contents,
the classification taxonomy, the trained judges and adapters, the thresholds, and the specific
analytical conclusions drawn from the data. Returns, not algorithms.

## The question

When a workload needs a custom (fine-tuned) model, not a base model, what does it cost to train
and serve it, on an owned DGX Spark versus AWS custom-model products, and where does the owned box
win? Gating sub-question, a kill-criterion: does fine-tuning even earn its place over base model
plus retrieval?

## Setup

- **Data plane:** held constant on the lab-one AWS S3 Vectors substrate (Titan Text Embeddings v2,
  1024-dim cosine index, hybrid label-filtered retrieval). Retrieval is identical across every run;
  only the generator changes, so the generator is the only variable.
- **Local box:** NVIDIA DGX Spark (GB10), 128 GB unified memory.
- **Base model:** Meta Llama 3.1 8B Instruct. Bedrock-fine-tunable, Custom-Model-Import-eligible,
  small enough for a practical local LoRA.
- **Quality gate:** a cheap strict judge plus one frontier judge, trusted where they agree, over a
  frozen 30-question set. Judge internals proprietary.

## The runs (cost-disciplined order)

1. **Base-model control (kill-criterion).** Bedrock base and local Spark base, same prompts. If
   base + RAG clears the accepted-output bar, fine-tuning earns nothing and the lab reports that.
2. **Spark fine-tune + local serve.** LoRA on the box, served with vLLM.
3. **Spark-tuned weights → Bedrock Custom Model Import.**
4. **Bedrock native fine-tune + Provisioned Throughput** (the priciest path, last).

## Measured numbers

### Quality (accepted-output rate, two-judge gate, N=30)

| Generator | Accepted | Note |
|---|---|---|
| Spark base (Llama 3.1 8B) | 23% | grounding failures dominate |
| Bedrock base (same weights) | 47% | serving-path differences |
| Spark fine-tuned (LoRA) | 70% | +47 pts, 3x; training loss 0.34 → 0.22 |
| CMI (Spark-tuned weights) | ~70% | same weights, by construction |
| Bedrock native FT | ~70% (inferred) | serving quota-walled, not measured |

Caveat: N=30 with judge disagreement present, so error bars are wide. Good for go/no-go, not for
fine quality claims.

### Speed (single-stream, same RAG harness)

- Llama 3.1 8B (dense): **14 tok/s**
- Gemma 4 26B-A4B (MoE, ~4B active): **22.4 tok/s (1.6x)**
- CMI (cloud GPUs): **95 tok/s (6.8x the Spark)**

### Serving economics (cost per 1M output tokens, custom 70% behavior)

| Path | Serving | $/1M tok | Accessible |
|---|---|---|---|
| Bedrock base commodity | per-token | ~$0.40 | yes, base model only |
| Spark tuned | flat amortized (~$160–275/mo) | $5.4 ÷ U (Llama) / $3.4 ÷ U (Gemma) | yes, own the box |
| CMI (tuned) | active window, scale-to-zero | $27.5 | yes, self-serve |
| Bedrock native FT + PT | continuous Provisioned Throughput | ~$20.8 ÷ duty | no: quota-walled + no weight export |

Crossover, Spark beats CMI per-token: **~20% utilization (Llama) → ~12% (Gemma, faster).** Below
that, CMI's scale-to-zero wins. The Spark's 14–22 tok/s is also a hard capacity/latency ceiling
above which the cloud wins on capability at any price.

### Fine-tune

- Local LoRA: **~35 minutes on the box, $0 marginal.** Training loss 0.34 → 0.22.
- Adapter and merged weights owned and portable.

## The gotchas (the bench-only findings)

### Weight export — the one-way door

Bedrock native fine-tune has three compounding traps: (1) serving is quota-walled — a
no-commitment Provisioned Throughput endpoint returns 0 model units and a support case; (2) it is
billed continuously, no scale-to-zero; (3) the job output is metrics only, with no weight download.
Data goes in, no model comes out. There is no "train on Bedrock, serve on CMI" — the paths do not
bridge. The real axis is ownership of the weights, not price.

### Fine-tune dependency cascade (newest model)

Fine-tuning the latest architecture hit a four-wall cascade: the released `transformers` cannot
load it → one trainer breaks on the next `transformers` major (`HybridCache`) → another needs
`torchao` past what the stable box's `torch` supports. The model is **servable** on the box and
**not fine-tunable** on the stable stack without bespoke, bleeding-edge tooling. The blocker is the
abstraction lag, not the hardware.

### Access friction (the availability constraint)

- **GCP:** GPU quota 0 across projects; a self-serve request "approved" in ~60 seconds but
  soft-denied (1 committed unit, wrong lane, "contact your sales rep"); H200/B200 reservation-only;
  on-demand tiers cap below the model's memory footprint.
- **AWS:** Capacity Blocks impose a ~$1,000 one-day minimum.
- **Neocloud:** a request placed to route around the hyperscalers was still cold ~36 hours later.
- Scarce accelerated capacity is procured through relationships, not forms.

## The economics unit

Compare by **cost per token** (or per accepted answer), not $/hour — the tiers run at very
different speeds, so $/hour is meaningless. CMI's 6.8x speed makes its per-token cost competitive
despite a higher $/hour. The Spark-vs-cloud crossover is a **utilization** figure, not an hours
figure.

## The architecture pattern

The Spark is the **proxy that exposes the constraint**, not the production tier. On the box, the
constraint is token latency; a bigger system bends that curve but moves the constraint to
**availability** (quota, reservations, provisioned throughput, capacity blocks, support cases).
Availability is its own axis now: fast, cheap, high-confidence, controlled, and available — pick
the constraints you are willing to own.

The throughline across the labs: every managed layer trades a specific control for its convenience.
Managed RAG hides the chunking knob (lab 1). Managed fine-tune keeps the weights (lab 2). Managed
hosting dictates the supported shapes (CMI). The open stack lags the newest architecture. The
fallback for anything outside a managed layer's supported shape is always the primitive: your own
box, or raw compute you build on yourself.

## What did NOT get measured (honesty)

- Tuned-Gemma quality (inferred from saturation, not judged — blocked by the dependency cascade).
- Bedrock native fine-tune quality (serving quota-walled).
- Clean CMI cold-start latency (the model warmed between attempts).
- Whether CMI definitively accepts this model class (the supported-architecture list is a moving
  patchwork).
