# Own the weights, or the platform owns you

> Editorial lab · Status: published  
> Published by The CTO Advisor LLC · Layer2C Labs

**Question:** Lab one found the Spark loses to the cloud on inference. That verdict held only for commodity base models. The moment you need a custom model you own, the cloud stops selling tokens and starts renting you floors, and the managed path takes something you cannot get back: the weights. Throughout, the box means the compute, serving, and training you keep below the platform’s abstraction instead of ceding them, and the NVIDIA DGX Spark is where this lab draws that line.

**Load:** A retrieval workload that needs a custom, fine-tuned model, served on an owned DGX Spark versus AWS custom-model products, with the lab-one AWS S3 Vectors data plane held constant.

## Executive Summary

Lab one left a question open. The Spark loses to commodity cloud inference, but does that hold when the workload needs a custom model? It does not. Commodity per-token pricing is a property of base models, not of the cloud. For a model you fine-tune, the cloud offers only floors, and against a floor the owned box changes the math.

The lab set out to move that line outward, lifting the stack to a GB300-class cluster, and pivoted twice. The access walls and a memory-axis analysis killed the scale-up, and the genuinely unanswered question turned out to be the economics of serving a custom model. The richest results came from the friction, not the compute: what the managed fine-tune keeps, what the newest model cannot be trained on, and how scarce accelerators are actually procured.

## DAPM Table — Authority Verdict

| Layer | Placement |
| --- | --- |
| layer0 | Retained |
| layer2b | Retained |
| layer2c | Retained |

## Detailed Writeup

The reversal is precise. Lab one’s "the Spark does not earn its keep" was measured against commodity per-token pricing, and that pricing is a property of base models, not of the cloud. Fine-tune the model and the token price is gone. What remains are floors: Provisioned Throughput billed by the hour, continuously, or Custom Model Import billed by the active window. Against a floor, a flat monthly box wins on cost above a low utilization line. The Spark never lost to the cloud. It lost to commodity tokens, and custom weights are exactly where commodity tokens do not exist.
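The floor-versus-box comparison reduces to a break-even calculation, sketched below. This is a minimal illustration, not the lab's cost model: the function names are hypothetical, and the dollar figures in the usage note are placeholders chosen for round arithmetic, not prices from the lab or from any AWS rate card.

```python
HOURS_IN_MONTH = 730.0  # average month; an assumption of this sketch


def break_even_hours(box_monthly_cost: float, floor_hourly_rate: float) -> float:
    """Active hours per month at which an hourly floor costs as much as the flat box."""
    if floor_hourly_rate <= 0:
        raise ValueError("floor_hourly_rate must be positive")
    return box_monthly_cost / floor_hourly_rate


def break_even_utilization(box_monthly_cost: float, floor_hourly_rate: float) -> float:
    """The same break-even expressed as a fraction of the month."""
    return break_even_hours(box_monthly_cost, floor_hourly_rate) / HOURS_IN_MONTH


def cheaper_option(
    active_hours: float,
    box_monthly_cost: float,
    floor_hourly_rate: float,
    continuous: bool = False,
) -> str:
    """Return "box" or "cloud" for one month of serving.

    continuous=True models a floor billed for every hour of the month
    regardless of traffic; continuous=False models a floor billed only
    for the active window.
    """
    billed_hours = HOURS_IN_MONTH if continuous else active_hours
    cloud_cost = billed_hours * floor_hourly_rate
    return "box" if box_monthly_cost < cloud_cost else "cloud"
```

With placeholder inputs of a 200-unit monthly box and a 5-unit hourly floor, `break_even_hours(200.0, 5.0)` is 40 hours, a little over 5% of the month. Below that line the active-window floor is cheaper; above it, or under continuous billing at any traffic level, the box wins.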

Fine-tuning earned its place, which was not a given. Base model plus retrieval scored 23 to 47% accepted output on the frozen question set, and the failures were grounding, not formatting, so few-shot and constrained decoding would not have closed them. A free local LoRA, thirty-five minutes on the box, reached 70%, beating the base model served on Bedrock. The kill criterion the lab was built to respect ("if base plus retrieval clears the bar, do not fine-tune") did not fire. The custom model was worth building.

The managed fine-tune is a one-way door, and that is the authority finding. Bedrock native fine-tune has three compounding traps. Serving is quota-walled: a no-commitment endpoint returns zero model units and a support case. It is billed continuously with no scale-to-zero. And the job output contains metrics only, with no way to download the weights. Data goes in and no model comes out. There is no "train on the managed platform, serve on your own import" route, because the paths do not bridge. The real axis is not price. It is ownership of the weights, and the managed path keeps them.

Capability was not the differentiator. Quality saturates at the task ceiling (a 4B model ties a 14B on a bounded task), so a fine-tuned Gemma is not meaningfully more capable than a fine-tuned Llama for this work. Speed is the axis instead. A 26B mixture-of-experts with roughly 4B active parameters runs 1.6 times faster than a dense 8B on the box, because single-stream decode on bandwidth-bound hardware is set by active parameters, not total, and that advantage is independent of fine-tuning. The self-host optimization is the fastest model that clears the task ceiling, not the biggest.
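The active-parameter argument can be put as a back-of-envelope bound. The sketch below assumes single-stream decode is purely memory-bandwidth-bound, so each token costs one read of the active weights; it ignores KV-cache traffic, expert routing, and attention cost. The function names and the bandwidth and precision values are illustrative assumptions, not measurements from the lab.

```python
def decode_tokens_per_second(
    active_params_billion: float,
    bytes_per_param: float,
    bandwidth_gb_per_s: float,
) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate.

    Each generated token reads every active parameter once, so the bound is
    memory bandwidth divided by the bytes of active weights.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    if gb_read_per_token <= 0:
        raise ValueError("active weights must have positive size")
    return bandwidth_gb_per_s / gb_read_per_token


def ideal_speedup(dense_active_billion: float, moe_active_billion: float) -> float:
    """Ratio of the two bounds at equal precision; bandwidth cancels out."""
    return dense_active_billion / moe_active_billion


# Placeholder hardware figures for illustration only.
moe_bound = decode_tokens_per_second(4.0, 1.0, 200.0)    # ~4B active parameters
dense_bound = decode_tokens_per_second(8.0, 1.0, 200.0)  # dense 8B
```

Under these assumptions the ideal ratio for roughly 4B active against a dense 8B is 2.0, so the measured 1.6 times sits below the ceiling, as expected once the costs this sketch ignores are included.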

The newest model exposed a moving dependency frontier. Fine-tuning the latest architecture hit a four-wall cascade: the released transformers library could not load it, one trainer broke on the next transformers major version, another needed a torchao past what the stable box could hold. The model is servable on the box and not fine-tunable on the stable stack without bleeding-edge tooling. The blocker is not the hardware. It is the abstraction lag, the managed and open frameworks trailing the newest weights.

The access reality is the constraint the whole industry runs into. Renting a GB300-class node was not a form, it was a relationship. GPU quota sat at zero across every project, a self-serve request was approved in sixty seconds and delivered a soft-deny into the wrong lane, capacity blocks imposed a thousand-dollar one-day minimum, and on the hyperscaler, enough memory and on-demand availability did not coexist. Real access came through account escalation. A neocloud request placed to route around it was still cold a day and a half later. Scarce accelerated capacity is procured through relationships, not forms.

So the box is not the production answer. It is the proxy that exposes the constraint. On the box, the constraint is token latency, and better hardware bends that curve. But it does not remove the tradeoff, it moves it: solve latency with a bigger system and the constraint becomes availability, the same wall the cloud exposes through quota, reservations, and support cases. The old "fast, cheap, high quality: pick two" no longer fits, because availability is now its own axis. The Layer2C version is "fast, cheap, high-confidence, controlled, and available: pick the constraints you are willing to own." Every path has one. The lab shows the floor. The harder question is which constraint you choose to keep.

## Assessments at the Time of the Lab

| Vendor | Layer | Grade | As assessed |
| --- | --- | --- | --- |
| AWS AI Infrastructure | Layer 2B · Runtime | Delegated / Retained | June 29, 2026 |
| AWS AI Infrastructure | Layer 2C · Reasoning | Intelligence 2C: Delegated \| Infra 2C: Implicit | June 29, 2026 |
| NVIDIA AI Platform | Layer 0 · Compute | NVIDIA Strength — Silicon Authority | May 22, 2026 |
| NVIDIA AI Platform | Layer 2B · Runtime | NVIDIA Authority — Inference + Agent Runtime | May 22, 2026 |
| NVIDIA AI Platform | Layer 2C · Reasoning | Runtime Governance Only — Not a Reasoning Plane | May 22, 2026 |

## Method and Disclosure

Self-funded, no sponsor, free to mix competitors and to tell you not to buy something. The custom-serving economics compare AWS and NVIDIA paths against an owned box, and every path is held to the same measurement.

The quality gate was a frozen 30-question set judged by a cheap strict judge plus one frontier judge, trusted where they agree. Fine-tuning was a local LoRA on the DGX Spark; serving was measured locally with vLLM, on Bedrock Custom Model Import, and against Bedrock native fine-tune pricing. The data plane was held constant on the lab-one AWS S3 Vectors substrate.
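The "trusted where they agree" rule can be sketched as a small tally. This is one possible reading, not the lab's scoring code: the `Verdict` type, the field names, and the choice to report disputed questions as a low-to-high band are all assumptions, and the acceptance bar is left as a parameter because the source does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    question_id: str
    strict_accepts: bool    # the cheap strict judge
    frontier_accepts: bool  # the frontier judge


def summarize(verdicts: list[Verdict]) -> dict[str, float]:
    """Tally agreement and bound the accepted rate over a frozen question set."""
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to summarize")
    both_accept = sum(v.strict_accepts and v.frontier_accepts for v in verdicts)
    both_reject = sum(
        not v.strict_accepts and not v.frontier_accepts for v in verdicts
    )
    disputed = total - both_accept - both_reject
    return {
        "agreement_rate": (both_accept + both_reject) / total,
        # Only answers both judges accept count toward the trusted score.
        "accepted_low": both_accept / total,
        # Counting every disputed answer as accepted gives the optimistic bound.
        "accepted_high": (both_accept + disputed) / total,
    }


def should_fine_tune(accepted_high: float, bar: float) -> bool:
    """Kill criterion: fine-tune only if even the optimistic score misses the bar."""
    return accepted_high < bar
```

Reporting a band rather than a single number keeps judge disagreement visible instead of averaging it away, and testing the kill criterion against the optimistic bound makes the decision to fine-tune the conservative one.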

The cost structure, the decision table, the access-friction facts, and the architecture pattern ship. The corpus, the taxonomy, the trained judges and adapters, and the specific analytical conclusions stay proprietary.

---
*Layer2C Labs · The CTO Advisor LLC · labs.layer2c.com*