Own the weights, or the platform owns you
Lab one found the Spark loses to the cloud on inference. That verdict held only for commodity base models. The moment you need a custom model you own, the cloud stops selling tokens and starts renting you floors, and the managed path takes something you cannot get back: the weights. Throughout, the box means the compute, serving, and training you keep below the platform’s abstraction instead of ceding them, and the NVIDIA DGX Spark is where this lab draws that line.
By Keith Townsend · July 1, 2026
The verdict
The economics are arithmetic once you know the floor, and they sit below as evidence. These are the rulings: where authority goes when you customize, and what each managed layer takes in exchange for convenience. One caveat, not a hedge: the box that wins here is a contained, air-cooled, plug-in system. This is not a case for on-prem at scale, where cooling, power, and facilities re-enter the math and this lab did not go.
Self-funded. No vendor paid for this answer.
Video
How I know
Lab one left a question open. The Spark loses to commodity cloud inference, but does that hold when the workload needs a custom model? It does not. Commodity per-token pricing is a property of base models, not of the cloud. For a model you fine-tune, the cloud offers only floors, and against a floor the owned box changes the math.
The lab set out to move that line outward, lifting the stack to a GB300-class cluster, and pivoted twice. The access walls and a memory-axis analysis killed the scale-up, and the genuinely unanswered question turned out to be the economics of serving a custom model. The richest results came from the friction, not the compute: what the managed fine-tune keeps, what the newest model cannot be trained on, and how scarce accelerators are actually procured.
What the bench measured
The detail
The reversal is precise. Lab one’s "the Spark does not earn its keep" was measured against commodity per-token pricing, and that pricing is a property of base models, not of the cloud. Fine-tune the model and the token price is gone. What remains are floors: Provisioned Throughput billed by the hour, continuously, or Custom Model Import billed by the active window. Against a floor, a flat monthly box wins on cost above a low utilization line. The Spark never lost to the cloud. It lost to commodity tokens, and custom weights are exactly where commodity tokens do not exist.
Fine-tuning earned its place, which was not a given. Base model plus retrieval scored 23 to 47% accepted output on the frozen question set, and the failures were grounding, not formatting, so few-shot and constrained decoding would not have closed them. A free local LoRA, thirty-five minutes on the box, reached 70%, beating the base model served on Bedrock. The kill-criterion the lab was built to respect, if base plus retrieval clears the bar then do not fine-tune, did not fire. The custom model was worth building.
The managed fine-tune is a one-way door, and that is the authority finding. Bedrock native fine-tune has three compounding traps: serving is quota-walled, a no-commitment endpoint returns zero model units and a support case; it is billed continuously with no scale-to-zero; and the job output contains metrics only, with no way to download the weights. Data goes in and no model comes out. There is no train on the managed platform, serve on your own import, because the paths do not bridge. The real axis is not price. It is ownership of the weights, and the managed path keeps them.
Capability was not the differentiator. Quality saturates at the task ceiling, a 4B model ties a 14B on a bounded task, so a fine-tuned Gemma is not meaningfully more capable than a fine-tuned Llama for this work. Speed is the axis instead. A 26B mixture-of-experts with roughly 4B active parameters runs 1.6 times faster than a dense 8B on the box, because single-stream decode on bandwidth-bound hardware is set by active parameters, not total, and that advantage is independent of fine-tuning. The self-host optimization is the fastest model that clears the task ceiling, not the biggest.
The newest model exposed a moving dependency frontier. Fine-tuning the latest architecture hit a four-wall cascade: the released transformers library could not load it, one trainer broke on the next transformers major version, another needed a torchao past what the stable box could hold. The model is servable on the box and not fine-tunable on the stable stack without bleeding-edge tooling. The blocker is not the hardware. It is the abstraction lag, the managed and open frameworks trailing the newest weights.
The access reality is the constraint the whole industry runs into. Renting a GB300-class node was not a form, it was a relationship. GPU quota sat at zero across every project, a self-serve request approved in sixty seconds and delivered a soft-deny into the wrong lane, capacity blocks imposed a thousand-dollar one-day minimum, and on the hyperscaler enough memory and on-demand did not coexist. Real access came through account escalation. A neocloud request placed to route around it was still cold a day and a half later. Scarce accelerated capacity is procured through relationships, not forms.
So the box is not the production answer. It is the proxy that exposes the constraint. On the box, the constraint is token latency, and better hardware bends that curve. But it does not remove the tradeoff, it moves it: solve latency with a bigger system and the constraint becomes availability, the same wall the cloud exposes through quota, reservations, and support cases. The old fast, cheap, high quality, pick two no longer fits, because availability is now its own axis. The Layer2C version is fast, cheap, high-confidence, controlled, and available, pick the constraints you are willing to own. Every path has one. The lab shows the floor. The harder question is which constraint you choose to keep.
But the cloud is cheaper.
It is, for base models. Commodity per-token pricing, about forty cents per million output tokens, exists only for a model you did not change. The moment you need custom behavior, that price disappears. The cloud will not sell you cheap tokens on weights you own. It rents you a floor instead.
The floors are the whole story. Custom Model Import bills active windows and scales to zero, about $27.50 per million tokens for this behavior. Native fine-tune serving is Provisioned Throughput, billed continuously. Against a floor, the owned box at a flat $160 to $275 a month wins custom serving above roughly 12 to 20% utilization, lower for the faster mixture-of-experts. Below that, the cloud’s scale-to-zero wins. It is a utilization line, not an hours line.
Where each layer belongs
| Layer | Placement |
|---|---|
Layer 0 · Compute Compute & Network Fabric Compute on the Spark. Scale it up and the constraint moves from throughput to availability. | Retained |
Layer 2B · Runtime Application Runtime & Execution Serving kept local for control and flat cost. The cloud alternatives are Custom Model Import (ceded, scale-to-zero) and Provisioned Throughput (ceded, continuous, quota-walled). | Retained |
Layer 2C · Reasoning Agentic Infrastructure — The Reasoning Plane The weights. A local fine-tune keeps them; a Bedrock native fine-tune cannot export them, so paying to shape the model still cedes it. | Retained |
The two questions this lab now knows to ask
The scale-up hit the same wall everywhere: quota at zero, reservations, support cases. A neocloud request placed to route around the hyperscalers was still cold thirty-six hours later. Whether GPU-native providers remove the availability constraint or just move it to a different commercial surface is the open question, and it is the Fourth Cloud question.
The Spark is a contained best case: air-cooled, plug-in, cheap flat amortization, no facility cost. Scale on-prem and the math changes at every tier. An air-cooled multi-GPU server adds power, heat, and integration; a liquid-cooled rack adds facilities, capital, and specialized operations. Whether owned compute still beats the cloud floor at those tiers is unmeasured, and it is a lab in itself.
What it did not prove
- It did not measure on-prem at scale. The Spark is a contained best case; the owned-box win does not automatically survive the jump to air-cooled multi-GPU or liquid-cooled rack systems, which carry power, cooling, and facility costs this lab did not test.
- Tuned-Gemma quality is inferred from saturation, not judged. The fine-tune dependency cascade blocked the measurement.
- Bedrock native fine-tune quality was not measured. Serving was quota-walled behind a support case.
- Clean Custom Model Import cold-start latency was not isolated. The model warmed between attempts.
- Whether Custom Model Import definitively accepts this model class is unsettled. The supported-architecture list is a moving patchwork.
- Sample size is small, thirty questions on a two-judge gate. Good for go or no-go, not for fine quality claims.
Notes from the author, Keith Townsend
I set out to scale the box up. The lab refused to be that. The access walls and the memory math killed the scale-up on the first day, and the question that was actually unanswered was the one about custom models. I let the lab pivot rather than force the plan. That is what a lab is for.
The throughline between the two labs is the thing I did not expect. Lab one found that managed retrieval hides the chunking decision. Lab two found that managed fine-tune keeps the weights. Managed hosting dictates the supported shapes. The open stack lags the newest architecture. Every managed layer trades one specific control for its convenience, one layer deeper each time, and the fallback for anything outside its supported shape is always the primitive: your own box, or raw compute you build on yourself.
The economics were the easy part. Once you know the floor, the math is arithmetic: fixed local cost, active-window cloud cost, continuous provisioned cost, and the utilization where each one crosses. The hard part is architectural. Every path runs out of road somewhere. Commodity APIs run out when confidence fails. Managed fine-tunes run out when ownership matters. The box runs out when latency matters. Bigger systems run out when availability and capital matter. The question is not who wins. It is which constraint you choose to own.
Assessments at the time of the lab
Method and disclosure
Self-funded, no sponsor, free to mix competitors and to tell you not to buy something. The custom-serving economics compare AWS and NVIDIA paths against an owned box, and every path is held to the same measurement.
The quality gate was a frozen 30-question set judged by a cheap strict judge plus one frontier judge, trusted where they agree. Fine-tuning was a local LoRA on the DGX Spark; serving was measured locally with vLLM, on Bedrock Custom Model Import, and against Bedrock native fine-tune pricing. The data plane was held constant on the lab-one AWS S3 Vectors substrate.
The cost structure, the decision table, the access-friction facts, and the architecture pattern ship. The corpus, the taxonomy, the trained judges and adapters, and the specific analytical conclusions stay proprietary.
Download the raw lab detail (Markdown)