Editorial lab

Borrow the vendor’s plumbing, not its judgment

I built a retrieval pipeline across a public-cloud data plane and a local box to map where authority actually sits across the 4+1 stack. The economics were the boring part: eighty-four cents, the cloud faster. The finding worth keeping is what the managed path quietly decides for you, and the two questions the lab now knows to ask.

By Keith Townsend · June 27, 2026

The call

The verdict

The cost verdict is commodity knowledge, and it sits below as evidence. These are the rulings that change how the stack gets scored: where authority sits, and what you cede without noticing.

Don’t cede the chunking. The managed RAG wrapper re-chunks on its own strategy and turns on parsing you did not ask for, which would have broken label-filtered retrieval. That decision determines what you can retrieve, and it is yours.

Do borrow the vendor’s plumbing. AWS’s strength is the composable primitives underneath the managed wrapper, not the wrapper. The wrapper is convenience; the primitives are the capability. Score AWS on the primitives.

Do own Layer 2C as the control structure. The reasoning plane that yields a true answer is the gates, the judges, and the reproducibility around the model, not the weights. Cede the model and you still own the harder half.

Do place the box by its authority, not its benchmarks. The Spark keeps real local execution authority; its place is development and portability, not the production inference tier or the cost win.

Self-funded. No vendor paid for this answer.

The bench

How I know

I paired the NVIDIA DGX Spark with Amazon S3 and S3 Vectors as the data plane, in the AWS-managed configuration. I validated on a 5,000-segment subset, then ran the full 13-year corpus, roughly 237K segments across 5,815 videos. The verdict is earned, not projected.

The data plane is cheap, fast, and better than expected. The full corpus embeds and indexes for about $0.84 one-time in roughly 41 minutes, and S3 Vectors turned out to do hybrid retrieval, semantic similarity plus metadata filtering in one call, which is what made label-scoped retrieval work. Retrieval was never the constraint. And the architecture itself held: the 4+1 seams fell where the model predicted, and the public-cloud-to-edge I/O across them was never the bottleneck.
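A minimal sketch of what "semantic similarity plus metadata filtering in one call" can look like, assuming boto3-style `bedrock-runtime` and `s3vectors` clients are passed in by the caller. The bucket name, index name, the `labels` metadata key, and the `$eq` filter shape are illustrative assumptions, not the lab's actual configuration; check the filter syntax against the S3 Vectors documentation before relying on it.

```python
import json
from typing import Any


def embed_text(bedrock_runtime: Any, text: str, dimensions: int = 1024) -> list[float]:
    """Embed one string with Titan Text Embeddings v2 (client is caller-supplied)."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True}),
    )
    return json.loads(response["body"].read())["embedding"]


def build_query(bucket: str, index: str, vector: list[float], label: str, top_k: int) -> dict:
    """Build one request that carries both the query vector and the label filter."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "queryVector": {"float32": vector},
        "topK": top_k,
        # Assumption: segments were stored with a filterable "labels" metadata key.
        "filter": {"labels": {"$eq": label}},
        "returnMetadata": True,
        "returnDistance": True,
    }


def label_scoped_search(
    s3vectors: Any,
    bedrock_runtime: Any,
    bucket: str,
    index: str,
    question: str,
    label: str,
    top_k: int = 10,
) -> list[dict]:
    """Embed the question, then run similarity search restricted to one label."""
    vector = embed_text(bedrock_runtime, question)
    response = s3vectors.query_vectors(**build_query(bucket, index, vector, label, top_k))
    return response["vectors"]
```

Passing the clients in, rather than constructing them inside the functions, keeps the request shape inspectable and lets the sketch run against stubs without AWS credentials.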

What the bench measured

Metric	Result	Note
Corpus	~237K segments · 5,815 videos	2013 to 2026; ran the full corpus, not a sample
Ingest throughput	337K segments/hour	full corpus indexed in ~41 min, 0 failures
Retrieval latency (from Spark)	p50 280ms · p95 505ms	network-bound; S3 Vectors service ~100ms
Generation, local (Gemma 4 26B-A4B)	~20-23 tok/s	MoE; beats a dense 8B on speed and capability
Same model on Bedrock	136-214 tok/s	7 to 9x faster than local
Setup cost (embed + index)	$0.84 one-time	plus ~$0.07/mo to keep the index
Inference cost (full analysis pass)	~$0.20	the entire 13-year generation run, at cloud rates

The detail

The clean architecture is the one where the vendor boundary respects the layer boundary. Embedding is a Layer 1B function, data preparation, even though it runs a model on a GPU. So the whole data plane goes to AWS, and NVIDIA keeps Layer 0 and Layer 2. Each vendor owns whole layers, authority places cleanly, nothing is split. Embedding locally would fracture Layer 1 for no gain, since the corpus is public.

A wrinkle at the bench taught the sharpest plumbing lesson. The fully managed path, Bedrock Knowledge Bases, could not be created through the API even from an administrator account, so the same Titan-into-S3-Vectors substrate was self-orchestrated with direct calls. The real loss in the managed path was not orchestration; it was the chunking knob. Managed re-chunks with its own strategy and turns on parsing you did not ask for. Bring-your-own kept one segment as one vector with its labels attached, and that is what made hybrid label-filtered retrieval possible. Borrow the vendor’s judgment on plumbing. Keep your own on chunking.
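The bring-your-own path described above can be sketched as follows: each segment becomes exactly one vector record, with its labels carried as metadata, and no re-chunking in between. The `Segment` fields, the metadata keys, and the batch size are hypothetical; the `s3vectors` client is assumed to be a boto3-style client supplied by the caller, and the batch size should be checked against the service's per-call limit.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class Segment:
    """One transcript segment, already embedded (hypothetical shape)."""
    segment_id: str
    video_id: str
    labels: list[str]
    embedding: list[float]


def to_vector_record(segment: Segment) -> dict:
    """Map one segment to one vector record; the segment is never split or merged."""
    return {
        "key": segment.segment_id,
        "data": {"float32": segment.embedding},
        # Labels stay attached so they can be used as a retrieval filter later.
        "metadata": {"video_id": segment.video_id, "labels": segment.labels},
    }


def batched(records: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def ingest(s3vectors: Any, bucket: str, index: str, segments: list[Segment], batch_size: int = 100) -> int:
    """Write every segment as its own vector and return the number written."""
    written = 0
    for batch in batched([to_vector_record(s) for s in segments], batch_size):
        s3vectors.put_vectors(vectorBucketName=bucket, indexName=index, vectors=batch)
        written += len(batch)
    return written
```

The point of the design is that the chunk boundary is decided before the vendor sees the data: whatever goes in as one segment comes back as one hit.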

Then the numbers reordered the assumptions. The data plane is cheap and fast. Retrieval was never the problem; the latency seen from the Spark was its own network path, not S3 Vectors. The constraint is local generation.

The architecture itself passed, which is the result the economics can obscure. The 4+1 seams fell where the model said they would. Splitting the stack with the data plane in the public cloud and the reasoning at the edge is an operable partition, not just a diagram: the I/O across that seam, retrieved context down to the box and queries up, was never the bottleneck. The constraint lived inside Layer 2, in local generation throughput, not at the boundary between layers. The designed partition holds.

The box, by contrast, did real work, and that is the fair way to say it. It built and ran the whole pipeline end to end. What it isn’t, at this size, is a match for the public cloud on price for performance. The same model runs 7 to 9 times faster on Bedrock for less money, and the project’s entire local generation would have cost about twenty cents in the cloud. So the GB10 is not the production inference tier. But the box doing real work, and the architecture holding, are the results that travel. The same containers and serving stack are meant to lift unchanged to a bigger platform, so the honest close is not that the box failed. It is that this box, at this size, is not price-competitive against commodity per-token pricing, which exists only for base models. The next lab takes the custom-model regime, where that price disappears and AWS charges a floor instead. One model-class note worth keeping: a mixture-of-experts model with ~4B active parameters beat a dense 8B on both speed and capability on this hardware, because single-stream decode is memory-bandwidth-bound and active parameters are what move.
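The memory-bandwidth argument for the speed half of that note can be sketched with a back-of-envelope ceiling: if every decoded token must read each active weight once, tokens per second cannot exceed bandwidth divided by the bytes of active weights. The bandwidth figure and the 16-bit weight size below are illustrative assumptions, not lab measurements, and the model ignores KV-cache reads and routing overhead.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound box."""
    # Billions of active parameters times bytes per parameter gives GB read per token.
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 273.0  # assumption: illustrative memory bandwidth, not measured in this lab
BYTES_PER_PARAM = 2.0   # assumption: 16-bit weights; the lab's precision is not stated

moe_ceiling = decode_ceiling_tok_per_s(BANDWIDTH_GB_S, 4.0, BYTES_PER_PARAM)    # ~4B active (MoE)
dense_ceiling = decode_ceiling_tok_per_s(BANDWIDTH_GB_S, 8.0, BYTES_PER_PARAM)  # dense 8B

# The ratio depends only on active parameters, not on the assumed bandwidth or precision.
speed_ratio = moe_ceiling / dense_ceiling
print(f"MoE ceiling is {speed_ratio:.1f}x the dense 8B ceiling")
```

Under these assumptions the ratio is what matters: halving the active parameters doubles the ceiling, whatever the box's actual bandwidth. Measured throughput will sit below the ceiling, and the sketch says nothing about the capability half of the comparison.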

There is a second axis the throughput numbers hide. Faster is not better. The mixture-of-experts model won single-stream speed, but a dense Gemma 4 31B reasoned better over the same data, and stepping up to foundation models reasoned better still. That moves the economics from cost-per-token to value-per-answer: a more capable model costs more per token and can still be the cheaper choice in net, when its output carries more business value than the marginal cost. Cheap and fast minimizes the token bill. It does not maximize the worth of the answer. So the open question is not only how cheap the tokens are, but how much capability they buy, measured in value rather than throughput.
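One way to write the value-per-answer comparison down, as a sketch only. The symbols are assumptions introduced here: $V$ is the business value of a model's answer, $c$ its price per token, and $n$ the tokens it spends on that answer.

```latex
% Prefer the more capable model A over the cheaper model B when its net value is higher:
\[
  V_A - c_A\, n_A \;>\; V_B - c_B\, n_B
  \quad\Longleftrightarrow\quad
  V_A - V_B \;>\; c_A\, n_A - c_B\, n_B .
\]
```

Read left to right, the second form says the extra value of the better answer has to exceed the extra token bill. Minimizing $c\,n$ alone optimizes only the right-hand side.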

The transferable lesson runs past the substrate. Generating LLM findings is free and easy. Generating true ones is not. It takes a quantitative gate, cross-run reproducibility, and a strong judge, and judge strength dominates: a mid-tier judge will rubber-stamp a confidently wrong pattern that a strong one refutes. The fix is not the most expensive judge. A cheap strict judge plus one frontier judge, trusted where they agree, did the work. Most naive discoveries did not survive that gate.
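The three-part gate above can be sketched as a single predicate. This is not the lab's implementation, which stays proprietary; the `Finding` fields and both thresholds are hypothetical placeholders chosen only to make the structure concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A candidate LLM-generated finding plus the evidence gathered about it."""
    claim: str
    effect_size: float            # input to the quantitative gate
    runs_reproduced: int          # runs in which the finding reappeared
    runs_total: int
    strict_judge_accepts: bool    # verdict from the cheap, strict judge
    frontier_judge_accepts: bool  # verdict from the frontier judge


def survives(finding: Finding, min_effect: float = 0.2, min_repro_rate: float = 0.8) -> bool:
    """Keep a finding only if it clears the gate, reproduces, and both judges accept it."""
    passes_gate = finding.effect_size >= min_effect
    reproducible = (
        finding.runs_total > 0
        and finding.runs_reproduced / finding.runs_total >= min_repro_rate
    )
    judges_agree = finding.strict_judge_accepts and finding.frontier_judge_accepts
    return passes_gate and reproducible and judges_agree


confident_but_wrong = Finding("pattern X", 0.5, 5, 5, strict_judge_accepts=True, frontier_judge_accepts=False)
print(survives(confident_but_wrong))  # the frontier judge refutes it, so it is dropped
```

Requiring agreement rather than a majority is the design choice that matters here: one strong refutation is enough to drop a finding, which is how a confidently wrong pattern gets caught.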

The obvious objection

But what about utilization?

The on-prem case is to keep the box busy so the capex pays off. I ran that math. Pegged at 100% for three years, the Spark generates roughly 1.9 billion tokens, which spreads the $4,699 to about $2.50 per million tokens, plus around $0.40 in power. Call it ~$2.90 per million, fully utilized. The same model on Bedrock is about $0.40 per million. So even maxed out for three years straight, the owned box costs roughly seven times more per token.

Utilization is the wrong lever. The box is memory-bandwidth-bound at about 20 tokens a second, so it produces too few tokens an hour for the capex to ever spread thin enough to win. And to hold 70 to 90% you would need a constant firehose of batch work feeding a box that loses even when fed. You cannot util your way past a throughput ceiling.
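The utilization math can be reproduced in a few lines, using only the figures quoted above: $4,699 of capex, about 20 tokens a second, three years, roughly $0.40 per million tokens in power, and about $0.40 per million on Bedrock. The function names are hypothetical, and treating power as a flat per-token cost is a simplifying assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tokens_generated(tok_per_s: float, years: float, utilization: float) -> float:
    """Total tokens produced over the period at a given utilization (0.0 to 1.0)."""
    return tok_per_s * SECONDS_PER_YEAR * years * utilization


def owned_cost_per_million(
    capex_usd: float,
    tok_per_s: float,
    years: float,
    utilization: float,
    power_usd_per_million: float,
) -> float:
    """Capex spread over every token generated, plus power, in USD per million tokens."""
    millions = tokens_generated(tok_per_s, years, utilization) / 1e6
    return capex_usd / millions + power_usd_per_million


CLOUD_USD_PER_MILLION = 0.40  # Bedrock price for the same model, as quoted above

for utilization in (1.0, 0.9, 0.7):
    owned = owned_cost_per_million(4699, 20, 3, utilization, 0.40)
    print(f"{utilization:.0%} utilized: ${owned:.2f}/M tokens, "
          f"{owned / CLOUD_USD_PER_MILLION:.1f}x the cloud price")
```

At 100% the function lands on the same roughly $2.90 per million and roughly sevenfold gap as the prose, and lowering utilization only widens it, which is the throughput-ceiling argument in numeric form.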

Where it belongs

Where each layer belongs

Layer	Function	Note	Placement
Layer 0 · Compute	Compute & Network Fabric	Compute stays on the Spark.	Retained
Layer 1A · Storage	Data Storage & Governance	AWS owns the data plane: Titan embeddings into S3 Vectors.	Ceded
Layer 1B · Retrieval	Context Management & Retrieval	Embedding and retrieval are managed by AWS, so Layer 1 stays whole on one vendor.	Ceded
Layer 2B · Runtime	Application Runtime & Execution	Serving stays on the Spark as built, though cloud runs the same model faster and cheaper.	Retained
Layer 2C · Reasoning	Agentic Infrastructure: The Reasoning Plane	Reasoning stays on the Spark.	Retained

What it opened

The two questions this lab now knows to ask

What does a specialist data platform add over the hyperscaler’s native Layer 1?

The AWS-native data plane was cheap and did hybrid retrieval in one call. So where does VAST Data, or any performance or portable data platform, earn its keep over S3 Vectors, on latency, on portability, on retained governance? I did not consider that a question worth asking until the bench showed how good the native floor was. Now there is a measured floor to test it against.

When does a custom model change the economics?

The cost verdict was measured against commodity per-token pricing, which exists only for base models. The moment a workload needs a fine-tuned model, AWS stops selling tokens and charges a floor instead. Where the owned box wins in that regime is the next lab.

The bound

What it did not prove

It did not prove on-prem inference never pays. It proved commodity base-model generation, on this box, for this workload, loses to the cloud on cost and speed.
It did not settle custom or fine-tuned model economics. That is the next lab.
It did not prove managed RAG is bad. It proved the wrapper can hide control points, chunking above all.
It did not test capability against the frontier. An 8B-class model is a cost-structure vehicle, not a frontier contender.

In the author’s words

Notes from the author, Keith Townsend

Most infrastructure assessments are built from briefings. A vendor walks you through a deck, you read the datasheet, you grade the box. I don’t work that way. I put the claims in a lab and watch where the authority actually sits.

This one built a retrieval-augmented generation (RAG) pipeline over thirteen years of transcripts, on a DGX Spark sitting next to AWS. It cost about a dollar to run. It changed how the 4+1 instrument scores one of the biggest clouds, and it changed it because I hit the wall myself.

The managed service, Bedrock Knowledge Bases, looked like the product. When I went to stand it up, the control plane was gated, so I rebuilt the same data plane on the primitives underneath: Titan embeddings, an S3 Vectors index, a query. It worked. That’s the finding. The managed wrapper is a convenience, not a capability, and an architect who self-orchestrates keeps the retrieval decisions instead of inheriting them. The instrument says so now.

The lab also caught me being wrong. My first notes said a frontier model wasn’t available on one cloud. It was. I’d failed to access it, which is a different problem, and I corrected the record instead of shipping the mistake. That’s the point of a lab. It tells you you’re wrong before a reader has to.

An analyst grades the deck. I grade the system after I’ve run it. The instrument is harder to argue with for exactly that reason.

Tied to the canon

Assessments at the time of the lab

AWS AI Infrastructure · Layer 1A · Storage

Delegated · as assessed June 20, 2026 · current

AWS AI Infrastructure · Layer 1B · Retrieval

Delegated · as assessed June 20, 2026 · current

NVIDIA AI Platform · Layer 0 · Compute

NVIDIA Strength — Silicon Authority · as assessed May 22, 2026 · current

NVIDIA AI Platform · Layer 2B · Runtime

NVIDIA Authority — Inference + Agent Runtime · as assessed May 22, 2026 · current

NVIDIA AI Platform · Layer 2C · Reasoning

Runtime Governance Only — Not a Reasoning Plane · as assessed May 22, 2026 · current

How it was built

Method and disclosure

Self-funded, with no sponsor, so the lab is free to mix competitors. That cross-vendor mix is the editorial signature: nobody here is selling you one box. Every lab, editorial or sponsored, is held to the same method and the same editorial control; sponsored labs simply center on the sponsor’s architecture.

Validated on a 5,000-segment subset, then run across the full corpus plus a 2026 refresh. Generation used Gemma 4 26B-A4B via vLLM on the DGX Spark, chosen over a benchmarked Llama 3.1 8B and a dense Gemma 4 31B. The data plane used Bedrock Titan Text Embeddings v2 into a 1024-dimension cosine S3 Vectors index, with labels kept filterable for hybrid retrieval.

Substrate engineering and the corpus findings are captured as working lab notes, not shipped. The classification taxonomy, the tagged corpus, the trained judges, and the specific conclusions stay proprietary. What ships is the pattern and the substrate verdict, enough to recognize and abstract.
