# Borrow the vendor’s plumbing, not its judgment

> Editorial lab · Status: published  
> Published by The CTO Advisor LLC · Layer2C Labs

**Question:** I built a retrieval pipeline across a public-cloud data plane and a local box to map where authority actually sits across the 4+1 stack. The economics were the boring part: eighty-four cents, the cloud faster. The finding worth keeping is what the managed path quietly decides for you, and the two questions the lab now knows to ask.

**Load:** ~237K transcript segments across 5,815 videos, 2013 to 2026, embedded and queried in batch.

## Executive Summary

I paired the NVIDIA DGX Spark with Amazon S3 and S3 Vectors as the data plane, in the AWS-managed configuration. Validated on a 5,000-segment subset, then run across the full 13-year corpus, roughly 237K segments and 5,815 videos. The verdict is earned, not projected.

The data plane is cheap, fast, and better than expected. The full corpus embeds and indexes for about $0.84 one-time in roughly 41 minutes, and S3 Vectors turned out to do hybrid retrieval, semantic similarity plus metadata filtering in one call, which is what made label-scoped retrieval work. Retrieval was never the constraint. And the architecture itself held: the 4+1 seams fell where the model predicted, and the public-cloud-to-edge I/O across them was never the bottleneck.

## DAPM Table — Authority Verdict

| Layer | Placement |
| --- | --- |
| layer0 | Retained |
| layer1a | Ceded |
| layer1b | Ceded |
| layer2b | Retained |
| layer2c | Retained |

## Seam Map — Readiness

| Function | Readiness |
| --- | --- |
| Embed | needs-cloud |
| Index (store) | ready |
| Batch query | ready |
| Interactive query | conditional |
| Generate | conditional |

## Detailed Writeup

The clean architecture is the one where the vendor boundary respects the layer boundary. Embedding is a Layer 1B function, data preparation, even though it runs a model on a GPU. So the whole data plane goes to AWS, and NVIDIA keeps Layer 0 and Layer 2. Each vendor owns whole layers, authority places cleanly, nothing is split. Embedding locally would fracture Layer 1 for no gain, since the corpus is public.

A wrinkle at the bench taught the sharpest plumbing lesson. The fully managed path, Bedrock Knowledge Bases, would not create through the API even for an administrator account, so the same Titan-into-S3-Vectors substrate was self-orchestrated with direct calls. The real loss in the managed path was not orchestration, it was the chunking knob: managed re-chunks with its own strategy and turns on parsing you did not ask for. Bring-your-own kept one segment as one vector with its labels attached, and that is what made hybrid label-filtered retrieval possible. Borrow the vendor’s judgment on plumbing. Keep your own on chunking.

Then the numbers reordered the assumptions. The data plane is cheap and fast. Retrieval was never the problem; the latency seen from the Spark was its own network path, not S3 Vectors. The constraint is local generation.

The architecture itself passed, which is the result the economics can obscure. The 4+1 seams fell where the model said they would. Splitting the stack with the data plane in the public cloud and the reasoning at the edge is an operable partition, not just a diagram: the I/O across that seam, retrieved context down to the box and queries up, was never the bottleneck. The constraint lived inside Layer 2, in local generation throughput, not at the boundary between layers. The designed partition holds.

The box, by contrast, did real work, and that is the fair way to say it. It built and ran the whole pipeline end to end. What it isn’t, at this size, is a match for the public cloud on price for performance. The same model runs 7 to 9 times faster on Bedrock for less money, and the project’s entire local generation would have cost about twenty cents in the cloud. So the GB10 is not the production inference tier. But the box doing real work, and the architecture holding, are the results that travel. The same containers and serving stack are meant to lift unchanged to a bigger platform, so the honest close is not that the box failed. It is that this box, at this size, is not price-competitive against commodity per-token pricing, which exists only for base models. The next lab takes the custom-model regime, where that price disappears and AWS charges a floor instead. One model-class note worth keeping: a mixture-of-experts model with ~4B active parameters beat a dense 8B on both speed and capability on this hardware, because single-stream decode is memory-bandwidth-bound and active parameters are what move.

There is a second axis the throughput numbers hide. Faster is not better. The mixture-of-experts model won single-stream speed, but a dense Gemma 4 31B reasoned better over the same data, and stepping up to foundation models reasoned better still. That moves the economics from cost-per-token to value-per-answer: a more capable model costs more per token and can still be the cheaper choice in net, when its output carries more business value than the marginal cost. Cheap and fast minimizes the token bill. It does not maximize the worth of the answer. So the open question is not only how cheap the tokens are, but how much capability, measured in value, not throughput.

The transferable lesson runs past the substrate. Generating LLM findings is free and easy. Generating true ones is not. It takes a quantitative gate, cross-run reproducibility, and a strong judge, and judge strength dominates: a mid-tier judge will rubber-stamp a confidently wrong pattern that a strong one refutes. The fix is not the most expensive judge. A cheap strict judge plus one frontier judge, trusted where they agree, did the work. Most naive discoveries did not survive that gate.

## How It Abstracts

The shape generalizes. A Layer 1 data service paired with a Layer 2 reasoning surface, testing where the seam falls. Swap the data plane or the reasoning surface and the motion repeats.

So does the verdict. At these corpus sizes, owned-edge inference rarely beats cloud on cost or speed, because cloud inference is already cheap. The case for the local box is air-gap, residency, or as a development instrument that ports to the cluster, not edge-inference economics. The workload lifts to any large-corpus, latency-insensitive task: legal discovery, support ticket mining, research-paper analysis, log and event mining.

## Assessments at the Time of the Lab

| Vendor | Layer | Grade | As assessed |
| --- | --- | --- | --- |
| AWS AI Infrastructure | Layer 1A · Storage | Delegated | June 20, 2026 |
| AWS AI Infrastructure | Layer 1B · Retrieval | Delegated | June 20, 2026 |
| NVIDIA AI Platform | Layer 0 · Compute | NVIDIA Strength — Silicon Authority | May 22, 2026 |
| NVIDIA AI Platform | Layer 2B · Runtime | NVIDIA Authority — Inference + Agent Runtime | May 22, 2026 |
| NVIDIA AI Platform | Layer 2C · Reasoning | Runtime Governance Only — Not a Reasoning Plane | May 22, 2026 |

## Method and Disclosure

Self-funded, with no sponsor, so the lab is free to mix competitors. That cross-vendor mix is the editorial signature, nobody here is selling you one box. Every lab, editorial or sponsored, is held to the same method and the same editorial control; sponsored labs simply center on the sponsor’s architecture.

Validated on a 5,000-segment subset, then run across the full corpus plus a 2026 refresh. Generation used Gemma 4 26B-A4B via vLLM on the DGX Spark, chosen over a benchmarked Llama 3.1 8B and a dense Gemma 4 31B. The data plane used Bedrock Titan Text Embeddings v2 into a 1024-dimension cosine S3 Vectors index, with labels kept filterable for hybrid retrieval.

Substrate engineering and the corpus findings are captured as working lab notes, not shipped. The classification taxonomy, the tagged corpus, the trained judges, and the specific conclusions stay proprietary. What ships is the pattern and the substrate verdict, enough to recognize and abstract.

---
*Layer2C Labs · The CTO Advisor LLC · labs.layer2c.com*