# Spark + S3 Vectors — Raw Lab Detail

Layer2C Labs · The CTO Advisor · https://labs.layer2c.com/labs/spark-s3vectors

This is the substrate-level record of the lab: what was stood up, how it was measured, the
numbers, and the gotchas. It is the engineering, not the analysis. **The dataset's contents,
the classification taxonomy, the tagged corpus, the trained classifiers and judges, the
validation thresholds, and the specific findings are proprietary and are not included here.**

---

## The substrate

- **Data plane:** Amazon Bedrock Titan Text Embeddings v2 (1024-dim, cosine) into Amazon S3
  Vectors. AWS-managed.
- **Compute and reasoning:** NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory,
  ~119 GB usable, ~273 GB/s bandwidth). Generation served with vLLM.
- **Corpus:** ~237,000 pre-segmented units across 5,815 videos, 2013 to 2026. One ~500-character
  segment maps to one vector. No re-chunking.

## What got stood up

- Validated on a 5,000-segment stride sample, then run across the full corpus.
- **S3 Vectors index:** 1024-dim, cosine, float32. `AMAZON_BEDROCK_TEXT` and
  `AMAZON_BEDROCK_METADATA` marked non-filterable; structured fields (video id, title, year, and
  a category label) kept filterable to enable hybrid retrieval.
- **Generation:** Gemma 4 26B-A4B (mixture of experts, ~4B active), BF16, in the nvcr.io vLLM
  26.03 container on the GB10. ~340 s cold start. vLLM `--gpu-memory-utilization` 0.85
  (~111/119 GB); ~51 GB to hold the model. Also benchmarked Llama 3.1 8B (dense, BF16) and
  Gemma 4 31B (dense, Q4).
- **Ingest harness:** threaded Titan embed plus batched S3 Vectors put-vectors (16 workers, run
  from a Mac).
- **Query harness:** on the Spark, timing each stage — Titan embed (cloud), `s3vectors` query
  (cloud), assemble top-5, vLLM generate (local) — over 30 representative questions.
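
The ingest harness above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual code: the segment fields (`id`, `text`, `metadata`), the batch size of 100, and the injected `embed` and `put_batch` callables are all assumptions. Only the 16-worker thread pool and the one-segment-one-vector shape come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_vector_record(segment, embedding):
    """Shape one segment as one vector record: key, float32 data, metadata."""
    return {
        "key": segment["id"],                    # hypothetical field name
        "data": {"float32": list(embedding)},
        "metadata": segment["metadata"],         # hypothetical field name
    }


def ingest(segments, embed, put_batch, workers=16, batch_size=100):
    """Embed each batch on a thread pool, then write it with one batched put.

    `embed(text)` returns a list of floats and `put_batch(records)` writes
    one batch; both are injected so the cloud calls stay outside this sketch.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(segments, batch_size):
            # Executor.map preserves input order, so embeddings align with the batch.
            embeddings = pool.map(lambda seg: embed(seg["text"]), batch)
            records = [to_vector_record(s, e) for s, e in zip(batch, embeddings)]
            put_batch(records)
            written += len(records)
    return written
```

In a real run, `embed` would presumably wrap a Bedrock Runtime `invoke_model` call against Titan Text Embeddings v2, and `put_batch` would wrap the `s3vectors` `put_vectors` call. Processing one batch at a time keeps memory bounded regardless of corpus size.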

## Numbers

| Metric | Value |
|---|---|
| Ingest throughput | 337,345 segments/hour |
| Full corpus index build | ~41 min, 0 failures (5K subset: 56 s) |
| Retrieval latency, from the Spark | p50 279.9 ms · p95 504.6 ms |
| Retrieval latency, attribution | ~180 ms is the Spark's VPN/relay RTT; S3 Vectors service ~100 ms |
| Generation, Gemma 4 26B-A4B (local) | ~22 tok/s |
| Generation, Llama 3.1 8B (local) | ~14 tok/s |
| Generation, Gemma 4 31B Q4 (local) | ~9 tok/s |
| Generation, same models on Bedrock | 7–9× faster (26B-A4B 136–214 tok/s, 31B 34–59 tok/s) |
| Single-query end-to-end | ~6.7 s (~6 s is local generation, not retrieval) |
| Batch, 30 questions | 201.8 s |
| Cost, full corpus embed + index | ~$0.84 one-time + ~$0.07/month |
| Cost, full analysis pass | ~$1.16 |
| Cloud-API generation baseline, 30 answers | ~$0.03 (Haiku-class) to ~$0.11 (Sonnet-class) |
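
Per-stage latency figures such as the p50 and p95 in the table can be collected with a small timer like the one below. It is a sketch under assumptions: the `StageTimer` class, the stage names, and the nearest-rank percentile definition are illustrative choices, and the lab's harness may have computed percentiles differently.

```python
import math
import time
from collections import defaultdict


class StageTimer:
    """Collect wall-clock durations in milliseconds per named stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.samples_ms = defaultdict(list)

    def run(self, stage, fn, *args, **kwargs):
        """Call fn, record how long it took under `stage`, return its result."""
        start = self._clock()
        result = fn(*args, **kwargs)
        self.samples_ms[stage].append((self._clock() - start) * 1000.0)
        return result


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

A query loop would wrap each stage in turn, for example `timer.run("retrieve", query_fn, question)`, then report `percentile(timer.samples_ms["retrieve"], 50)` and `percentile(timer.samples_ms["retrieve"], 95)`. With only 30 questions, the p95 rests on one or two samples, so it is best read as indicative.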

## Seam readiness (substrate)

- **Embed** — needs cloud. Titan v2 in Bedrock, by design.
- **Index / store** — ready. 337K/hr, sub-second queries, and hybrid semantic-plus-metadata
  filtering in one call.
- **Batch query** — ready. Latency is irrelevant for an overnight corpus job.
- **Interactive query** — conditional. Single-query end-to-end ~6.7 s, dominated by local
  generation, not retrieval.
- **Generate** — conditional. Local works; the same model runs 7–9× faster on Bedrock at no cost
  advantage.
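
The "hybrid semantic-plus-metadata filtering in one call" point can be illustrated with a request builder for the `s3vectors` `query_vectors` operation. Treat it as a sketch: the bucket and index names, the metadata key `year`, and the helper names are hypothetical, and the client is passed in rather than created here so the shape of the call is visible without AWS credentials.

```python
EMBED_DIM = 1024  # Titan Text Embeddings v2 dimension used for the index


def build_hybrid_query(bucket, index, query_embedding, metadata_filter=None, top_k=5):
    """Assemble keyword arguments for one similarity query with an optional filter."""
    if len(query_embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} dimensions, got {len(query_embedding)}")
    kwargs = {
        "vectorBucketName": bucket,
        "indexName": index,
        "queryVector": {"float32": list(query_embedding)},
        "topK": top_k,
        "returnMetadata": True,
        "returnDistance": True,
    }
    if metadata_filter:
        # Only keys left filterable at index creation can appear here.
        kwargs["filter"] = metadata_filter
    return kwargs


def run_query(client, kwargs):
    """Send the query and return the list of hits (key, distance, metadata)."""
    return client.query_vectors(**kwargs)["vectors"]
```

For example, `build_hybrid_query("my-bucket", "my-index", vec, {"year": {"$gte": 2020}})` would restrict the nearest-neighbor search to segments from 2020 onward, assuming the index stores a numeric `year` field.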

## What we learned about the substrate

- **Cost is a non-issue at this scale.** The full 13-year corpus embeds and indexes for under a
  dollar one-time. The "spends real money" guardrail was conservative by ~2 orders of magnitude.
- **Retrieval is not the bottleneck.** The ~280 ms seen from the Spark is the box's own network
  path; S3 Vectors' service is ~100 ms. Measure from where the box actually sits.
- **The constraint is local generation throughput.** Single-stream decode on the GB10 is
  memory-bandwidth-bound (~273 GB/s), so active-parameter count is what matters: a ~4B-active MoE
  ran faster than a dense 8B and far faster than a dense 31B.
- **The chunking knob is the real loss in the managed path.** The managed Knowledge Base
  re-chunks with its own strategy and turns on parsing you did not ask for; bring-your-own kept
  one segment as one vector with labels attached, which is what enabled hybrid (label-filtered)
  retrieval. Borrow the vendor's judgment on plumbing; keep your own on chunking.
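
The bandwidth-bound argument above can be made concrete with a back-of-envelope ceiling: if every active weight is read once per generated token, decode rate cannot exceed bandwidth divided by active weight bytes. The sketch below assumes 2 bytes per parameter for BF16 and an idealized 0.5 bytes for Q4; it ignores KV-cache traffic, activations, and dequantization overhead, so it gives an upper bound, not a prediction.

```python
GB10_BANDWIDTH_GB_S = 273.0  # approximate figure quoted for the GB10


def decode_ceiling_tok_s(active_params_billion, bytes_per_param,
                         bandwidth_gb_s=GB10_BANDWIDTH_GB_S):
    """Ceiling on single-stream decode rate when each token reads every active weight once."""
    if active_params_billion <= 0 or bytes_per_param <= 0:
        raise ValueError("parameter count and bytes per parameter must be positive")
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# (label, active parameters in billions, assumed bytes per parameter)
MODELS = [
    ("Gemma 4 26B-A4B, BF16", 4.0, 2.0),  # ~4B active per token
    ("Llama 3.1 8B, BF16", 8.0, 2.0),
    ("Gemma 4 31B, Q4", 31.0, 0.5),       # 4-bit idealized as 0.5 bytes
]

if __name__ == "__main__":
    for label, params_b, bytes_pp in MODELS:
        print(f"{label}: <= {decode_ceiling_tok_s(params_b, bytes_pp):.1f} tok/s")
```

Under these assumptions the ceilings come out near 34, 17, and 18 tok/s, against measured rates of ~22, ~14, and ~9 tok/s. The MoE's lead over the dense 8B is consistent with the estimate. The dense 31B falls well short of its ceiling, which suggests that factors this sketch does not model matter for the quantized model.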

## Model capability and the economic axis

- **Faster is not better.** The chosen MoE (Gemma 4 26B-A4B) won single-stream throughput, but a
  dense Gemma 4 31B reasoned better over the same data. We ran the dense model both on the GB10
  (~9 tok/s) and in Bedrock (34–59 tok/s), and it became the basis for testing more capable
  models, up to foundation models. Reasoning quality scaled with model capability.
- **Value-per-answer, not cost-per-token.** Cheap, fast models minimize token cost; they do not
  maximize the value of the output. A more capable model, up to a foundation model, can be the
  cheaper choice in net when its reasoning carries more business value than the marginal token
  cost. This is the opposite axis from validation: for the judging gate, a cheap strict model
  matched a frontier judge, but for the analytical reasoning, capability paid off. Judging and
  reasoning are different tasks; do not price them the same way.

## Gotchas (the part you actually want)

- **Creating a Bedrock Knowledge Base** (`bedrock:CreateKnowledgeBase`) is account-gated, even
  for an IAM user with full AdministratorAccess and no SCP or RCP in place. The console
  quick-create works. We self-orchestrated the identical Titan-to-S3-Vectors substrate with
  direct API calls, which also gave cleaner per-stage instrumentation.
- **S3 Vectors index parameters** (name, dimension, distance metric, non-filterable metadata
  keys) are immutable after creation. Get them right on first creation, or recreate the index.
- **Bedrock model enablement is fiddly.** Some models are served on a separate
  OpenAI-compatible endpoint (bearer token) rather than `bedrock-runtime`; some are Responses-API
  only; some need an account use-case form; some were not accessible to the project at all.
- **Spark serving:** only one large model fits in unified memory (vLLM
  `--gpu-memory-utilization` 0.85 ≈ 111/119 GB). `docker start` on a stopped vLLM container
  fails; always use a fresh `docker run`.
- **Spark egress is high-latency** (VPN/relay path, ~180 ms RTT), which dominates end-to-end
  retrieval latency. It is not a clean read of S3 Vectors' own service latency.
- **AWS identity:** the old long-lived access keys no longer worked; a cached developer-tool
  token could not sign Bedrock requests; the root account cannot assume roles and was blocked
  from some create actions. We created a dedicated IAM user as the working identity.
- **Bulk transcript and metadata fetches** from the public source hit anti-bot throttling (~500
  requests, then blocked). A first-party data API solved the metadata; the last few percent of
  fetches still needed a residential IP or cookies.
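
Because the index parameters noted above cannot be changed after creation, it can help to build and validate the `create_index` request in one place before sending it. The sketch below is an assumption-laden illustration: the helper names and the bucket and index names are hypothetical, and the client is passed in so nothing here touches AWS. The non-filterable keys are the two named earlier in this document.

```python
NON_FILTERABLE_KEYS = ["AMAZON_BEDROCK_TEXT", "AMAZON_BEDROCK_METADATA"]


def build_create_index_kwargs(bucket, index, non_filterable_keys=NON_FILTERABLE_KEYS,
                              dimension=1024, metric="cosine"):
    """Assemble the parameters that are fixed for the lifetime of the index."""
    if metric not in {"cosine", "euclidean"}:
        raise ValueError(f"unsupported distance metric: {metric}")
    if dimension < 1:
        raise ValueError("dimension must be a positive integer")
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "dataType": "float32",
        "dimension": dimension,
        "distanceMetric": metric,
        "metadataConfiguration": {"nonFilterableMetadataKeys": list(non_filterable_keys)},
    }


def recreate_index(client, kwargs):
    """Delete and recreate the index; stored vectors are lost and need re-ingesting."""
    client.delete_index(vectorBucketName=kwargs["vectorBucketName"],
                        indexName=kwargs["indexName"])
    client.create_index(**kwargs)
```

At the ingest rate reported above, a full rebuild costs well under an hour, so recreating the index is a tolerable fallback at this corpus size.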

## On validation (generic)

Generating LLM "findings" is free and easy. Generating true ones is not. It takes a quantitative
gate, cross-run reproducibility, and a strong judge, and judge strength dominates: a mid-tier
judge will confidently rubber-stamp a wrong pattern that a strong judge refutes. The most
expensive judge is not required. A cheap, strict judge paired with one frontier judge, trusted
where they agree, did the work. Judge-tier economics span roughly $0.44 to $50-plus per thousand
calls, and the cheap, strict open judge agreed with the most expensive frontier judge about as
often as any other judge did.
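
The generic shape of that gate (quantitative check, cross-run reproducibility, then two judges trusted only where they agree) might look like the sketch below. Everything in it is hypothetical: the function name, the three-way outcome, and the choice to route disagreements to review are illustrative assumptions, not the lab's configuration.

```python
def gate_finding(passes_quant_gate, reproduced_runs, min_runs,
                 cheap_judge_supports, frontier_judge_supports):
    """Return 'accept', 'reject', or 'review' for one candidate finding."""
    if not passes_quant_gate:
        return "reject"   # fails the quantitative gate outright
    if reproduced_runs < min_runs:
        return "reject"   # did not reproduce across enough runs
    if cheap_judge_supports != frontier_judge_supports:
        return "review"   # judges disagree, so neither verdict is trusted
    return "accept" if cheap_judge_supports else "reject"
```

For instance, `gate_finding(True, 3, 3, True, True)` returns `"accept"`, while a finding the two judges split on returns `"review"` regardless of which judge supported it.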

The specific judge configuration, the rubric and thresholds, the classification taxonomy, and the
corpus conclusions are proprietary and are not included here.

---

*Layer2C Labs · The CTO Advisor LLC · labs.layer2c.com*