Launch Route
Attention To Serving
A browser-local route from attention math to KV-cache memory, serving tradeoffs, a prediction-first lab, an evaluation guardrail, and scoped contribution notes. This is a learning route, not a benchmark result or production validation.

First Route Readiness
First route: attention to serving
A new learner should see one concrete path: map or search a claim, inspect the route graph, open Efficient Attention, predict the KV-cache lever, review the claim boundary, and leave a reproduction or repair note attached to the exact object.
Reveal which symbol moves, then save a checkpoint.
No benchmark result, hosted compute, automatic expert review, or live runtime-performance claim is made by this v0.
Evidence Spine
Source-scoped, not a production claim.
Paper-checked mechanisms, mapped implementation/protocol items, and toy witnesses stay separate. Open gaps remain visible; these labs do not measure live system performance.
Comparison Guardrail
Comparison prep note, not a benchmark.
Protocol-scoped artifact.
Before comparing serving systems, pin the model, serving stack, task or scenario, evaluator, metric, hardware/runtime, contamination caveat, raw artifacts, reviewer path, and toy-vs-production boundary. This route does not report live performance, current vLLM behavior, model-order claims, or production validation.
Contract/schema only; no benchmark run, evaluation score, live serving measurement, hardware sizing or deployment guidance, current runtime claim, or production validation.
content/research-rooms/attention-to-serving/level-4/contracts/serving_evaluation_result_contract_v0.json
Name the exact model family, checkpoint or weights, tokenizer, precision, and any quantization.
Pin the serving runtime, release or commit, scheduler/cache settings, batch policy, and dependency versions.
Pin the evaluator or benchmark harness version, config file, prompt template, and command surface.
Name the scenario, task, dataset split, request shape, input/output lengths, and any limits.
Declare the metric, aggregation rule, quality target, latency statistic, and confidence interval if used.
Record accelerator, CPU, memory, interconnect, driver, OS, batch limits, and concurrency policy.
State dataset version, split, filtering, prompt exposure risk, and whether contamination was checked.
Separate the local formula/demo witness from any measured system result, deployment decision, or public system comparison.
Future result sections: identity, protocol_binding, system_under_test, model, serving_stack, evaluator. Promotion requires that a future result artifact exists under level-4/results/, that the exact evaluator command and config are pinned, and that the dataset/task version and contamination caveat are pinned.
Allowed metric groups after review: quality, latency, throughput, resource_memory.
What is being measured, under which scenario, with which versioned evaluator, and what remains a toy witness?
1 pinned run gap remains before comparison.
Learning Lab Ladder
Lab ladder: learn first, then reproduce.
This route is ready for source-grounded reading, browser witnesses, standalone Level 2 scripts, one narrow Level 3 local KV-cache experiment, and a Level 4 result-contract gate. Evaluation runs and open contribution tasks are explicit next artifacts, not finished claims.
Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone route script that prints QK scores, softmax rows, and value mixtures with the same symbol names.
Not ready: The code witness exists, but no standalone prerequisite script with pinned inputs, expected output, and failure note exists yet. Package a standalone prerequisite script that shows how dot products become attention logits.
Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone script that prints position vectors and offset dot products.
Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone RoPE script with explicit phase, offset, and caveat outputs.
Ready: small local experiment. A seeded synthetic request trace keeps GQA/MQA formula-memory ratios stable across context-window and batch-size grids while larger context windows increase formula KV-cache estimates under the clipping rule.
python3 content/research-rooms/attention-to-serving/level-3/kv_cache_width_sweep.py --config content/research-rooms/attention-to-serving/level-3/configs/kv_cache_width_sweep_v0.json
Pinned synthetic config - not benchmark, eval, live serving, current runtime, hardware sizing, or production validation.
Ready: standalone script. A streaming online-softmax merge matches full softmax attention on fixed scores/values while toy scratch accounting avoids materializing the full score matrix.
python3 content/research-rooms/attention-to-serving/level-2/flashattention_io_witness.py
Toy local witness - not benchmark, live serving, capacity-planning, or production validation.
Ready: standalone script. For fixed shape and dtype, the GQA KV-cache ratio against MHA is H_kv / H_q, with each group sharing one KV head.
python3 content/research-rooms/attention-to-serving/level-2/gqa_cache_width_witness.py
Toy local witness - not benchmark, live serving, capacity-planning, or production validation.
Ready: small local experiment. A seeded synthetic request trace keeps GQA/MQA formula-memory ratios stable across context-window and batch-size grids while larger context windows increase formula KV-cache estimates under the clipping rule.
python3 content/research-rooms/attention-to-serving/level-3/kv_cache_width_sweep.py --config content/research-rooms/attention-to-serving/level-3/configs/kv_cache_width_sweep_v0.json
Pinned synthetic config - not benchmark, eval, live serving, current runtime, hardware sizing, or production validation.
Ready: standalone script. Toy latency accounting decomposes TTFT plus TPOT, KV-cache formula size, page waste, and explicit SLO pass/fail samples.
python3 content/research-rooms/attention-to-serving/level-2/serving_loop_witness.py
Toy local witness - not benchmark, live serving, capacity-planning, or production validation.
Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone speculative decoding script with draft/target outputs and acceptance caveats.
Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone MoE routing script that reports token dispatch and load-balance assumptions.
Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone MoE serving toy script with dispatch, all-to-all, and caveat outputs.
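The streaming online-softmax claim in the ladder above can be checked with a few lines of plain Python. This is a minimal sketch, not the route's flashattention_io_witness.py script: the function names, the block size, and the scores and values are all hypothetical, chosen only to show that a block-by-block merge with a running maximum and running sum reproduces full softmax attention for one query.

```python
import math

def full_softmax_attention(scores, values):
    """Reference: softmax over all scores at once, then mix the value vectors."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def streaming_attention(scores, values, block_size=2):
    """Online-softmax merge: visit scores in blocks, never holding all weights."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical fixed inputs: five key positions, two-dimensional values.
scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.7]]

full = full_softmax_attention(scores, values)
streamed = streaming_attention(scores, values, block_size=2)
print(all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(full, streamed)))
```

The sketch says nothing about I/O counts or speed; it only illustrates why the merge is exact up to floating-point rounding.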
Level 3 is partially supported for one local CPU seeded synthetic-request KV-cache formula experiment. Level 4 has a contract template only and no scored result; Levels 4-5 remain gated/planned. No hosted compute, benchmark result, shared GPU job, production validation, current runtime claim, hardware guidance, or live serving measurement is implied.
Does not claim: hosted compute, shared GPU jobs, comparison scores, production validation, live serving measurement.
Contribution Loop
Contribute to this route object.
Pick the exact claim, equation, lab, or gap you checked. The public intake packet carries the attached object, evidence, room task, and next repair so the Attention To Serving route can improve without becoming a feed. This is an interim public preview surface, not a public GitHub issue board yet.
maintainer triage -> route/content fix -> visible resolution note
maintainer triage -> prerequisite route patch -> reviewer check
source triage -> claim wording patch -> GPT Pro or maintainer source check
maintainer triage -> reproduce locally -> code witness patch -> validator run
maintainer triage -> artifact note -> scope check -> visible resolution
maintainer triage -> scope check -> queue or close with explanation
Keep claims source-scoped. This preview prepares an object-attached note; it does not create a public GitHub issue yet, and no benchmark, production, or live-serving claim is accepted unless an approved result artifact exists.
One source-scoped route from attention math to serving tradeoffs.
This module reads transformer inference as one continuous mechanism: attention defines the copy operation, cache design decides what can be reused, and serving turns every symbol into memory, latency, and quality tradeoffs.
Attention
Which previous tokens should this query copy from?
A weighted value vector per head.
Carried Equations
Every formula is a route object.
Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes
The factor 2 counts keys and values. Double the context length T and this term doubles. During decode the cache is touched on every generated token.
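As a worked instance of the formula, here is a minimal Python sketch. The function and argument names are hypothetical; the numbers are the KV lab settings quoted in the companion prompt further down the page (B=4, T=32,768, 32 layers, H_kv=32, d_head=128, 2 bytes per scalar). Like the page's own calculator, it is a formula-only estimate and not a serving measurement.

```python
def kv_cache_bytes(batch, n_layers, tokens, n_kv_heads, d_head, bytes_per_scalar):
    """Formula-only KV-cache size; the factor 2 counts keys and values."""
    return batch * n_layers * tokens * n_kv_heads * d_head * 2 * bytes_per_scalar

base = kv_cache_bytes(batch=4, n_layers=32, tokens=32_768,
                      n_kv_heads=32, d_head=128, bytes_per_scalar=2)
doubled = kv_cache_bytes(batch=4, n_layers=32, tokens=65_536,
                         n_kv_heads=32, d_head=128, bytes_per_scalar=2)

print(f"{base / 1e9:.1f} GB")  # 68.7 GB (decimal gigabytes)
print(doubled / base)          # 2.0: doubling T doubles the estimate
```

Because every factor enters as a plain product, doubling any single one of B, N_layers, T, H_kv, d_head, or bytes doubles the estimate in the same way.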
KV Memory Lab
Change the serving budget.
Hold B, N_layers, T, d_head, and bytes fixed. A model variant shares K/V heads. Commit to a prediction of which term changes before touching the controls.
Choose an answer to unlock the calculator.
0% smaller than full MHA under these settings.
MHA / GQA / MQA
Same attention equation, different cache width.
MHA: Each query head owns its KV head.
GQA: Query heads share KV within groups.
MQA: All query heads share one KV head.
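The three cases differ only in H_kv, so for fixed shape and dtype the cache ratio against MHA is H_kv / H_q. The sketch below is a hypothetical illustration of that ratio; H_q=32 matches the lab settings on this page, while the GQA choice of 8 KV heads is an assumption made only for the example.

```python
def kv_ratio_vs_mha(n_query_heads, n_kv_heads):
    """KV-cache size relative to full MHA, with all other factors held fixed."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    return n_kv_heads / n_query_heads

def percent_smaller(n_query_heads, n_kv_heads):
    """How much smaller the formula estimate is than full MHA, in percent."""
    return 100.0 * (1.0 - kv_ratio_vs_mha(n_query_heads, n_kv_heads))

H_Q = 32  # query heads, as in the page's lab settings
for name, h_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:  # GQA=8 is illustrative
    print(name, kv_ratio_vs_mha(H_Q, h_kv), percent_smaller(H_Q, h_kv))
# MHA 1.0 0.0
# GQA 0.25 75.0
# MQA 0.03125 96.875
```

The MHA row reproduces the "0% smaller than full MHA" readout that the calculator shows when H_kv equals H_q.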
Decoding Lab
Sampling controls change the next-token set.
The lab reshapes next-token probabilities with temperature, filters the candidate set, then renormalizes what remains. Predict the binding constraint before inspecting the token list.
Choose first; the probe will then set the sliders to the high-temperature, top-k = 1 case.
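The reshape, filter, renormalize sequence described above can be sketched in plain Python. This is an illustrative assumption about the order of operations, not the lab's actual code: the function name, the token strings, and the logits are hypothetical. It is best run after committing to a prediction for the probe case.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Temperature-scale logits, keep the top_k most likely tokens, renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    if top_k is not None:
        kept = sorted(probs, key=probs.get, reverse=True)[:top_k]
        probs = {tok: probs[tok] for tok in kept}
    kept_mass = sum(probs.values())
    return {tok: p / kept_mass for tok, p in probs.items()}

# Hypothetical logits for three candidate tokens.
logits = {"the": 2.0, "a": 1.0, "cat": 0.5}

flat = next_token_distribution(logits, temperature=5.0)
probe = next_token_distribution(logits, temperature=5.0, top_k=1)
print(flat)   # high temperature alone flattens the distribution
print(probe)  # {'the': 1.0}
```

In the probe case the filter leaves a single candidate, so the flattening from temperature has no effect on which token can be sampled.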
Research Room
Keep the argument attached to the exact claim.
Pick a route object before asking for help. The selected paper claim, equation, lab, or misconception becomes the saved focus and companion context.
KV compression claim
What exactly is being compressed: heads, tokens, values, precision, or cache pages?
Local action draft: unavailable. Needs a canonical object key.
This object needs a content object key before local action drafts can attach to it.
Evidence to inspect:
- Exact source quote or local paper clue that motivates the claim
- The equation, concept, or toy lab that could falsify the claim
- Benchmarks, assumptions, and counterexamples that would change confidence
What would resolve this:
- The claim is either source-supported, weakened, or marked unverified
- The mechanism is separated from benchmark or marketing language
- The learner knows what evidence would raise or lower confidence
I am working in Continuous Function's research reading room.
Object: claim - KV compression claim
Context: Paper claim
Anchor id: claim/attention-serving/what-is-compressed
Open question: What exactly is being compressed: heads, tokens, values, precision, or cache pages?
Evidence to inspect:
- Exact source quote or local paper clue that motivates the claim
- The equation, concept, or toy lab that could falsify the claim
- Benchmarks, assumptions, and counterexamples that would change confidence
What would resolve this:
- The claim is either source-supported, weakened, or marked unverified
- The mechanism is separated from benchmark or marketing language
- The learner knows what evidence would raise or lower confidence
Answer as a careful research tutor: stay source-grounded, separate verified evidence from assumptions, name the relevant math objects, and end with one next action.
claim/attention-serving/what-is-compressed
AI Focus Object
Ask about one exact object.
The companion prompt follows this selection, so a question can attach to a stage, equation, lab checkpoint, or discussion anchor instead of floating over the whole page.
Object Companion
Ask beside the selected object
Ask about the current paper claim, equation object, lab setting, saved observation, or discussion anchor. In static preview this remains a grounded prompt surface; when the gateway is configured it becomes live assistance.
You are my AI learning companion for Continuous Function.
Current context:
Study module: Attention -> Efficient Attention -> RoPE -> FlashAttention -> Long Context -> LLM Serving -> Decoding.
Learning surface: Attention To Serving route.
What this page says: Ask about the current paper claim, equation object, lab setting, saved observation, or discussion anchor. In static preview this remains a grounded prompt surface; when the gateway is configured it becomes live assistance.
Current section:
Active stage: Attention (content-addressed weighted copy)
Active equation: KV cache memory: Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes
Focused object: Equation - KV cache memory
Equation: KV cache memory
Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes
Source: Efficient Attention / LLM Serving
Symbols: B scalar, N_layers scalar, T scalar, H_kv scalar, d_head scalar, 2 constant, bytes scalar
Stress test: Double the context and this term doubles. During decode it is touched on every generated token.
KV lab: B=4, T=32,768, layers=32, H_q=32, H_kv=32, d_head=128, precision=2 bytes/scalar
Current KV estimate: 68.7 GB; 0% smaller than full MHA under these settings.
Formula scope: Toy formula-only KV-memory estimate, not a live serving measurement. Excludes allocator behavior, paged-cache metadata, scheduler effects, tensor layout overhead, live throughput/latency, hardware sizing/deployment guidance, and current vLLM runtime evidence.
Route source scope: source-scoped in progress; 4 claim-checked stages, 3 source-mapped stages, 6 explicit gaps; this route is not measured against live serving systems.
Evaluation protocol: draft protocol note, not run, no scores reported; required fields Model + weights, Serving stack, Evaluator version, Scenario / task config, Metric + aggregation, Hardware / runtime, Data + contamination caveat, Toy vs production boundary; Level 4 contract serving-evaluation-result-contract-v0 has no scores reported; no benchmark comparison until the setup, result artifacts, and review path are pinned.
Lab ladder: 12 concepts mapped; Level 0 12 supported / 0 partial / 0 planned; Level 1 12 supported / 0 partial / 0 planned; Level 2 5 supported / 7 partial / 0 planned; Level 3 2 supported / 0 partial / 10 planned; Level 4 0 supported / 0 partial / 12 planned; Level 5 0 supported / 0 partial / 12 planned.
Runnable Level 2 scripts: kv-memory-witness, gqa-mqa-cache-width-witness, flash-attention-io-count-witness, serving-loop-toy-witness, cost-latency-observability-witness.
Runnable Level 3 experiments: kv-cache-width-sweep-v0. Levels 4-5 gated/planned. Level 3 is partially supported for one local CPU seeded synthetic-request KV-cache formula experiment. Level 4 has a contract template only and no scored result; Levels 4-5 remain gated/planned. No hosted compute, benchmark result, shared GPU job, production validation, current runtime claim, hardware guidance, or live serving measurement is implied.
Contribution loop: 6 issue templates (Confusing jump, Missing prerequisite, Source-scope issue, Code witness failed, Reproduction succeeded, Reproduction failed, Propose open task); required anchors feedback_type, room_id, route_id, route, object_anchor_type, room_object_id, object_ref, anchor_detail, current_question, summary, room_task_id, expected_resolution; templates confusing_jump.yml, missing_prerequisite.yml, source_scope_issue.yml, code_witness_failed.yml, reproduction_report.yml, proposed_open_task.yml; no social feed, generic chat, hosted compute, or automatic expert review.
Route progress: 0/7 stages ready; next repair Attention.
Paper evidence carried: none.
No saved lab observation yet.
Suggested next step: Commit to the KV memory prediction, then change one serving variable at a time.
Learner goal: Understand the idea.
Learner comfort level: New to this.
Preferred explanation style: Visual first.
Task: Help me inspect a paper claim about KV cache compression. Identify which symbol or system bottleneck the claim changes, what remains fixed, and what evidence I should ask for before believing it.
Answer in a way that helps me learn: ask one clarifying question only if needed, use intuition before notation, and end with one thing I should try on the page.