Launch Route

Attention To Serving

A browser-local route from attention math to KV-cache memory, serving tradeoffs, a prediction-first lab, an evaluation guardrail, and scoped contribution notes. This is a learning route, not a benchmark result or production validation.

browser-local preview · KV prediction lab · claim guardrail · scoped contribution

Start with Efficient Attention · Open KV prediction lab · Review eval guardrail

A physical KV memory lab with attention heads feeding a smaller cache.

Stages shown: 01 QK^T · 02 KV cache · 03 GQA · 04 TPOT

Decode memory: Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes

First Route Readiness

First route: attention to serving

A new learner should see one concrete path: map or search a claim, inspect the route graph, open Efficient Attention, predict the KV-cache lever, review the claim boundary, and leave a reproduction or repair note attached to the exact object.

Question: How does attention become a serving bottleneck?

Next repair: Efficient Attention

Lab state: KV prediction lab live

Attention To Serving: Commit the KV-memory prediction before changing controls.

Reveal which symbol moves, then save a checkpoint.

Next: Efficient Attention concept · Open Efficient Attention · KV prediction checkpoint


Scope labels required: Preview/local where applicable; claim review before comparison; reproduction notes stay object-attached.

No benchmark result, hosted compute, automatic expert review, or live runtime-performance claim is made by this v0.

Evidence Spine

Source-scoped, not a production claim.

Paper-checked mechanisms, mapped implementation/protocol items, and toy witnesses stay separate. Open gaps remain visible; these labs do not measure live system performance.

spine stages: 7 (partial)

claim checked: 4 (mechanism anchors)

mapped context: 3 (serving/eval scope)

open gaps: 6 (kept explicit)

Source-spine tracked internally. Use this panel as the learner boundary: source-scoped, not production-verified.

Comparison Guardrail

Comparison prep note, not a benchmark.

Protocol-scoped artifact.

Before comparing serving systems, pin the model, serving stack, task or scenario, evaluator, metric, hardware/runtime, contamination caveat, raw artifacts, reviewer path, and toy-vs-production boundary. This route does not report live performance, current vLLM behavior, model-order claims, or production validation.

artifact: draft protocol note

run: not run

result: no scores reported

scope: no production evidence

Level 4 boundary: Serving evaluation result artifact contract v0

Contract/schema only; no benchmark run, evaluation score, live serving measurement, hardware sizing or deployment guidance, current runtime claim, or production validation.

content/research-rooms/attention-to-serving/level-4/contracts/serving_evaluation_result_contract_v0.json

Required: Model + weights

Name the exact model family, checkpoint or weights, tokenizer, precision, and any quantization.

Required: Serving stack

Pin the serving runtime, release or commit, scheduler/cache settings, batch policy, and dependency versions.

Required: Evaluator version

Pin the evaluator or benchmark harness version, config file, prompt template, and command surface.

Required: Scenario / task config

Name the scenario, task, dataset split, request shape, input/output lengths, and any limits.

Required: Metric + aggregation

Declare the metric, aggregation rule, quality target, latency statistic, and confidence interval if used.

Required: Hardware / runtime

Record accelerator, CPU, memory, interconnect, driver, OS, batch limits, and concurrency policy.

Required: Data + contamination caveat

State dataset version, split, filtering, prompt exposure risk, and whether contamination was checked.

Required: Toy vs production boundary

Separate the local formula/demo witness from any measured system result, deployment decision, or public system comparison.
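As a rough illustration of how the eight required fields above could gate a comparison, the sketch below checks a draft note for unpinned fields. The snake_case keys, the `missing_fields` helper, and the sample draft are hypothetical and are not taken from the contract JSON file named above.

```python
# Hypothetical keys for the eight required fields listed above.
# These names are illustrative assumptions, not the contract's real schema.
REQUIRED_FIELDS = {
    "model_weights": "Model + weights",
    "serving_stack": "Serving stack",
    "evaluator_version": "Evaluator version",
    "scenario_task_config": "Scenario / task config",
    "metric_aggregation": "Metric + aggregation",
    "hardware_runtime": "Hardware / runtime",
    "data_contamination_caveat": "Data + contamination caveat",
    "toy_vs_production_boundary": "Toy vs production boundary",
}


def missing_fields(note):
    """Return the labels of required fields that are absent or blank."""
    gaps = []
    for key, label in REQUIRED_FIELDS.items():
        value = note.get(key)
        if value is None or not str(value).strip():
            gaps.append(label)
    return gaps


# A made-up draft note: one field pinned, one left blank, six absent.
draft = {"model_weights": "example-model, fp16 weights", "serving_stack": ""}
gaps = missing_fields(draft)
print(f"{len(gaps)} required fields still unpinned:")
for label in gaps:
    print(" -", label)
```

A note with any unpinned field stays a comparison prep note; nothing here produces or implies a score.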

Future result contract: no scores reported

Future result sections: identity, protocol_binding, system_under_test, model, serving_stack, evaluator. Promotion requires that a future result artifact exists under level-4/results/, that the exact evaluator command and config are pinned, and that the dataset/task version and contamination caveat are pinned.

Allowed metric groups after review: quality, latency, throughput, resource_memory.

Scope: I compared nothing yet; I first pinned the evaluation contract.

What is being measured, under which scenario, with which versioned evaluator, and what remains a toy witness?

1 pinned run gap remains before comparison.

Open evaluation pipeline concept

Learning Lab Ladder

Lab ladder: learn first, then reproduce.

This route is ready for source-grounded reading, browser witnesses, standalone Level 2 scripts, one narrow Level 3 local KV-cache experiment, and a Level 4 result-contract gate. Evaluation runs and open contribution tasks are explicit next artifacts, not finished claims.

Level 0 · Read concept · 12 supported

Level 1 · Browser demo · 12 supported

Level 2 · Toy notebook / script · 5 supported · 7 partial

Level 3 · Small reproducible experiment · 2 supported · 10 planned

Level 4 · Benchmark / eval · none supported · 12 planned

Level 5 · Open project contribution · none supported · 12 planned

Attention Transformers · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone route script that prints QK scores, softmax rows, and value mixtures with the same symbol names.

Dot Product · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone prerequisite script with pinned inputs, expected output, and failure note exists yet. Package a standalone prerequisite script that shows how dot products become attention logits.

Positional Encoding · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone script that prints position vectors and offset dot products.

RoPE · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone RoPE script with explicit phase, offset, and caveat outputs.

Efficient Attention · Level 3: Small reproducible experiment

L0 supported · L1 supported · L2 supported · L3 supported · L4 planned · L5 planned

Ready: small local experiment. A seeded synthetic request trace keeps GQA/MQA formula-memory ratios stable across context-window and batch-size grids while larger context windows increase formula KV-cache estimates under the clipping rule.

python3 content/research-rooms/attention-to-serving/level-3/kv_cache_width_sweep.py --config content/research-rooms/attention-to-serving/level-3/configs/kv_cache_width_sweep_v0.json

Pinned synthetic config - not benchmark, eval, live serving, current runtime, hardware sizing, or production validation.
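The sketch below illustrates the kind of sweep this card describes; it is not the route's kv_cache_width_sweep.py. It assumes the clipping rule means each request caches min(request length, context window) tokens, and every name, grid value, and shape parameter in it is a made-up placeholder.

```python
import random


def formula_kv_bytes(token_counts, n_layers, kv_heads, d_head, bytes_per_scalar):
    """Formula-only KV estimate summed over requests (keys plus values)."""
    return sum(token_counts) * n_layers * kv_heads * d_head * 2 * bytes_per_scalar


def synthetic_trace(seed, n_requests, max_len):
    """Seeded synthetic request lengths; purely illustrative."""
    rng = random.Random(seed)
    return [rng.randint(1, max_len) for _ in range(n_requests)]


def sweep(seed=0, windows=(1024, 4096, 16384), batch_sizes=(1, 4, 16),
          q_heads=32, kv_heads=8, n_layers=32, d_head=128, bytes_per_scalar=2):
    rows = []
    for batch in batch_sizes:
        trace = synthetic_trace(seed, batch, max_len=32_768)
        for window in windows:
            # Assumed clipping rule: cache at most `window` tokens per request.
            clipped = [min(length, window) for length in trace]
            mha = formula_kv_bytes(clipped, n_layers, q_heads, d_head, bytes_per_scalar)
            gqa = formula_kv_bytes(clipped, n_layers, kv_heads, d_head, bytes_per_scalar)
            rows.append((batch, window, gqa, gqa / mha))
    return rows


rows = sweep()
for batch, window, gqa_bytes, ratio in rows:
    print(f"batch={batch:2d} window={window:5d} gqa={gqa_bytes / 1e9:7.3f} GB ratio={ratio}")
```

Under these assumptions the GQA/MHA ratio is H_kv / H_q in every grid cell, because the clipped token total cancels, while the byte estimate never shrinks as the window grows.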

Flash Attention · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 supported · L3 planned · L4 planned · L5 planned

Ready: standalone script. A streaming online-softmax merge matches full softmax attention on fixed scores/values while toy scratch accounting avoids materializing the full score matrix.

python3 content/research-rooms/attention-to-serving/level-2/flashattention_io_witness.py

Toy local witness - not benchmark, live serving, capacity-planning, or production validation.
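To make the online-softmax claim concrete, here is a minimal sketch for a single query row, using only the standard library. It is not the route's flashattention_io_witness.py; the function names, scores, and values are illustrative assumptions.

```python
import math


def full_softmax_attention(scores, values):
    """Reference: softmax over all scores at once, then mix the values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def streaming_attention(scores, values, block_size):
    """Online softmax: visit scores block by block, keeping only a running
    max, a running normalizer, and an unnormalized value accumulator."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale earlier partial results to the new max (exp(-inf) is 0.0).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
reference = full_softmax_attention(scores, values)
streamed = streaming_attention(scores, values, block_size=2)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(reference, streamed)))
```

The streaming version holds one block of scores at a time, which is the toy analogue of not materializing the full score matrix.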

Grouped Query Attention · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 supported · L3 planned · L4 planned · L5 planned

Ready: standalone script. For fixed shape and dtype, the GQA KV-cache ratio against MHA is H_kv / H_q, with each group sharing one KV head.

python3 content/research-rooms/attention-to-serving/level-2/gqa_cache_width_witness.py

Toy local witness - not benchmark, live serving, capacity-planning, or production validation.
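A small sketch of the grouping idea: each query head is mapped to the KV head its group shares, and the cache-width ratio falls out of counting distinct KV heads. The contiguous grouping and the head counts here are assumptions for illustration, not output from gqa_cache_width_witness.py.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Map a query head to its shared KV head, assuming contiguous groups."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_kv_heads must divide n_q_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


n_q_heads, n_kv_heads = 8, 2  # illustrative head counts
mapping = [kv_head_for_query_head(q, n_q_heads, n_kv_heads) for q in range(n_q_heads)]
cache_ratio = len(set(mapping)) / n_q_heads  # cached KV heads relative to MHA

print(mapping)      # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_ratio)  # 0.25
```

Setting n_kv_heads equal to n_q_heads recovers MHA (ratio 1.0), and n_kv_heads = 1 gives MQA.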

Long Context · Level 3: Small reproducible experiment

L0 supported · L1 supported · L2 supported · L3 supported · L4 planned · L5 planned

LLM Serving · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 supported · L3 planned · L4 planned · L5 planned

Ready: standalone script. Toy latency accounting decomposes TTFT plus TPOT, KV-cache formula size, page waste, and explicit SLO pass/fail samples.

python3 content/research-rooms/attention-to-serving/level-2/serving_loop_witness.py

Toy local witness - not benchmark, live serving, capacity-planning, or production validation.
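The accounting in this card can be sketched as follows. The sketch assumes request latency is TTFT plus TPOT for each output token after the first, and that page waste is the unused slots in the last allocated page; the sample numbers, the page size, and the SLO threshold are invented for illustration and are not measurements.

```python
import math


def request_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Toy decomposition: time to first token, then one TPOT per later token."""
    return ttft_ms + tpot_ms * max(output_tokens - 1, 0)


def page_waste_tokens(cached_tokens, page_size):
    """Unused token slots when the cache is allocated in fixed-size pages."""
    pages = math.ceil(cached_tokens / page_size)
    return pages * page_size - cached_tokens


# Invented samples, not measured values.
samples = [
    {"ttft_ms": 200.0, "tpot_ms": 20.0, "output_tokens": 51},
    {"ttft_ms": 900.0, "tpot_ms": 35.0, "output_tokens": 101},
]
slo_ms = 2000.0  # hypothetical end-to-end latency target

for sample in samples:
    total = request_latency_ms(**sample)
    print(total, "pass" if total <= slo_ms else "fail")
# 1200.0 pass
# 4400.0 fail

print(page_waste_tokens(1000, 16))  # 8
```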

Speculative Decoding · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone speculative decoding script with draft/target outputs and acceptance caveats.

Mixture Of Experts · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone MoE routing script that reports token dispatch and load-balance assumptions.

MoE Serving · Level 2: Toy notebook / script

L0 supported · L1 supported · L2 partially supported · L3 planned · L4 planned · L5 planned

Not ready: The code witness exists, but no standalone route script with pinned inputs, expected output, and failure note exists yet. Package a standalone MoE serving toy script with dispatch, all-to-all, and caveat outputs.

Levels 4-5 gated/planned

Level 3 is partially supported for one local CPU seeded synthetic-request KV-cache formula experiment. Level 4 has a contract template only and no scored result; Levels 4-5 remain gated/planned. No hosted compute, benchmark result, shared GPU job, production validation, current runtime claim, hardware guidance, or live serving measurement is implied.

Does not claim: hosted compute, shared GPU jobs, comparison scores, production validation, live serving measurement.

Contribution Loop

Contribute to this route object.

Pick the exact claim, equation, lab, or gap you checked. The public intake packet carries the attached object, evidence, room task, and next repair so the Attention To Serving route can improve without becoming a feed. This is an interim public preview surface, not a public GitHub issue board yet.

Attached object: KV cache memory (concept:attention-transformers/efficient-attention)

Status: Public intake packet (template backed)

Room task: beginner-prerequisite-scan. Double the context and this term doubles. During decode it is touched on every generated token.

Confusing jump · Missing prerequisite · Source-scope issue · Code witness failed · Reproduction succeeded · Reproduction failed · Propose open task

Confusing jump: A precise confusion note that can become a bridge, prerequisite, or wording repair.

maintainer triage -> route/content fix -> visible resolution note

Missing prerequisite: A named missing concept, symbol, or shape that can become a prerequisite bridge.

maintainer triage -> prerequisite route patch -> reviewer check

Source-scope issue: Exact wording plus source span, with safer source-scoped wording proposed.

source triage -> claim wording patch -> GPT Pro or maintainer source check

Code witness failed: Exact command, environment, expected output, and failure output for a runnable witness repair.

maintainer triage -> reproduce locally -> code witness patch -> validator run

Reproduction report: A success, failure, or partial local run note with caveats intact.

maintainer triage -> artifact note -> scope check -> visible resolution

Propose open task: One object-attached task with acceptance criteria, needed sources or artifacts, and reviewer path.

maintainer triage -> scope check -> queue or close with explanation

feedback -> triage -> fix or explain -> reviewer check -> visible resolution

No social feed, generic chat, hosted compute, public issue board, or automatic expert review.

Keep claims source-scoped. This preview prepares an object-attached note; it does not create a public GitHub issue yet, and no benchmark, production, or live-serving claim is accepted unless an approved result artifact exists.

Launch Route

One source-scoped route from attention math to serving tradeoffs.

This module reads transformer inference as one continuous mechanism: attention defines the copy operation, cache design decides what can be reused, and serving turns every symbol into memory, latency, and quality tradeoffs.

content-addressed weighted copy

Attention

Which previous tokens should this query copy from?

A weighted value vector per head.
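A minimal single-head sketch of the weighted copy: one query scores every earlier key, softmax turns the scores into weights, and the output is the weighted mix of values. The 1/sqrt(d) scaling is the usual scaled dot-product convention and is an assumption here, as are the toy vectors and the `attend` name.

```python
import math


def attend(query, keys, values):
    """Return (scores, weights, mixed value) for one query and one head."""
    d = len(query)
    # QK^T for a single query row, with the conventional 1/sqrt(d) scaling.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return scores, weights, mixed


# Toy vectors: the first key points the same way as the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

scores, weights, mixed = attend(query, keys, values)
print("scores :", scores)
print("weights:", weights)
print("mixed  :", mixed)
```

The key most aligned with the query gets the largest weight, so the output leans toward that token's value: a soft, content-addressed copy.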

Carried Equations

Every formula is a route object.

Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes

B: active batch size (scalar)

N_layers: number of transformer layers (scalar)

T: cached tokens per sequence (scalar)

H_kv: key/value heads after MHA, GQA, or MQA sharing (scalar)

d_head: width of each key/value head (scalar)

2: keys plus values are both cached (constant)

bytes: bytes per scalar for the chosen precision (scalar)

Source: Efficient Attention / LLM Serving

Double the context and this term doubles. During decode it is touched on every generated token.
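The formula above can be checked with a few lines of arithmetic. This sketch plugs in the lab's current settings (B=4, 32 layers, T=32,768, 32 KV heads, head dim 128, 2 bytes per scalar) and then doubles T. The function and argument names are illustrative, and the result is a formula-only estimate, not a measurement.

```python
def kv_cache_bytes(batch, n_layers, tokens, kv_heads, d_head, bytes_per_scalar):
    """Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes (formula only)."""
    return batch * n_layers * tokens * kv_heads * d_head * 2 * bytes_per_scalar


lab = dict(batch=4, n_layers=32, tokens=32_768, kv_heads=32, d_head=128,
           bytes_per_scalar=2)

base = kv_cache_bytes(**lab)
doubled = kv_cache_bytes(**{**lab, "tokens": 2 * lab["tokens"]})

print(f"{base / 1e9:.1f} GB")  # 68.7 GB
print(doubled / base)          # 2.0
```

Every symbol enters as a plain product, so doubling any single one of B, N_layers, T, H_kv, d_head, or bytes doubles the estimate.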


KV Memory Lab

Change the serving budget.

Predict first: A paper changes MHA to GQA or MQA. Which memory symbol should move first?

Hold B, N_layers, T, d_head, and bytes fixed. A model variant shares K/V heads. Commit to the term before touching the controls.

Choose an answer to unlock the calculator.

Context tokens: 32,768 · Layers: 32 · Batch: 4 · Query heads: 32 · KV heads: 32 · Head dim: 128

Real GQA usually uses KV-head counts that divide query heads; this slider shows cache-width scaling.

Estimated KV cache memory: 68.7 GB

0% smaller than full MHA under these settings.

MHA / GQA / MQA

Same attention equation, different cache width.

MHA: 32 KV heads. Each query head owns its KV head. 68.7 GB

GQA: 32 KV heads. Query heads share KV within groups. 68.7 GB

MQA: 1 KV head. All query heads share one KV head. 2.1 GB
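The three estimates above differ only in H_kv. The sketch below recomputes them with the lab's other settings held fixed and reports the reduction relative to MHA; the 8-KV-head GQA row is a hypothetical slider position added for contrast, and the helper names are illustrative.

```python
def kv_cache_gb(kv_heads, batch=4, n_layers=32, tokens=32_768, d_head=128,
                bytes_per_scalar=2):
    """Formula-only KV estimate in decimal GB for a given KV-head count."""
    total = batch * n_layers * tokens * kv_heads * d_head * 2 * bytes_per_scalar
    return total / 1e9


def percent_smaller_than_mha(kv_heads, query_heads=32):
    """Reduction relative to MHA, where MHA caches one KV head per query head."""
    return 100 * (1 - kv_heads / query_heads)


variants = [
    ("MHA", 32),
    ("GQA (lab slider at 32)", 32),
    ("GQA (hypothetical 8)", 8),
    ("MQA", 1),
]
for name, kv_heads in variants:
    print(f"{name}: {kv_heads} KV heads, {kv_cache_gb(kv_heads):.1f} GB, "
          f"{percent_smaller_than_mha(kv_heads):.0f}% smaller than MHA")
```

With the slider at 32 KV heads, GQA is numerically identical to MHA, which is why the lab reports 0% smaller until the KV-head count drops.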

Decoding Lab

Sampling controls change the next-token set.

Predict first: High temperature, top-k = 1: what survives?

The lab reshapes next-token probabilities with temperature, filters the candidate set, then renormalizes what remains. Predict the binding constraint before inspecting the token list.
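The three steps in that paragraph can be sketched as below. The order used here (temperature, then top-k, then top-p on the temperature-scaled probabilities, then renormalize) is an assumption about the lab, and the logits are invented; only the token names echo the lab's list.

```python
import math


def filter_next_tokens(logits, temperature, top_k, top_p):
    """Temperature-scale, keep top-k, keep the top-p nucleus, renormalize."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((tok, e / z) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    kept = ranked[:top_k]  # top-k: keep the k most likely tokens
    nucleus, cumulative = [], 0.0
    for tok, p in kept:    # top-p: smallest prefix reaching the mass target
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {tok: p / total for tok, p in nucleus}


# Invented logits for illustration only.
logits = {"cache": 2.0, "memory": 1.4, "latency": 0.9,
          "quality": 0.2, "paper": -0.3, "noise": -1.0}

print(filter_next_tokens(logits, temperature=0.8, top_k=4, top_p=0.90))
print(filter_next_tokens(logits, temperature=5.0, top_k=1, top_p=0.90))
# second call: {'cache': 1.0}
```

In the high-temperature, top-k = 1 case the flattened distribution does not matter: top-k is the binding constraint, and the single surviving token is renormalized to probability 1.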

Choose first; the probe will then set the sliders to the high-temperature, top-k = 1 case.

Temperature: 0.8 · Top-k: 4 · Top-p: 0.90

cache: 54%

memory: 29%

latency: 17%

quality: cut

paper: cut

noise: cut

Research Room

Keep the argument attached to the exact claim.

Pick a route object before asking for help. The selected paper claim, equation, lab, or misconception becomes the saved focus and companion context.

Claim: Paper claim

KV compression claim

Anchored question

What exactly is being compressed: heads, tokens, values, precision, or cache pages?

Local action draft: unavailable (needs a canonical object key)

This object needs a content object key before local action drafts can attach to it.

Draft note · Next action

No local draft saved.

Evidence to inspect

Exact source quote or local paper clue that motivates the claim
The equation, concept, or toy lab that could falsify the claim
Benchmarks, assumptions, and counterexamples that would change confidence

What would resolve this

The claim is either source-supported, weakened, or marked unverified
The mechanism is separated from benchmark or marketing language
The learner knows what evidence would raise or lower confidence

Grounded AI handoff

I am working in Continuous Function's research reading room.
Object: claim - KV compression claim
Context: Paper claim
Anchor id: claim/attention-serving/what-is-compressed
Open question: What exactly is being compressed: heads, tokens, values, precision, or cache pages?
Evidence to inspect:
- Exact source quote or local paper clue that motivates the claim
- The equation, concept, or toy lab that could falsify the claim
- Benchmarks, assumptions, and counterexamples that would change confidence
What would resolve this:
- The claim is either source-supported, weakened, or marked unverified
- The mechanism is separated from benchmark or marketing language
- The learner knows what evidence would raise or lower confidence
Answer as a careful research tutor: stay source-grounded, separate verified evidence from assumptions, name the relevant math objects, and end with one next action.

claim/attention-serving/what-is-compressed

AI Focus Object

Ask about one exact object.

The companion prompt follows this selection, so a question can attach to a stage, equation, lab checkpoint, or discussion anchor instead of floating over the whole page.

Equation: KV cache memory

Double the context and this term doubles. During decode it is touched on every generated token.

Ready to ask

Object Companion

Ask beside the selected object

Ask about the current paper claim, equation object, lab setting, saved observation, or discussion anchor. In static preview this remains a grounded prompt surface; when the gateway is configured it becomes live assistance.

Your question

Goal · Comfort · Style · Stuck on

Context prompt

You are my AI learning companion for Continuous Function.

Current context:
Study module: Attention -> Efficient Attention -> RoPE -> FlashAttention -> Long Context -> LLM Serving -> Decoding.
Learning surface: Attention To Serving route.
What this page says: Ask about the current paper claim, equation object, lab setting, saved observation, or discussion anchor. In static preview this remains a grounded prompt surface; when the gateway is configured it becomes live assistance.

Current section:
Active stage: Attention (content-addressed weighted copy)
Active equation: KV cache memory: Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes
Focused object: Equation - KV cache memory
Source: Efficient Attention / LLM Serving
Symbols: B scalar, N_layers scalar, T scalar, H_kv scalar, d_head scalar, 2 constant, bytes scalar
Stress test: Double the context and this term doubles. During decode it is touched on every generated token.
KV lab: B=4, T=32,768, layers=32, H_q=32, H_kv=32, d_head=128, precision=2 bytes/scalar
Current KV estimate: 68.7 GB; 0% smaller than full MHA under these settings.
Formula scope: Toy formula-only KV-memory estimate, not a live serving measurement. Excludes allocator behavior, paged-cache metadata, scheduler effects, tensor layout overhead, live throughput/latency, hardware sizing/deployment guidance, and current vLLM runtime evidence.
Route source scope: source-scoped in progress; 4 claim-checked stages, 3 source-mapped stages, 6 explicit gaps; this route is not measured against live serving systems.

Evaluation protocol: draft protocol note, not run, no scores reported; required fields Model + weights, Serving stack, Evaluator version, Scenario / task config, Metric + aggregation, Hardware / runtime, Data + contamination caveat, Toy vs production boundary; Level 4 contract serving-evaluation-result-contract-v0 has no scores reported; no benchmark comparison until the setup, result artifacts, and review path are pinned.

Lab ladder: 12 concepts mapped; Level 0 12 supported / 0 partial / 0 planned; Level 1 12 supported / 0 partial / 0 planned; Level 2 5 supported / 7 partial / 0 planned; Level 3 2 supported / 0 partial / 10 planned; Level 4 0 supported / 0 partial / 12 planned; Level 5 0 supported / 0 partial / 12 planned.
Runnable Level 2 scripts: kv-memory-witness, gqa-mqa-cache-width-witness, flash-attention-io-count-witness, serving-loop-toy-witness, cost-latency-observability-witness.
Runnable Level 3 experiments: kv-cache-width-sweep-v0.
Levels 4-5 gated/planned. Level 3 is partially supported for one local CPU seeded synthetic-request KV-cache formula experiment. Level 4 has a contract template only and no scored result; Levels 4-5 remain gated/planned. No hosted compute, benchmark result, shared GPU job, production validation, current runtime claim, hardware guidance, or live serving measurement is implied.

Contribution loop: 6 issue templates covering 7 feedback types (Confusing jump, Missing prerequisite, Source-scope issue, Code witness failed, Reproduction succeeded, Reproduction failed, Propose open task); required anchors feedback_type, room_id, route_id, route, object_anchor_type, room_object_id, object_ref, anchor_detail, current_question, summary, room_task_id, expected_resolution; templates confusing_jump.yml, missing_prerequisite.yml, source_scope_issue.yml, code_witness_failed.yml, reproduction_report.yml, proposed_open_task.yml; no social feed, generic chat, hosted compute, or automatic expert review.

Route progress: 0/7 stages ready; next repair Attention.
Paper evidence carried: none. No saved lab observation yet.
Suggested next step: Commit to the KV memory prediction, then change one serving variable at a time.
Learner goal: Understand the idea.
Learner comfort level: New to this.
Preferred explanation style: Visual first.

Task: Help me inspect a paper claim about KV cache compression. Identify which symbol or system bottleneck the claim changes, what remains fixed, and what evidence I should ask for before believing it. Answer in a way that helps me learn: ask one clarifying question only if needed, use intuition before notation, and end with one thing I should try on the page.