Study Module

Attention to serving, end to end

Move from the attention equation to KV cache memory, GQA/MQA tradeoffs, FlashAttention's memory schedule, long-context pressure, serving latency, and decoding behavior in one connected workspace.


[Figure: a physical KV memory lab with attention heads feeding a smaller cache. Panels: 01 QK^T, 02 KV cache, 03 GQA, 04 TPOT. Caption: decode memory, Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes]


One paper route from math to production.

This module reads transformer inference as one continuous mechanism: attention defines the copy operation, cache design decides what can be reused, and serving turns every symbol into memory, latency, and quality tradeoffs.

content-addressed weighted copy

Attention

Which previous tokens should this query copy from?

A weighted value vector per head.
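To make "content-addressed weighted copy" concrete, here is a minimal sketch of one query attending over a few cached key/value rows for a single head. The toy vectors and the function names `softmax` and `attend` are illustrative assumptions, not taken from the module.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head attention for one query: score keys, then mix values."""
    d = len(query)
    # Scaled dot product between the query and every cached key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the weight-averaged value vector (the "copy").
    width = len(values[0])
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(width)
    ]
    return weights, output


# Hypothetical toy data: three previous tokens, head width 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The query lines up with the first and third keys, so most of the weight, and therefore most of the copied value, comes from those two positions.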

Carried Equations

Every formula is a route object.

Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes

B: active batch size (scalar)
N_layers: number of transformer layers (scalar)
T: cached tokens per sequence (scalar)
H_kv: key/value heads after MHA, GQA, or MQA sharing (scalar)
d_head: width of each key/value head (scalar)
2: keys plus values are both cached (constant)
bytes: bytes per scalar for the chosen precision (scalar)

Source: Efficient Attention / LLM Serving

Double the context and this term doubles. During decode it is touched on every generated token.
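The equation can be checked by hand with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` is an illustrative assumption, and the inputs are the lab settings quoted on this page (B=4, 32 layers, T=32,768, H_kv=32, d_head=128, 2 bytes per scalar).

```python
def kv_cache_bytes(batch, n_layers, tokens, kv_heads, d_head, bytes_per_scalar):
    """Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes."""
    # The factor 2 accounts for caching both keys and values.
    return batch * n_layers * tokens * kv_heads * d_head * 2 * bytes_per_scalar


# Lab settings from this page.
base = kv_cache_bytes(4, 32, 32_768, 32, 128, 2)

# Stress test: double the context length and hold everything else fixed.
doubled = kv_cache_bytes(4, 32, 65_536, 32, 128, 2)

print(f"base:    {base / 1e9:.1f} GB")
print(f"doubled: {doubled / 1e9:.1f} GB")
print(f"ratio:   {doubled / base}")
```

Because every factor enters as a plain product, doubling any single symbol doubles the total; the context length T is simply the one that grows during a long request.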


KV Memory Lab

Change the serving budget.

Predict first: A paper changes MHA to GQA or MQA. Which memory symbol should move first?

Hold B, N_layers, T, d_head, and bytes fixed. A model variant shares K/V heads. Commit to the term before touching the controls.

Choose an answer to unlock the calculator.

Context tokens: 32,768
Layers: 32
Batch: 4
Query heads: 32
KV heads: 32
Head dim: 128

Note: Real GQA usually uses KV-head counts that divide query heads; this slider shows cache-width scaling.

Current KV cache: 68.7 GB

0% smaller than full MHA under these settings.

MHA / GQA / MQA

Same attention equation, different cache width.

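MHA: 32 KV heads
Each query head owns its KV head.
Cache: 68.7 GB

GQA: 32 KV heads
Query heads share KV within groups. With the KV-head slider at 32, each group holds one query head, so the cache matches MHA.
Cache: 68.7 GB

MQA: 1 KV head
All query heads share one KV head.
Cache: 2.1 GB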
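The comparison above can be sketched as a mapping from query heads to shared KV heads plus the resulting cache width. The helper names are hypothetical, the contiguous grouping (`query_head // group_size`) is one common convention assumed here rather than something the module specifies, and the GQA setting of 8 KV heads is an illustrative choice, not the slider value shown on the page.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("KV-head count should divide the query-head count")
    group_size = n_query_heads // n_kv_heads
    # Assumption: query heads are grouped contiguously.
    return query_head // group_size


def kv_cache_gb(n_kv_heads, batch=4, n_layers=32, tokens=32_768,
                d_head=128, bytes_per_scalar=2):
    """KV cache size in decimal GB for the lab settings on this page."""
    total = batch * n_layers * tokens * n_kv_heads * d_head * 2 * bytes_per_scalar
    return total / 1e9


N_QUERY_HEADS = 32
for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_gb(n_kv_heads)
    saving = 100 * (1 - n_kv_heads / N_QUERY_HEADS)
    owner = kv_head_for(5, N_QUERY_HEADS, n_kv_heads)
    print(f"{name}: {n_kv_heads:2d} KV heads, {size:5.1f} GB, "
          f"{saving:.1f}% smaller than MHA, query head 5 -> KV head {owner}")
```

Only H_kv changes across the three rows; the attention computation per query head is the same, which is why the saving shows up in cache width rather than in the attention equation.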

Decoding Lab

Sampling controls change the next-token set.

Predict first: High temperature, top-k = 1: what survives?

The lab reshapes next-token probabilities with temperature, filters the candidate set, then renormalizes what remains. Predict the binding constraint before inspecting the token list.

Choose first; the probe will then set the sliders to the high-temperature, top-k = 1 case.

Temperature: 0.8
Top-k: 4
Top-p: 0.90

Surviving next-token candidates after filtering and renormalizing:

cache: 54%
memory: 29%
latency: 17%
quality: cut
paper: cut
noise: cut
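The reshape, filter, renormalize sequence described for this lab can be sketched as follows. The logits are hypothetical (the page shows only the resulting percentages, so this will not reproduce them), the function name is illustrative, and applying top-k before top-p on the temperature-scaled probabilities is an assumption; real decoders differ in ordering and tie handling.

```python
import math


def next_token_distribution(logits, temperature, top_k, top_p):
    """Temperature-scale, apply top-k then top-p, and renormalize."""
    # 1. Temperature reshapes the distribution (higher = flatter).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # 2. Top-k keeps only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    ranked = ranked[:top_k]

    # 3. Top-p keeps the smallest prefix whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # 4. Renormalize whatever survived.
    kept_mass = sum(p for _, p in kept)
    return {token: p / kept_mass for token, p in kept}


# Hypothetical logits for the six tokens shown in the lab.
logits = {"cache": 2.0, "memory": 1.4, "latency": 0.9,
          "quality": 0.2, "paper": -0.5, "noise": -1.0}

lab_settings = next_token_distribution(logits, temperature=0.8, top_k=4, top_p=0.90)
probe = next_token_distribution(logits, temperature=5.0, top_k=1, top_p=0.90)

print(lab_settings)
print(probe)
```

In the probe case, top-k = 1 is the binding constraint: however much temperature flattens the distribution, only the single highest-ranked token survives the filter, and renormalizing gives it all the probability mass.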

Research Room

Keep the argument attached to the exact claim.

Pick a route object before asking for help. The selected paper claim, equation, lab, or misconception becomes the saved focus and companion context.

Object type: claim. Context: Paper claim.

KV compression claim

Anchored question

What exactly is being compressed: heads, tokens, values, precision, or cache pages?


Evidence to inspect

Exact source quote or local paper clue that motivates the claim
The equation, concept, or toy lab that could falsify the claim
Benchmarks, assumptions, and counterexamples that would change confidence

What would resolve this

The claim is either source-supported, weakened, or marked unverified
The mechanism is separated from benchmark or marketing language
The learner knows what evidence would raise or lower confidence

Grounded AI handoff

I am working in Continuous Function's research reading room.

Object: claim - KV compression claim
Context: Paper claim
Anchor id: claim/attention-serving/what-is-compressed
Open question: What exactly is being compressed: heads, tokens, values, precision, or cache pages?

Evidence to inspect:
- Exact source quote or local paper clue that motivates the claim
- The equation, concept, or toy lab that could falsify the claim
- Benchmarks, assumptions, and counterexamples that would change confidence

What would resolve this:
- The claim is either source-supported, weakened, or marked unverified
- The mechanism is separated from benchmark or marketing language
- The learner knows what evidence would raise or lower confidence

Answer as a careful research tutor: stay source-grounded, separate verified evidence from assumptions, name the relevant math objects, and end with one next action.


AI Focus Object

Ask about one exact object.

The companion prompt follows this selection, so a question can attach to a stage, equation, lab checkpoint, or discussion anchor instead of floating over the whole page.

Equation: KV cache memory

Double the context and this term doubles. During decode it is touched on every generated token.


Object Companion

Ask beside the selected object

Ask about the current paper claim, equation object, lab setting, saved observation, or discussion anchor. In static preview this remains a grounded prompt surface; when the gateway is configured it becomes live assistance.


Context prompt

You are my AI learning companion for Continuous Function.

Current context:
Study module: Attention -> Efficient Attention -> RoPE -> FlashAttention -> Long Context -> LLM Serving -> Decoding.
Learning surface: Attention to serving route.
What this page says: Ask about the current paper claim, equation object, lab setting, saved observation, or discussion anchor. In static preview this remains a grounded prompt surface; when the gateway is configured it becomes live assistance.

Current section:
Active stage: Attention (content-addressed weighted copy)
Active equation: KV cache memory: Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes
Focused object: Equation - KV cache memory
Equation: KV cache memory, Mem_KV = B * N_layers * T * H_kv * d_head * 2 * bytes
Source: Efficient Attention / LLM Serving
Symbols: B scalar, N_layers scalar, T scalar, H_kv scalar, d_head scalar, 2 constant, bytes scalar
Stress test: Double the context and this term doubles. During decode it is touched on every generated token.
KV lab: B=4, T=32,768, layers=32, H_q=32, H_kv=32, d_head=128, precision=2 bytes/scalar
Current KV estimate: 68.7 GB; 0% smaller than full MHA under these settings.
Route progress: 0/7 stages ready; next repair Attention.
Paper evidence carried: none.
No saved lab observation yet.
Suggested next step: Commit to the KV memory prediction, then change one serving variable at a time.

Learner goal: Understand the idea.
Learner comfort level: New to this.
Preferred explanation style: Visual first.

Task: Help me inspect a paper claim about KV cache compression. Identify which symbol or system bottleneck the claim changes, what remains fixed, and what evidence I should ask for before believing it.

Answer in a way that helps me learn: ask one clarifying question only if needed, use intuition before notation, and end with one thing I should try on the page.