Production ML

Cost and Latency Observability

Cost and latency observability turns serving traces into pinned evidence: workload shape, token counts, tail latency, queue/cache signals, SLOs, pricing dates, and report artifacts.

Status: published | Importance: critical | Difficulty: 4/5 | Math: undergraduate | Read: 16 min | Live demo




Claim/source review status

Substantive review recorded

1/1 claims have bounded review metadata; still check caveats and source scope. Metadata-derived; review may be AI-assisted. Not a human certification.

Claims: 1/1 reviewed
Sources: 6 cited
Code: attached
Demo: live
Reviewed: 2026-06-30
Updated: page 2026-06-30


Intuition

Build the mental picture first so the rest of the page has something to attach to.


A serving dashboard can make a system feel measurable before it is actually comparable. You see one latency number, a token count, maybe a bill, and it is tempting to say, "Variant B is cheaper and faster."

The missing question is: cheaper and faster under what measurement contract?

For LLM serving, the contract includes the request mix, input and output token counts, runtime/config scope, required telemetry fields, aggregation rule, request-level goodput thresholds, pricing or resource-rate card, and measurement window. The report artifact then carries that contract fingerprint, the trace rows, the computed metrics, and the caveats. If a contract field moves quietly, the cost or latency number is answering a different question.

Cost and latency observability is not just dashboard decoration. It is evaluation-pipeline discipline for systems tradeoffs. A comparable claim says: here is the workload, here is the trace, here are the latency distributions, here is the token/resource accounting, here is the dated pricing card or resource-rate model, here is the report fingerprint, and here is what this evidence cannot prove.

This page is deliberately not a provider cost calculator and not a live vLLM benchmark. The Level 2 witness at content/research-rooms/attention-to-serving/level-2/cost_latency_observability_witness.py is a synthetic trace-contract artifact. It teaches what has to be pinned before a faster/cheaper serving claim becomes evidence.

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


Let a serving-observability contract be

$$\mathcal C = (W, R, T, A, S, P).$$

Here $W$ is the workload shape, including request IDs and input/output token counts; $R$ is the runtime and configuration scope; $T$ is the required telemetry fields, including which fields are observed and which are derived; $A$ is the aggregation rule; $S$ is the goodput/SLO threshold contract; and $P$ is a dated pricing card or resource-rate model. The contract fingerprint is

$$h_{\mathcal C} = H(\operatorname{canon}(\mathcal C)).$$
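A minimal sketch of the fingerprint, under stated assumptions: `canon` is taken to be JSON with sorted keys and `H` is SHA-256 truncated to 12 hex characters. The field names `runtime` and `telemetry`, and the value `toy-engine`, are hypothetical stand-ins for $R$ and $T$, not names from any real system.

```python
import hashlib
import json

def canon(obj):
    """Assumed canonical form: JSON with sorted keys and fixed separators."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def fingerprint(obj):
    """Assumed H: SHA-256 over the canonical bytes, truncated for display."""
    return hashlib.sha256(canon(obj).encode("utf-8")).hexdigest()[:12]

contract = {
    "workload": [["r1", 420, 48], ["r2", 760, 64]],             # W
    "runtime": {"engine": "toy-engine", "config": "default"},   # R (hypothetical)
    "telemetry": ["ttft", "tpot", "queue", "cache"],            # T
    "aggregation": "nearest_rank_p50_p95",                      # A
    "slo_ms": {"ttft": 500, "tpot": 80},                        # S
    "pricing": {"date": "2026-06-30", "in_1k": 0.001, "out_1k": 0.0025},  # P
}

# Same content with keys in a different order: canonicalization should hide this.
reordered = dict(reversed(list(contract.items())))

# One output-token count changed: this is a different workload, so a different contract.
changed = json.loads(canon(contract))
changed["workload"][0][2] = 96

h = fingerprint(contract)
print(h)
print("key order matters:", fingerprint(reordered) != h)
print("workload edit detected:", fingerprint(changed) != h)
```

The point of canonicalizing before hashing is that cosmetic differences leave the fingerprint alone, while any change to a contract field produces a new one.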

The report artifact is separate:

$$G = (h_{\mathcal C}, \{r_i\}_{i=1}^N, M, E),$$

where $M$ is the metric table and $E$ is the evidence caveat. The report fingerprint is

$$h_G = H(\operatorname{canon}(G)).$$
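To make the separation between contract and report concrete, here is a small sketch. It assumes the same key-sorted JSON plus truncated SHA-256 fingerprint; `build_report` and the one-row traces are illustrative, not part of the Level 2 witness.

```python
import hashlib
import json

def fingerprint(obj):
    """Assumed H over key-sorted JSON, truncated to 12 hex characters."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def build_report(contract_hash, rows, metrics, caveat):
    """Report artifact G = (h_C, rows, M, E) as a plain dict (hypothetical layout)."""
    return {"contract": contract_hash, "rows": rows, "metrics": metrics, "caveat": caveat}

# A deliberately tiny contract; only its fingerprint is carried into the report.
contract_hash = fingerprint({"aggregation": "nearest_rank_p95", "slo_ms": {"ttft": 500}})

baseline = build_report(contract_hash, [{"id": "r1", "ttft": 180}], {"p95_ttft": 180}, "toy trace only")
variant = build_report(contract_hash, [{"id": "r1", "ttft": 175}], {"p95_ttft": 175}, "toy trace only")

# Same contract fingerprint: the two reports answer the same measurement question.
comparable = baseline["contract"] == variant["contract"]
# Different report fingerprints: the measured evidence itself differs.
distinct = fingerprint(baseline) != fingerprint(variant)
print(comparable, distinct)
```

Two reports are candidates for comparison when their contract fingerprints match; their report fingerprints are expected to differ because the traces differ.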

For request $i$, a minimal trace row can include

$$r_i = (n^{in}_i, n^{out}_i, \mathrm{TTFT}_i, \mathrm{TPOT}_i, \mathrm{E2E}_i, q_i, k_i).$$

$n^{in}_i$ and $n^{out}_i$ are input and output token counts, $\mathrm{TTFT}_i$ is time to first token, $\mathrm{TPOT}_i$ is time per output token or inter-token latency, $\mathrm{E2E}_i$ is end-to-end latency, $q_i$ is queue time, and $k_i$ is a cache or resource-pressure signal.

One toy latency decomposition is

$$\widetilde{\mathrm{E2E}}_i \approx \mathrm{TTFT}_i + (n^{out}_i - 1)\mathrm{TPOT}_i.$$

That derived value is a teaching handle, not an observed production profiler field. Real systems add streaming behavior, tokenization, network overhead, retries, tool calls, scheduling, preemption, batching, and provider/runtime implementation details.
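As a worked instance of the toy decomposition, the sketch below plugs in two synthetic requests (TTFT and TPOT in milliseconds). The function name `derived_e2e_ms` is illustrative, and the `max(0, ...)` guard for zero-output requests is an assumption of this sketch.

```python
def derived_e2e_ms(ttft_ms, tpot_ms, n_out):
    """Toy estimate: first token arrives at TTFT, then one TPOT per remaining output token."""
    return ttft_ms + max(0, n_out - 1) * tpot_ms

# Short request: 48 output tokens at 32 ms per token after a 180 ms first token.
short_request = derived_e2e_ms(ttft_ms=180, tpot_ms=32, n_out=48)
# Long request: 120 output tokens at 95 ms per token after a 760 ms first token.
long_request = derived_e2e_ms(ttft_ms=760, tpot_ms=95, n_out=120)

print(short_request)  # 1684
print(long_request)   # 12065
```

For the long request the decode term dominates: 119 × 95 ms = 11305 ms of the 12065 ms total comes after the first token.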

For a percentile $p$, write

$$Q_p(X)=\operatorname{quantile}_p(\{X_i\}_{i=1}^N).$$
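The quantile operator needs a concrete convention before it is an aggregation rule. The sketch below assumes the nearest-rank convention (value at rank $\lceil pN/100 \rceil$ in sorted order) and contrasts it with a plain average on five synthetic TTFT values.

```python
import math

def nearest_rank(values, p):
    """Nearest-rank percentile: the value at rank ceil(p * N / 100) in sorted order."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(0, rank - 1)]

ttft_ms = [180, 240, 310, 420, 760]  # synthetic TTFT samples, milliseconds

average = sum(ttft_ms) / len(ttft_ms)
p50 = nearest_rank(ttft_ms, 50)
p95 = nearest_rank(ttft_ms, 95)

print(average)  # 382.0
print(p50)      # 310
print(p95)      # 760
```

The average of 382 ms says nothing about the 760 ms request; the p95 does. Other quantile conventions (for example, linear interpolation) would give different numbers on the same data, which is why the rule belongs in the contract.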

The goodput under SLOs can be written as

$$\mathrm{Goodput}(\mathcal C) = \frac{1}{N} \sum_{i=1}^{N} \mathbf 1\left[ \mathrm{TTFT}_i \le s_{\mathrm{TTFT}} \land \mathrm{TPOT}_i \le s_{\mathrm{TPOT}} \land \widetilde{\mathrm{E2E}}_i \le s_{\mathrm{E2E}} \right].$$
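A short sketch of the goodput indicator on synthetic rows. The threshold values and the row layout are assumptions chosen for illustration; a request counts only if it meets every threshold at once.

```python
# Assumed SLO thresholds in milliseconds (illustrative values).
SLO_MS = {"ttft": 500, "tpot": 80, "derived_e2e": 12000}

# Synthetic per-request rows; derived_e2e = ttft + (n_out - 1) * tpot was computed beforehand.
rows = [
    {"id": "r1", "ttft": 180, "tpot": 32, "derived_e2e": 1684},
    {"id": "r2", "ttft": 240, "tpot": 36, "derived_e2e": 2508},
    {"id": "r3", "ttft": 310, "tpot": 42, "derived_e2e": 3628},
    {"id": "r4", "ttft": 420, "tpot": 55, "derived_e2e": 2125},
    {"id": "r5", "ttft": 760, "tpot": 95, "derived_e2e": 12065},
]

def meets_slo(row, slo):
    """True only if every thresholded field is within its limit (logical AND)."""
    return all(row[name] <= limit for name, limit in slo.items())

def goodput(rows, slo):
    """Fraction of requests that meet all SLO thresholds."""
    return sum(meets_slo(row, slo) for row in rows) / len(rows)

misses = [row["id"] for row in rows if not meets_slo(row, SLO_MS)]
print(goodput(rows, SLO_MS))  # 0.8
print(misses)                 # ['r5']
```

Because the indicator is a conjunction, a request that is fast to first token but slow per token still counts as a miss.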

A toy API-cost readout might be

$$\widehat C_{\mathrm{api}} = \alpha_{in}\sum_i \frac{n^{in}_i}{1000} + \alpha_{out}\sum_i \frac{n^{out}_i}{1000},$$

where $\alpha_{in}$ and $\alpha_{out}$ are per-1,000-token prices taken from a dated pricing card. That is not the full self-hosted cost. Self-hosted infrastructure cost can also include idle GPUs, retries, failed requests, storage, networking, engineering time, orchestration, observability overhead, and utilization. If the workload $W$, runtime scope $R$, telemetry requirement $T$, aggregation $A$, SLO $S$, or pricing/resource model $P$ changes, $h_{\mathcal C}$ changes and the comparison becomes a new measurement question.
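The toy cost readout can be sketched as follows. The prices are made-up per-1,000-token rates, not any provider's actual pricing, and the choice to raise an error on an undated card is an assumption of this sketch.

```python
def toy_api_cost(token_counts, pricing):
    """Toy API cost from per-1k-token prices; refuses an undated pricing card."""
    if not pricing.get("date"):
        raise ValueError("pricing card has no date; the cost claim is not pinned")
    total_in = sum(n_in for n_in, _ in token_counts)
    total_out = sum(n_out for _, n_out in token_counts)
    return pricing["in_1k"] * total_in / 1000 + pricing["out_1k"] * total_out / 1000

# Synthetic workload: (input tokens, output tokens) per request.
token_counts = [(420, 48), (760, 64), (1180, 80), (390, 32), (1560, 120)]
# Made-up dated price card (illustrative rates only).
pricing = {"date": "2026-06-30", "in_1k": 0.001, "out_1k": 0.0025}

cost = toy_api_cost(token_counts, pricing)
print(round(cost, 6))  # 0.00517

# Dropping the date turns the same arithmetic into an unpinned claim.
undated = {"in_1k": 0.001, "out_1k": 0.0025}
try:
    toy_api_cost(token_counts, undated)
except ValueError as err:
    print("rejected:", err)
```

The arithmetic: 4,310 input tokens cost 4.31 × 0.001 = 0.00431 and 344 output tokens cost 0.344 × 0.0025 = 0.00086, for 0.00517 in total. Note that this number depends only on token counts and prices, so two traces over the same workload get the same toy cost regardless of latency.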

Code

Keep the implementation aligned with the notation so the algorithm is legible.


This short witness mirrors the math: one fixed contract, two traces, tail metrics, toy cost, goodput, and invalid-comparison cases.

import hashlib, json, math

# Workload W: (request id, input tokens, output tokens).
requests = [("r1", 420, 48), ("r2", 760, 64), ("r3", 1180, 80), ("r4", 390, 32), ("r5", 1560, 120)]
# Pinned measurement contract: workload, aggregation rule, dated toy price card, SLO thresholds.
contract = {
    "workload": requests,
    "aggregation": "nearest_rank_p50_p95",
    "pricing_date": "2026-06-30",
    "price": {"in_1k": 0.001, "out_1k": 0.0025},
    "slo_ms": {"ttft": 500, "tpot": 80, "derived_e2e": 12000},
}
# Synthetic traces, one row per request: (TTFT ms, TPOT ms, cache GB, queue ms).
baseline = [(180, 32, .18, 20), (240, 36, .27, 35), (310, 42, .41, 70), (420, 55, .25, 130), (760, 95, .88, 260)]
variant = [(175, 29, .17, 16), (218, 34, .25, 30), (285, 38, .39, 62), (350, 46, .22, 90), (490, 78, .72, 180)]

def fp(x):
    # Short fingerprint: SHA-256 over key-sorted JSON, truncated to 12 hex characters.
    return hashlib.sha256(json.dumps(x, sort_keys=True).encode()).hexdigest()[:12]

def q(vals, p):
    # Nearest-rank percentile: the value at rank ceil(p/100 * N) in sorted order.
    vals = sorted(vals)
    return vals[max(0, math.ceil(p / 100 * len(vals)) - 1)]

def summarize(trace):
    rows = []
    for (rid, n_in, n_out), (ttft, tpot, cache_gb, queue_ms) in zip(requests, trace):
        # Derived (not observed) end-to-end latency: TTFT plus one TPOT per remaining output token.
        derived_e2e = ttft + max(0, n_out - 1) * tpot
        # Toy API cost from the dated per-1k-token price card.
        cost = n_in / 1000 * contract["price"]["in_1k"] + n_out / 1000 * contract["price"]["out_1k"]
        rows.append(dict(id=rid, ttft=ttft, tpot=tpot, derived_e2e=derived_e2e, cache=cache_gb, queue=queue_ms, cost=cost))
    slo = contract["slo_ms"]
    # A request counts toward goodput only if it meets all three thresholds.
    good = [r for r in rows if r["ttft"] <= slo["ttft"] and r["tpot"] <= slo["tpot"] and r["derived_e2e"] <= slo["derived_e2e"]]
    metrics = {
        "p95_ttft": q([r["ttft"] for r in rows], 95),
        "p95_tpot": q([r["tpot"] for r in rows], 95),
        "goodput": len(good) / len(rows),
        "toy_cost": round(sum(r["cost"] for r in rows), 6),
    }
    # The report carries the contract fingerprint, the rows, the metrics, and the caveat.
    report = {"contract": fp(contract), "rows": rows, "metrics": metrics, "caveat": "toy trace only"}
    return {**metrics, "contract": fp(contract), "report": fp(report)}

print("baseline", summarize(baseline))
print("variant", summarize(variant))
# The invalid-comparison cases are stated here, not computed.
print("changed workload => new measurement question")
print("missing pricing date => invalid cost claim")
print("average-only latency => under-specified claim")

The full Level 2 witness also writes a report fingerprint, queue/cache saturation notes, explicit candidate verdicts, and checks against an expected JSON artifact.

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


Prediction check: use the widget below to decide whether the candidate evidence supports a comparable trace, new measurement question, or under-specified claim.

The pinned card names a five-request workload, TTFT/TPOT/derived-e2e/queue/cache telemetry, nearest-rank p50/p95 aggregation, request-level goodput thresholds, and a dated toy pricing card. Each case changes the evidence surface in a different way: one has the same measurement contract but a new measured trace, one quietly changes a request shape, one removes the dated price model, and one gives only average latency.

Commit to a verdict before reveal. The demo hides the contract hash, p95 values, goodput, toy cost, and report fingerprint until you choose. The sharp habit is to ask: which field in the measurement contract moved before the conclusion moved?
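The four cases can be sketched as a small verdict function. This is not the demo's actual scoring logic, which is not shown on this page: the function `verdict`, the metric-name convention, and the decision to treat a missing pricing date as under-specified (rather than as a new measurement question) are all assumptions of this sketch.

```python
import copy
import hashlib
import json

def fingerprint(obj):
    """Assumed contract hash: SHA-256 over key-sorted JSON, truncated."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()[:12]

PINNED = {
    "workload": [["r1", 420, 48], ["r2", 760, 64], ["r3", 1180, 80], ["r4", 390, 32], ["r5", 1560, 120]],
    "aggregation": "nearest_rank_p50_p95",
    "pricing_date": "2026-06-30",
    "price": {"in_1k": 0.001, "out_1k": 0.0025},
    "slo_ms": {"ttft": 500, "tpot": 80, "derived_e2e": 12000},
}

def verdict(pinned, candidate_contract, reported_metrics):
    """Classify candidate evidence against the pinned contract (hypothetical rules)."""
    has_tail = any(name.startswith("p95") for name in reported_metrics)
    # Assumption: missing required evidence is checked before the fingerprint.
    if not candidate_contract.get("pricing_date") or not has_tail:
        return "under-specified claim"
    if fingerprint(candidate_contract) != fingerprint(pinned):
        return "new measurement question"
    return "comparable trace"

tail_metrics = {"p95_ttft": 490, "p95_tpot": 78}

same_contract = copy.deepcopy(PINNED)
changed_shape = copy.deepcopy(PINNED)
changed_shape["workload"][4] = ["r5", 1560, 60]   # one request quietly got shorter
no_price_date = copy.deepcopy(PINNED)
del no_price_date["pricing_date"]

print(verdict(PINNED, same_contract, tail_metrics))           # comparable trace
print(verdict(PINNED, changed_shape, tail_metrics))           # new measurement question
print(verdict(PINNED, no_price_date, tail_metrics))           # under-specified claim
print(verdict(PINNED, same_contract, {"mean_ttft": 303.6}))   # under-specified claim
```

The ordering of the checks is a design choice: evidence that is missing a required field is rejected before asking whether the contract moved.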


Source Grounding

Canonical references for the mechanism on this page.

documentation · 2026 · vLLM Documentation: Metrics · vLLM Project

Source for vLLM request/server metric vocabulary: TTFT, inter-token latency/TPOT, e2e latency, prompt and generation tokens, queue/prefill/decode intervals, running/waiting requests, and KV cache usage.

documentation · 2026 · vLLM CLI Reference: vllm bench serve · vLLM Project

Source for percentile-metric names and request-level goodput SLO vocabulary in the vLLM benchmark CLI.

documentation · 2026 · OpenTelemetry Metrics Data Model · OpenTelemetry

Source for metrics as streams/time series, sums, gauges, histograms, temporality, and reaggregation concerns.

documentation · 2026 · Prometheus Documentation: Histograms and summaries · Prometheus

Source for histogram/summary quantile and aggregation caveats, including why precomputed quantiles are not cleanly aggregatable.

book · 2016 · Site Reliability Engineering: Monitoring Distributed Systems · Beyer, Jones, Petoff, and Murphy, editors

Source for latency, traffic, errors, and saturation as core monitoring signals, plus the warning against relying on simple averages.

documentation · 2026 · MLPerf Inference: Datacenter · MLCommons

Source for benchmark protocol discipline around inference scenarios, latency constraints, throughput metrics, power measurement, and compliance caveats.

Claim Review


Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 6 references

vllm-v1-metrics-design, vllm-bench-serve, opentelemetry-metrics-data-model, prometheus-histograms-summaries, google-sre-four-golden-signals, mlperf-inference-datacenter

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed (claim metadata: source checked): A serving cost/latency comparison is interpretable only to the extent that workload, telemetry, aggregation, goodput/SLO thresholds, pricing/resource card, window, report artifact, and caveats are explicit enough to scope the claim.

The sources jointly motivate treating serving observability as pinned, scoped trace evidence: named latency/token/cache metrics, request-level goodput thresholds, histogram/quantile aggregation rules, monitoring signals, and benchmark-protocol discipline.

Sources: vLLM Documentation: Metrics, vLLM CLI Reference: vllm bench serve, OpenTelemetry Metrics Data Model, Prometheus Documentation: Histograms and summaries, Site Reliability Engineering: Monitoring Distributed Systems, MLPerf Inference: Datacenter

The page teaches a synthetic trace contract and toy accounting witness. It does not claim live serving measurement, provider price accuracy, vLLM performance, production cost savings, hardware sizing, autoscaling guidance, benchmark standing, or capacity planning.

A bounded review summary is present; still check caveats and exact source scope.

Primary-source review plus GPT-5.5 xhigh adversarial review support this caveated teaching synthesis after fixing witness SLO naming, derived-e2e semantics, stale expected JSON, and the same-workload cost overclaim. Details saved in responses/cost-latency-observability-source-support-review-20260630.md. In-app GPT Pro was unavailable because no browser target was exposed.

Reviewer: codex-primary-source-audit+gpt-5.5-xhigh-subagent; reviewed 2026-06-30

source-span-vllm-v1-metrics-design, source-span-opentelemetry-metrics-data-model, source-span-prometheus-histograms-summaries, math-object-1, math-object-2, code-witness-1, interactive-demo



Domain

Production ML

production-ml, observability, serving, latency, cost, evaluation

Prerequisites

LLM Serving at Scale: Prefill, Decode & Continuous Batching
Evaluation Pipelines
Tokenization & Vocabulary Design

Leads To

Llmops Prompt Rag Agent Tool Eval (planned)

Dataset Versioning (review)
Evaluation Harnesses and Benchmark Contamination
Long Context Engineering: RoPE Scaling, KV Compression & Memory Optimization