Production ML

Cost and Latency Observability

Cost and latency observability turns serving traces into pinned evidence: workload shape, token counts, tail latency, queue/cache signals, SLOs, pricing dates, and report artifacts.

status: published | importance: critical | difficulty: 4/5 | math: undergraduate | read: 16 min | live demo

Claim/source review status

Substantive review recorded

1/1 claims have bounded review metadata; still check caveats and source scope. Metadata-derived; review may be AI-assisted. Not a human certification.
Claims: 1/1 reviewed
Sources: 6 cited
Code: attached
Demo: live
Reviewed: 2026-06-30
Updated: page 2026-06-30

01

Intuition

Build the mental picture first so the rest of the page has something to attach to.

A serving dashboard can make a system feel measurable before it is actually comparable. You see one latency number, a token count, maybe a bill, and it is tempting to say, "Variant B is cheaper and faster."

The missing question is: cheaper and faster under what measurement contract?

For LLM serving, the contract includes the request mix, input and output token counts, runtime/config scope, required telemetry fields, aggregation rule, request-level goodput thresholds, pricing or resource-rate card, and measurement window. The report artifact then carries that contract fingerprint, the trace rows, the computed metrics, and the caveats. If a contract field moves quietly, the cost or latency number is answering a different question.

Cost and latency observability is not just dashboard decoration. It is evaluation-pipeline discipline for systems tradeoffs. A comparable claim says: here is the workload, here is the trace, here are the latency distributions, here is the token/resource accounting, here is the dated pricing card or resource-rate model, here is the report fingerprint, and here is what this evidence cannot prove.

This page is deliberately not a provider cost calculator and not a live vLLM benchmark. The Level 2 witness at content/research-rooms/attention-to-serving/level-2/cost_latency_observability_witness.py is a synthetic trace-contract artifact. It teaches what has to be pinned before a faster/cheaper serving claim becomes evidence.

02

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.

Let a serving-observability contract be

$$\mathcal C = (W, R, T, A, S, P).$$

Here $W$ is the workload shape, including request IDs and input/output token counts; $R$ is the runtime and configuration scope; $T$ is the required telemetry fields, including which fields are observed and which are derived; $A$ is the aggregation rule; $S$ is the goodput/SLO threshold contract; and $P$ is a dated pricing card or resource-rate model. The contract fingerprint is

$$h_{\mathcal C} = H(\operatorname{canon}(\mathcal C)).$$
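As a rough illustration of the fingerprint, the sketch below takes $\operatorname{canon}$ to be key-sorted JSON and $H$ to be SHA-256, which matches the choice in this page's code section but is not fixed by the math. The contract values, the `hypothetical-server` runtime name, and the `max_batch` field are invented for illustration.

```python
import hashlib
import json

def canon(obj):
    # Canonical form (an assumption of this sketch): key-sorted JSON with fixed
    # separators, so equal contracts always serialize to the same bytes.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

def fingerprint(obj):
    # H is taken to be SHA-256 here; the math above leaves H abstract.
    return hashlib.sha256(canon(obj)).hexdigest()

# Hypothetical contract with one entry per component of C = (W, R, T, A, S, P).
contract = {
    "W": [["r1", 420, 48], ["r2", 760, 64]],  # request id, input tokens, output tokens
    "R": {"runtime": "hypothetical-server", "max_batch": 8},
    "T": {"observed": ["ttft", "tpot", "queue"], "derived": ["e2e"]},
    "A": "nearest_rank_p50_p95",
    "S": {"ttft_ms": 500, "tpot_ms": 80, "derived_e2e_ms": 12000},
    "P": {"date": "2026-06-30", "in_1k": 0.001, "out_1k": 0.0025},
}

# Same content, different key insertion order: canonicalization hides the difference.
reordered = dict(reversed(list(contract.items())))

# One output-token count moved quietly: the fingerprint must change.
moved = json.loads(json.dumps(contract))
moved["W"][1][2] = 96

h_same = fingerprint(contract) == fingerprint(reordered)
h_diff = fingerprint(contract) != fingerprint(moved)
print(h_same, h_diff)
```

The point of the canonical step is that only a change in content, not in key order or whitespace, should turn the comparison into a new measurement question.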

The report artifact is separate:

$$G = (h_{\mathcal C}, \{r_i\}_{i=1}^N, M, E),$$

where $\{r_i\}_{i=1}^N$ are the trace rows, $M$ is the metric table, and $E$ is the evidence caveat. The report fingerprint is

$$h_G = H(\operatorname{canon}(G)).$$
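To make the separation concrete, here is a minimal sketch in which two traces share one contract fingerprint but produce different report fingerprints. The field names (`ttft_ms`, `max_ttft_ms`), the two-row traces, and the use of SHA-256 over key-sorted JSON are assumptions for illustration.

```python
import hashlib
import json

def fingerprint(obj):
    # Assumed hash: SHA-256 over key-sorted JSON.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()

# Hypothetical pinned contract; only its fingerprint enters the report.
contract = {"workload": [["r1", 420, 48], ["r2", 760, 64]], "aggregation": "nearest_rank_p50_p95"}
h_c = fingerprint(contract)

def build_report(rows):
    # G = (h_C, rows, M, E): contract fingerprint, trace rows, metric table, caveat.
    metrics = {"max_ttft_ms": max(r["ttft_ms"] for r in rows)}
    return {"contract": h_c, "rows": rows, "metrics": metrics, "caveat": "toy trace only"}

report_a = build_report([{"id": "r1", "ttft_ms": 180}, {"id": "r2", "ttft_ms": 240}])
report_b = build_report([{"id": "r1", "ttft_ms": 175}, {"id": "r2", "ttft_ms": 218}])

same_contract = report_a["contract"] == report_b["contract"]
different_reports = fingerprint(report_a) != fingerprint(report_b)
print(same_contract, different_reports)
```

Equal $h_{\mathcal C}$ with unequal $h_G$ is the comparable case: the question stayed fixed and only the measured evidence changed.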

For request ii, a minimal trace row can include

$$r_i = (n^{in}_i, n^{out}_i, \mathrm{TTFT}_i, \mathrm{TPOT}_i, \mathrm{E2E}_i, q_i, k_i).$$

$n^{in}_i$ and $n^{out}_i$ are input and output token counts, $\mathrm{TTFT}_i$ is time to first token, $\mathrm{TPOT}_i$ is time per output token or inter-token latency, $\mathrm{E2E}_i$ is end-to-end latency, $q_i$ is queue time, and $k_i$ is a cache or resource-pressure signal.

One toy latency decomposition is

$$\widetilde{\mathrm{E2E}}_i \approx \mathrm{TTFT}_i + (n^{out}_i - 1)\mathrm{TPOT}_i.$$

That derived value is a teaching handle, not an observed production profiler field. Real systems add streaming behavior, tokenization, network overhead, retries, tool calls, scheduling, preemption, batching, and provider/runtime implementation details.
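A quick worked instance of the decomposition, using the first baseline row from this page's code (TTFT 180 ms, TPOT 32 ms, 48 output tokens). The function name `derived_e2e_ms` is hypothetical.

```python
def derived_e2e_ms(ttft_ms, tpot_ms, n_out):
    # The first token arrives at TTFT; each of the remaining n_out - 1 tokens
    # is assumed to add exactly one TPOT. max(0, ...) guards n_out == 0.
    return ttft_ms + max(0, n_out - 1) * tpot_ms

example = derived_e2e_ms(ttft_ms=180, tpot_ms=32, n_out=48)
single = derived_e2e_ms(ttft_ms=180, tpot_ms=32, n_out=1)

print(example)  # 180 + 47 * 32 = 1684
print(single)   # a one-token response costs only the TTFT: 180
```

Because the value is computed from two other telemetry fields, it inherits their errors and should be labeled as derived wherever it appears in a report.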

For a percentile $p$, write

$$Q_p(X)=\operatorname{quantile}_p(\{X_i\}_{i=1}^N).$$
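The definition above leaves the quantile rule open, which is why the aggregation rule $A$ belongs in the contract. As a sketch, assuming the nearest-rank rule named in this page's code, the baseline TTFT values give a tail far from the mean:

```python
import math

def nearest_rank(values, p):
    # Nearest-rank percentile: the ceil(p/100 * N)-th smallest value (1-indexed).
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Baseline TTFT values in ms, taken from the page's toy trace.
ttft_ms = [180, 240, 310, 420, 760]

p50 = nearest_rank(ttft_ms, 50)
p95 = nearest_rank(ttft_ms, 95)
mean = sum(ttft_ms) / len(ttft_ms)

print(p50, p95, mean)  # 310 760 382.0
```

With only five requests, p95 is simply the slowest request, and the mean of 382 ms hides it entirely. A different rule, such as linear interpolation, would report a different p95 for the same rows.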

The goodput under SLOs can be written as

$$\mathrm{Goodput}(\mathcal C) = \frac{1}{N} \sum_{i=1}^{N} \mathbf 1[ \mathrm{TTFT}_i \le s_{\mathrm{TTFT}} \land \mathrm{TPOT}_i \le s_{\mathrm{TPOT}} \land \widetilde{\mathrm{E2E}}_i \le s_{\mathrm{E2E}} ].$$
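The indicator sum can be checked by hand on the page's toy traces. The sketch below reuses the workload, the TTFT/TPOT columns, and the thresholds from the code section; the function name `goodput` is an assumption.

```python
# Workload rows: (request id, input tokens, output tokens).
workload = [("r1", 420, 48), ("r2", 760, 64), ("r3", 1180, 80), ("r4", 390, 32), ("r5", 1560, 120)]
# Per-request (TTFT ms, TPOT ms), in workload order.
baseline = [(180, 32), (240, 36), (310, 42), (420, 55), (760, 95)]
variant = [(175, 29), (218, 34), (285, 38), (350, 46), (490, 78)]
slo = {"ttft": 500, "tpot": 80, "derived_e2e": 12000}

def goodput(trace):
    good = 0
    for (_rid, _n_in, n_out), (ttft, tpot) in zip(workload, trace):
        derived_e2e = ttft + max(0, n_out - 1) * tpot
        # A request counts only if it meets all three thresholds at once.
        if ttft <= slo["ttft"] and tpot <= slo["tpot"] and derived_e2e <= slo["derived_e2e"]:
            good += 1
    return good / len(trace)

print(goodput(baseline))  # 0.8: r5 misses both the TTFT and TPOT thresholds
print(goodput(variant))   # 1.0: r5 now passes (490 ms, 78 ms, derived e2e 9772 ms)
```

Goodput is a request-level conjunction, so a single slow request lowers it even when the aggregate percentiles look acceptable.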

A toy API-cost readout might be

$$\widehat C_{\mathrm{api}} = \alpha_{in}\sum_i \frac{n^{in}_i}{1000} + \alpha_{out}\sum_i \frac{n^{out}_i}{1000},$$

where $\alpha_{in}$ and $\alpha_{out}$ come from a dated pricing card. That is not the full self-hosted cost. Infrastructure can also include idle GPUs, retries, failed requests, storage, networking, engineering time, orchestration, observability overhead, and utilization. If the workload $W$, telemetry requirement $T$, aggregation $A$, SLO $S$, or pricing/resource model $P$ changes, $h_{\mathcal C}$ changes and the comparison becomes a new measurement question.
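A small numeric sketch of the toy API-cost readout, using the workload and toy prices from this page's code. The function name `toy_api_cost` and the choice to raise an error on an undated card are assumptions of this example, not behavior of the Level 2 witness.

```python
# Workload rows: (request id, input tokens, output tokens).
workload = [("r1", 420, 48), ("r2", 760, 64), ("r3", 1180, 80), ("r4", 390, 32), ("r5", 1560, 120)]
# Toy pricing card; the numbers are illustrative, not real provider prices.
pricing_card = {"date": "2026-06-30", "in_1k": 0.001, "out_1k": 0.0025}

def toy_api_cost(rows, card):
    # Refuse to price against an undated card: the cost claim would not be pinned.
    if not card.get("date"):
        raise ValueError("undated pricing card: cost claim is not pinned")
    total_in = sum(n_in for _rid, n_in, _n_out in rows)
    total_out = sum(n_out for _rid, _n_in, n_out in rows)
    return card["in_1k"] * total_in / 1000 + card["out_1k"] * total_out / 1000

cost = toy_api_cost(workload, pricing_card)
print(round(cost, 6))  # 4310 input and 344 output tokens -> 0.00517
```

The readout depends only on $W$ and $P$, so two traces over the same workload and pricing card get the same toy cost no matter how their latencies differ. A latency improvement by itself is therefore not evidence of a cost improvement under this model.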

03

Code

Keep the implementation aligned with the notation so the algorithm is legible.

This short witness mirrors the math: one fixed contract, two traces, tail metrics, toy cost, goodput, and invalid-comparison cases.

import hashlib, json, math

# Workload W: (request id, input tokens, output tokens).
requests = [("r1", 420, 48), ("r2", 760, 64), ("r3", 1180, 80), ("r4", 390, 32), ("r5", 1560, 120)]
# Pinned contract: workload, aggregation rule, dated toy pricing card, SLO thresholds in ms.
contract = {
    "workload": requests,
    "aggregation": "nearest_rank_p50_p95",
    "pricing_date": "2026-06-30",
    "price": {"in_1k": 0.001, "out_1k": 0.0025},
    "slo_ms": {"ttft": 500, "tpot": 80, "derived_e2e": 12000},
}
# One trace row per request, in workload order: (TTFT ms, TPOT ms, cache signal, queue ms).
baseline = [(180, 32, .18, 20), (240, 36, .27, 35), (310, 42, .41, 70), (420, 55, .25, 130), (760, 95, .88, 260)]
variant = [(175, 29, .17, 16), (218, 34, .25, 30), (285, 38, .39, 62), (350, 46, .22, 90), (490, 78, .72, 180)]

def fp(x):
    # Short fingerprint: SHA-256 of key-sorted JSON, truncated to 12 hex characters.
    return hashlib.sha256(json.dumps(x, sort_keys=True).encode()).hexdigest()[:12]

def q(vals, p):
    # Nearest-rank percentile: the ceil(p/100 * N)-th smallest value.
    vals = sorted(vals)
    return vals[max(0, math.ceil(p / 100 * len(vals)) - 1)]

def summarize(trace):
    rows = []
    for (rid, n_in, n_out), (ttft, tpot, cache_gb, queue_ms) in zip(requests, trace):
        # Derived, not observed: first token at TTFT, then n_out - 1 inter-token gaps.
        derived_e2e = ttft + max(0, n_out - 1) * tpot
        # Toy API cost from token counts and the pinned pricing card.
        cost = n_in / 1000 * contract["price"]["in_1k"] + n_out / 1000 * contract["price"]["out_1k"]
        rows.append(dict(id=rid, ttft=ttft, tpot=tpot, derived_e2e=derived_e2e, cache=cache_gb, queue=queue_ms, cost=cost))
    slo = contract["slo_ms"]
    # A request counts toward goodput only if it meets every threshold.
    good = [r for r in rows if r["ttft"] <= slo["ttft"] and r["tpot"] <= slo["tpot"] and r["derived_e2e"] <= slo["derived_e2e"]]
    metrics = {"p95_ttft": q([r["ttft"] for r in rows], 95), "p95_tpot": q([r["tpot"] for r in rows], 95), "goodput": len(good) / len(rows), "toy_cost": round(sum(r["cost"] for r in rows), 6)}
    # The report carries the contract fingerprint, the rows, the metrics, and the caveat.
    report = {"contract": fp(contract), "rows": rows, "metrics": metrics, "caveat": "toy trace only"}
    return {**metrics, "contract": fp(contract), "report": fp(report)}

print("baseline", summarize(baseline))
print("variant", summarize(variant))
# Invalid-comparison cases, stated as reminders rather than computed checks.
print("changed workload => new measurement question")
print("missing pricing date => invalid cost claim")
print("average-only latency => under-specified claim")

The full Level 2 witness also writes a report fingerprint, queue/cache saturation notes, explicit candidate verdicts, and checks against an expected JSON artifact.

04

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.

Prediction check: use the widget below to decide whether the candidate evidence supports a comparable trace, a new measurement question, or an under-specified claim.

The pinned card names a five-request workload, TTFT/TPOT/derived-e2e/queue/cache telemetry, nearest-rank p50/p95 aggregation, request-level goodput thresholds, and a dated toy pricing card. Each case changes the evidence surface in a different way: one has the same measurement contract but a new measured trace, one quietly changes a request shape, one removes the dated price model, and one gives only average latency.

Commit to a verdict before reveal. The demo hides the contract hash, p95 values, goodput, toy cost, and report fingerprint until you choose. The sharp habit is to ask: which field in the measurement contract moved before the conclusion moved?
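The habit of asking which contract field moved can be sketched as a small classifier over the four cases described above. Everything here is hypothetical: the `verdict` function, the order of its checks, the reduced contract, and the metric names are illustrative assumptions. The verdict strings are borrowed from the reminder lines in this page's code section, and the demo may group the cases differently.

```python
import copy
import hashlib
import json

def fingerprint(obj):
    # Assumed hash: SHA-256 over key-sorted JSON.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()

# Hypothetical pinned contract, reduced to three fields for brevity.
pinned = {
    "workload": [["r1", 420, 48], ["r2", 760, 64]],
    "aggregation": "nearest_rank_p50_p95",
    "pricing_date": "2026-06-30",
}
REQUIRED_TAIL_METRICS = {"p95_ttft", "p95_tpot"}

def verdict(candidate_contract, candidate_metrics):
    # Missing fields are checked first, because they would also change the hash.
    if not candidate_contract.get("pricing_date"):
        return "invalid cost claim"
    if not REQUIRED_TAIL_METRICS <= set(candidate_metrics):
        return "under-specified claim"
    if fingerprint(candidate_contract) != fingerprint(pinned):
        return "new measurement question"
    return "comparable trace"

tail = {"p95_ttft": 490, "p95_tpot": 78}

same = copy.deepcopy(pinned)          # same contract, new measured trace
reshaped = copy.deepcopy(pinned)      # one request shape changed quietly
reshaped["workload"][1][2] = 96
undated = copy.deepcopy(pinned)       # dated price model removed
del undated["pricing_date"]

results = [
    verdict(same, tail),
    verdict(reshaped, tail),
    verdict(undated, tail),
    verdict(same, {"mean_ttft": 303.6}),  # average-only latency
]
print(results)
```

Only the first case supports a direct faster/cheaper comparison; the other three each name the contract field that has to be repaired or re-pinned first.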

Live Concept Demo

Explore Cost and Latency Observability

The stage is code-native and interactive. Use it to test the explanation against the mechanism.

Source Grounding

Canonical references for the mechanism on this page.

vLLM Documentation: Metrics. vLLM Project. Documentation, 2026.
Source for vLLM request/server metric vocabulary: TTFT, inter-token latency/TPOT, e2e latency, prompt and generation tokens, queue/prefill/decode intervals, running/waiting requests, and KV cache usage.

vLLM CLI Reference: vllm bench serve. vLLM Project. Documentation, 2026.
Source for percentile-metric names and request-level goodput SLO vocabulary in the vLLM benchmark CLI.

OpenTelemetry Metrics Data Model. OpenTelemetry. Documentation, 2026.
Source for metrics as streams/time series, sums, gauges, histograms, temporality, and reaggregation concerns.

Prometheus Documentation: Histograms and summaries. Prometheus. Documentation, 2026.
Source for histogram/summary quantile and aggregation caveats, including why precomputed quantiles are not cleanly aggregatable.

Site Reliability Engineering: Monitoring Distributed Systems. Beyer, Jones, Petoff, and Murphy, editors. Book, 2016.
Source for latency, traffic, errors, and saturation as core monitoring signals, plus the warning against relying on simple averages.

MLPerf Inference: Datacenter. MLCommons. Documentation, 2026.
Source for benchmark protocol discipline around inference scenarios, latency constraints, throughput metrics, power measurement, and compliance caveats.

Claim Review

Cost and latency observability turns serving traces into pinned evidence: workload shape, token counts, tail latency, queue/cache signals, SLOs, pricing dates, and report artifacts.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 6 references

vllm-v1-metrics-design, vllm-bench-serve, opentelemetry-metrics-data-model, prometheus-histograms-summaries, google-sre-four-golden-signals, mlperf-inference-datacenter

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: A serving cost/latency comparison is interpretable only to the extent that workload, telemetry, aggregation, goodput/SLO thresholds, pricing/resource card, window, report artifact, and caveats are explicit enough to scope the claim.

Claim metadata: source checked

The sources jointly motivate treating serving observability as pinned, scoped trace evidence: named latency/token/cache metrics, request-level goodput thresholds, histogram/quantile aggregation rules, monitoring signals, and benchmark-protocol discipline.

Sources: vLLM Documentation: Metrics, vLLM CLI Reference: vllm bench serve, OpenTelemetry Metrics Data Model, Prometheus Documentation: Histograms and summaries, Site Reliability Engineering: Monitoring Distributed Systems, MLPerf Inference: Datacenter

The page teaches a synthetic trace contract and toy accounting witness. It does not claim live serving measurement, provider price accuracy, vLLM performance, production cost savings, hardware sizing, autoscaling guidance, benchmark standing, or capacity planning.

A bounded review summary is present; still check caveats and exact source scope.

Primary-source review plus GPT-5.5 xhigh adversarial review support this caveated teaching synthesis after fixing witness SLO naming, derived-e2e semantics, stale expected JSON, and the same-workload cost overclaim. Details saved in responses/cost-latency-observability-source-support-review-20260630.md. In-app GPT Pro was unavailable because no browser target was exposed.

Reviewer: codex-primary-source-audit+gpt-5.5-xhigh-subagent; reviewed 2026-06-30
