🔬 LLM Calibration & Reliability Observatory

Interactive simulation showing how isotonic calibration makes LLM confidence scores trustworthy, how to detect calibration drift and trigger recalibration, and how to build production SLIs/SLOs for LLM reliability.

① Overview
② Live LLM Simulation
③ Isotonic Calibration
④ Drift & Recalibration
⑤ SLI/SLO Builder
The core problem: When an LLM says it is "85% confident", is it actually correct 85% of the time? Without calibration, the answer is almost always no. Miscalibrated confidence means your SLOs are built on lies. This simulation walks through the full pipeline: raw LLM → calibration → monitoring → recalibration → SLOs.
🤖 LLM Query → 📊 Raw Confidence → 🔧 Isotonic Calibration → Calibrated Confidence → 📈 ECE / Brier Monitoring → 🚨 Drift Detection → 🎯 SLI / SLO Enforcement

📐 Isotonic Calibration

A monotone, non-parametric method that maps raw model scores to calibrated probabilities. It fits a stepwise, non-decreasing function using the Pool Adjacent Violators (PAV) algorithm on a held-out validation set.

P(correct | confidence = c) ≈ c
After calibration, this approximation holds: a stated confidence of c corresponds to roughly probability c of being correct.
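
For reference, a minimal sketch of fitting such a mapping with scikit-learn's IsotonicRegression (the validation data here is hypothetical):

from sklearn.isotonic import IsotonicRegression

# Held-out validation set: raw confidences + binary correctness (hypothetical data)
raw_conf = [0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95]
correct  = [0,    1,    0,    1,    1,    1,    1]

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_conf, correct)                # PAV fit on the validation set

calibrated = iso.predict([0.72, 0.91])    # map new raw scores to probabilities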

📉 Expected Calibration Error

ECE bins predictions by confidence and measures the gap between average confidence and actual accuracy per bin, weighted by bin count.

ECE = Σₘ (|Bₘ| / n) × |acc(Bₘ) − conf(Bₘ)|
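
A minimal NumPy sketch of this formula (equal-width bins; names are illustrative):

import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)            # predictions in bin Bₘ
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap                 # weight by |Bₘ| / n
    return ece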

🎯 SLO from Calibration

Calibration metrics become Service Level Indicators (SLIs). We set SLOs such as "ECE < 0.10 over a 30-day window", and burn-rate alerts trigger recalibration when the error budget burns too fast.

SLI = 1 − ECE
SLO: SLI ≥ 0.90 over 30d

🧠 Research Foundation

Guo et al. (2017) — "On Calibration of Modern Neural Networks" demonstrated that modern deep networks are poorly calibrated despite high accuracy. Temperature scaling and isotonic regression were shown as effective post-hoc fixes.

Niculescu-Mizil & Caruana (2005) — "Predicting Good Probabilities With Supervised Learning" compared post-hoc calibration methods across many learning algorithms and showed that isotonic regression outperforms Platt scaling once enough calibration data is available.

Wang et al. (2022) — "Self-Consistency Improves Chain of Thought Reasoning in Language Models" showed that sampling multiple reasoning paths and taking the majority answer improves accuracy; the agreement rate among samples also serves as a useful confidence signal.

Our approach — Multi-signal confidence estimation (self-consistency 60%, linguistic markers 20%, coherence 15%, specificity 5%) + isotonic calibration + SLO enforcement.
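
As a sketch of how such a blend might look in code (the per-signal scores are assumed to come from upstream extractors; only the weights come from the scheme above):

# Weights from the multi-signal scheme above
WEIGHTS = {"self_consistency": 0.60, "linguistic_markers": 0.20,
           "coherence": 0.15, "specificity": 0.05}

def raw_confidence(signals):
    """Blend per-signal scores (each in [0, 1]) into one raw confidence."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

# e.g. 8 of 10 sampled answers agree → self_consistency = 0.8
raw = raw_confidence({"self_consistency": 0.8, "linguistic_markers": 0.6,
                      "coherence": 0.9, "specificity": 0.5})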

How it works: Click "Start Simulation" to stream simulated LLM queries. Each query gets a raw confidence score (often miscalibrated) and the system checks whether the response was actually correct. Watch the reliability diagram and ECE diverge from perfect calibration — then apply isotonic calibration to fix it.
Queries Processed: 0 · ECE (Calibration Error): 0.000 · Avg Accuracy: 0.0% · High-Conf Failures: 0

📊 Reliability Diagram (Live)

Bars = actual accuracy per bin · Diagonal = perfect calibration · Gap = miscalibration

📈 ECE Over Time

Expected Calibration Error tracked across the simulation window

📦 Confidence Distribution

Histogram of model confidence values and per-bin accuracy

📝 Query Log

Real-time feed of simulated LLM queries

Isotonic Calibration uses the Pool Adjacent Violators (PAV) algorithm to learn a monotone mapping from raw confidence → calibrated probability. Below: generate uncalibrated data, see the miscalibration, then apply isotonic regression and watch the reliability diagram snap to the diagonal.
ECE Before Calibration
ECE After Calibration

Before Calibration

Raw model confidence vs. actual accuracy per bin

After Isotonic Calibration

Calibrated confidence — bars align with the diagonal

🔍 Isotonic Mapping Function

The learned monotone mapping from raw confidence → calibrated probability

🧪 How Isotonic Regression Works (PAV Algorithm)

Step 1: Sort all predictions by raw confidence score.
Step 2: Pair each raw score with its binary outcome (correct/incorrect).
Step 3: Walk through sorted pairs. If the running average decreases (violates monotonicity), merge with previous block and re-average.
Step 4: The result is a step-wise non-decreasing function that maps any raw score to a calibrated probability.

Pool Adjacent Violators (runnable Python sketch):

def pav(outcomes):                     # outcomes y ∈ {0, 1}, sorted by raw confidence
    blocks = []                        # each block holds [sum, count]
    for y in outcomes:
        blocks.append([y, 1])
        # merge backwards while block means decrease (a monotonicity violation)
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]   # one calibrated value per prediction

Result: f(raw_conf) → calibrated_prob (monotone, non-parametric, distribution-free)
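
To calibrate a new raw score, find its position among the sorted validation scores (e.g. with np.searchsorted) and read off the fitted value there. In practice, sklearn.isotonic.IsotonicRegression performs the same PAV fit plus interpolation for new inputs, as in the sketch in the Overview section.
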
Calibration Drift happens when the data distribution shifts — model updates, prompt changes, or new user patterns. This section simulates normal operation → drift injection → detection → automatic recalibration. Watch the ECE climb past the SLO threshold, triggering an alert and recalibration workflow.
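
As a sketch, a sliding-window monitor that flags drift once windowed ECE crosses the auto-recalibration threshold might look like this (expected_calibration_error is the helper sketched earlier; trigger_recalibration is a hypothetical pipeline hook):

from collections import deque

ECE_SLO_LIMIT = 0.10        # "ECE < 0.10" SLO from the Overview section
RECAL_TRIGGER = 0.12        # auto-recalibration threshold (SLI/SLO section)

window = deque(maxlen=500)  # sliding window of (confidence, correct) pairs

def observe(conf, correct):
    window.append((conf, correct))
    confs, outcomes = zip(*window)
    ece = expected_calibration_error(confs, outcomes)
    if ece > RECAL_TRIGGER:
        trigger_recalibration(list(window))   # hypothetical recalibration hook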

🔄 Scenario Progress

Normal Operation → Drift Injected → 🚨 SLO Breach Detected → 🔧 Recalibration Triggered → 🎉 Restored to Healthy

📈 ECE Over Scenario Timeline

Watch ECE rise during drift, then drop after recalibration

📝 Incident Log

Automated drift detection and recalibration events

Pre-Drift ECE
Peak Drift ECE
Post-Recalibration ECE
Building SLIs & SLOs from Calibration: Traditional SLOs measure latency and error rate. For LLMs, we need semantic reliability — is the model's confidence score actually meaningful? Calibration metrics (ECE, Brier Score, high-confidence failure rate) become SLIs that SRE teams can monitor and alert on.

📐 SLI Definitions

SLI₁: Calibration Quality

SLI = 1 − ECE
Target: ≥ 0.90

Measures how well confidence predicts correctness. ECE < 0.10 = well-calibrated.

SLI₂: High-Conf Reliability

SLI = 1 − (HCF / HC_total)
Target: ≥ 0.95

Fraction of high-confidence (≥80%) predictions that are actually correct.

SLI₃: Brier Score

SLI = 1 − Brier
Target: ≥ 0.80

Combined accuracy + calibration measure. Lower Brier = better predictions.
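
A minimal sketch computing all three SLIs from a batch of (confidence, correct) pairs (reusing the ECE helper sketched earlier; names illustrative):

import numpy as np

def compute_slis(conf, correct, hc_threshold=0.80):
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    sli_cal = 1.0 - expected_calibration_error(conf, correct)   # SLI₁
    hc = conf >= hc_threshold                                   # high-confidence mask
    sli_hc = correct[hc].mean() if hc.any() else 1.0            # SLI₂: correct among high-conf
    sli_brier = 1.0 - np.mean((conf - correct) ** 2)            # SLI₃: 1 − Brier
    return {"calibration": sli_cal, "high_conf": sli_hc, "brier": sli_brier}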

Calibration SLO

Target: 90.0% · Window: 30d

High-Conf SLO

Target: 95.0% · Window: 30d

Brier SLO

Target: 80.0% · Window: 30d

📈 SLI Tracking (30-Day Window)

All three SLIs over a simulated 30-day period

🔥 Error Budget Burn

Remaining error budget for each SLO over the window

🚨 Alert Rules (Burn-Rate Based)

Page (Critical): 14.4× burn rate over 1h AND 6× over 6h
→ budget exhausting within days = immediate page

Ticket (Warning): 3× burn rate over 3d
→ slow, sustained burn eating the budget = create ticket

Auto-Recalibrate: ECE crosses 0.12
→ Trigger isotonic recalibration pipeline

Burn Rate:
burn_rate = error_rate / (1 − SLO_target)

Error Budget:
budget = (1 − SLO_target) × window_minutes
remaining = budget − consumed

Example (calibration SLO = 0.90, i.e. ECE budget = 0.10):
budget = 0.10 × 43,200 min = 4,320 min
If ECE = 0.15 → burn_rate = 0.15/0.10 = 1.5×
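
The same arithmetic as a runnable sketch (constants match the example above):

SLO_TARGET = 0.90
WINDOW_MIN = 30 * 24 * 60                      # 30-day window = 43,200 min

def burn_rate(error_rate):
    # how fast the budget burns relative to exactly meeting the SLO
    return error_rate / (1.0 - SLO_TARGET)

budget_min = (1.0 - SLO_TARGET) * WINDOW_MIN   # 4,320 error-budget minutes
rate = burn_rate(0.15)                         # ECE = 0.15 → 1.5×
exhausted_in = WINDOW_MIN / rate               # budget gone in 28,800 min ≈ 20 days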