🔬 LLM Calibration & Reliability Observatory

Interactive simulation showing how isotonic calibration makes LLM confidence scores trustworthy, how to detect calibration drift and trigger recalibration, and how to build production SLIs/SLOs for LLM reliability.

① Overview
② Live LLM Simulation
③ Isotonic Calibration
④ Drift & Recalibration
⑤ SLI/SLO Builder
The core problem: When an LLM says it is "85% confident", is it actually correct 85% of the time? Without calibration, the answer is almost always no. Miscalibrated confidence means your SLOs are built on lies. This simulation walks through the full pipeline: raw LLM → calibration → monitoring → recalibration → SLOs.
🤖 LLM Query → 📊 Raw Confidence → 🔧 Isotonic Calibration → Calibrated Confidence → 📈 ECE / Brier Monitoring → 🚨 Drift Detection → 🎯 SLI / SLO Enforcement

📐 Isotonic Calibration

A monotone, non-parametric method that maps raw model scores to calibrated probabilities. It fits a stepwise, non-decreasing function using the Pool Adjacent Violators (PAV) algorithm on a held-out validation set.

P(correct | confidence = c) ≈ c
After calibration, this approximation holds: a stated confidence of c corresponds to roughly probability c of being correct.
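
For reference, a minimal sketch of fitting such a mapping with scikit-learn's IsotonicRegression (the validation data here is hypothetical):

from sklearn.isotonic import IsotonicRegression

# Held-out validation set: raw confidences + binary correctness (hypothetical data)
raw_conf = [0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95]
correct  = [0,    1,    0,    1,    1,    1,    1]

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_conf, correct)                # PAV fit on the validation set

calibrated = iso.predict([0.72, 0.91])    # map new raw scores to probabilities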

📉 Expected Calibration Error

ECE bins predictions by confidence and measures the gap between average confidence and actual accuracy per bin, weighted by bin count.

ECE = Σₘ (|Bₘ| / n) × |acc(Bₘ) − conf(Bₘ)|
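
A minimal NumPy sketch of this formula (equal-width bins; names are illustrative):

import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)            # predictions in bin Bₘ
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap                 # weight by |Bₘ| / n
    return ece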

🎯 SLO from Calibration

Calibration metrics become Service Level Indicators (SLIs). We set SLOs such as "ECE < 0.10 over a 30-day window", and burn-rate alerts trigger recalibration when the error budget burns too fast.

SLI = 1 − ECE
SLO: SLI ≥ 0.90 over 30d

🧠 Research Foundation

Guo et al. (2017) — "On Calibration of Modern Neural Networks" demonstrated that modern deep networks are poorly calibrated despite high accuracy. Temperature scaling and isotonic regression were shown as effective post-hoc fixes.

Niculescu-Mizil & Caruana (2005) — "Predicting Good Probabilities With Supervised Learning" compared post-hoc calibration methods across many learning algorithms and showed that isotonic regression outperforms Platt scaling once enough calibration data is available.

Wang et al. (2022) — "Self-Consistency Improves Chain of Thought Reasoning in Language Models" showed that sampling multiple reasoning paths and taking the majority answer improves accuracy; the agreement rate among samples also serves as a useful confidence signal.

Our approach — Multi-signal confidence estimation (self-consistency 60%, linguistic markers 20%, coherence 15%, specificity 5%) + isotonic calibration + SLO enforcement.
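
As a sketch of how such a blend might look in code (the per-signal scores are assumed to come from upstream extractors; only the weights come from the scheme above):

# Weights from the multi-signal scheme above
WEIGHTS = {"self_consistency": 0.60, "linguistic_markers": 0.20,
           "coherence": 0.15, "specificity": 0.05}

def raw_confidence(signals):
    """Blend per-signal scores (each in [0, 1]) into one raw confidence."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

# e.g. 8 of 10 sampled answers agree → self_consistency = 0.8
raw = raw_confidence({"self_consistency": 0.8, "linguistic_markers": 0.6,
                      "coherence": 0.9, "specificity": 0.5})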

How it works: Click "Start Simulation" to stream simulated LLM queries. Each query gets a raw confidence score (often miscalibrated) and the system checks whether the response was actually correct. Watch the reliability diagram and ECE diverge from perfect calibration — then apply isotonic calibration to fix it.
Queries Processed: 0 · ECE (Calibration Error): 0.000 · Avg Accuracy: 0.0% · High-Conf Failures: 0

📊 Reliability Diagram (Live)

Bars = actual accuracy per bin · Diagonal = perfect calibration · Gap = miscalibration

📈 ECE Over Time

Expected Calibration Error tracked across the simulation window

📦 Confidence Distribution

Histogram of model confidence values and per-bin accuracy

📝 Query Log

Real-time feed of simulated LLM queries

Isotonic Calibration uses the Pool Adjacent Violators (PAV) algorithm to learn a monotone mapping from raw confidence → calibrated probability. Below: generate uncalibrated data, see the miscalibration, then apply isotonic regression and watch the reliability diagram snap to the diagonal.
ECE Before Calibration
ECE After Calibration

Before Calibration

Raw model confidence vs. actual accuracy per bin

After Isotonic Calibration

Calibrated confidence — bars align with the diagonal

🔍 Isotonic Mapping Function

The learned monotone mapping from raw confidence → calibrated probability

🧪 How Isotonic Regression Works (PAV Algorithm)

Step 1: Sort all predictions by raw confidence score.
Step 2: Pair each raw score with its binary outcome (correct/incorrect).
Step 3: Walk through sorted pairs. If the running average decreases (violates monotonicity), merge with previous block and re-average.
Step 4: The result is a step-wise non-decreasing function that maps any raw score to a calibrated probability.

Pool Adjacent Violators (runnable Python sketch):

def pav(outcomes):                     # outcomes y ∈ {0, 1}, sorted by raw confidence
    blocks = []                        # each block holds [sum, count]
    for y in outcomes:
        blocks.append([y, 1])
        # merge backwards while block means decrease (a monotonicity violation)
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]   # one calibrated value per prediction

Result: f(raw_conf) → calibrated_prob (monotone, non-parametric, distribution-free)
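
To calibrate a new raw score, find its position among the sorted validation scores (e.g. with np.searchsorted) and read off the fitted value there. In practice, sklearn.isotonic.IsotonicRegression performs the same PAV fit plus interpolation for new inputs, as in the sketch in the Overview section.
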
Calibration Drift happens when the data distribution shifts — model updates, prompt changes, or new user patterns. This section simulates normal operation → drift injection → detection → automatic recalibration. Watch the ECE climb past the SLO threshold, triggering an alert and recalibration workflow.
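
As a sketch, a sliding-window monitor that flags drift once windowed ECE crosses the auto-recalibration threshold might look like this (expected_calibration_error is the helper sketched earlier; trigger_recalibration is a hypothetical pipeline hook):

from collections import deque

ECE_SLO_LIMIT = 0.10        # "ECE < 0.10" SLO from the Overview section
RECAL_TRIGGER = 0.12        # auto-recalibration threshold (SLI/SLO section)

window = deque(maxlen=500)  # sliding window of (confidence, correct) pairs

def observe(conf, correct):
    window.append((conf, correct))
    confs, outcomes = zip(*window)
    ece = expected_calibration_error(confs, outcomes)
    if ece > RECAL_TRIGGER:
        trigger_recalibration(list(window))   # hypothetical recalibration hook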

🔄 Scenario Progress

Normal Operation → Drift Injected → 🚨 SLO Breach Detected → 🔧 Recalibration Triggered → 🎉 Restored to Healthy

📈 ECE Over Scenario Timeline

Watch ECE rise during drift, then drop after recalibration

📝 Incident Log

Automated drift detection and recalibration events

Pre-Drift ECE
Peak Drift ECE
Post-Recalibration ECE
Building SLIs & SLOs from Calibration: Traditional SLOs measure latency and error rate. For LLMs, we need semantic reliability — is the model's confidence score actually meaningful? Calibration metrics (ECE, Brier Score, high-confidence failure rate) become SLIs that SRE teams can monitor and alert on.

📐 SLI Definitions

SLI₁: Calibration Quality

SLI = 1 − ECE
Target: ≥ 0.90

Measures how well confidence predicts correctness. ECE < 0.10 = well-calibrated.

SLI₂: High-Conf Reliability

SLI = 1 − (HCF / HC_total)
Target: ≥ 0.95

Fraction of high-confidence (≥80%) predictions that are actually correct.

SLI₃: Brier Score

SLI = 1 − Brier
Target: ≥ 0.80

Combined accuracy + calibration measure. Lower Brier = better predictions.
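
A minimal sketch computing all three SLIs from a batch of (confidence, correct) pairs (reusing the ECE helper sketched earlier; names illustrative):

import numpy as np

def compute_slis(conf, correct, hc_threshold=0.80):
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    sli_cal = 1.0 - expected_calibration_error(conf, correct)   # SLI₁
    hc = conf >= hc_threshold                                   # high-confidence mask
    sli_hc = correct[hc].mean() if hc.any() else 1.0            # SLI₂: correct among high-conf
    sli_brier = 1.0 - np.mean((conf - correct) ** 2)            # SLI₃: 1 − Brier
    return {"calibration": sli_cal, "high_conf": sli_hc, "brier": sli_brier}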

Calibration SLO

Target: 90.0% · Window: 30d

High-Conf SLO

Target: 95.0% · Window: 30d

Brier SLO

Target: 80.0% · Window: 30d

📈 SLI Tracking (30-Day Window)

All three SLIs over a simulated 30-day period

🔥 Error Budget Burn

Remaining error budget for each SLO over the window

🚨 Alert Rules (Burn-Rate Based)

Page (Critical): 14.4× burn rate over 1h AND 6× over 6h
→ budget exhausting within days = immediate page

Ticket (Warning): 3× burn rate over 3d
→ slow, sustained burn eating the budget = create ticket

Auto-Recalibrate: ECE crosses 0.12
→ Trigger isotonic recalibration pipeline

Burn Rate:
burn_rate = error_rate / (1 − SLO_target)

Error Budget:
budget = (1 − SLO_target) × window_minutes
remaining = budget − consumed

Example (calibration SLO = 0.90, i.e. ECE budget = 0.10):
budget = 0.10 × 43,200 min = 4,320 min
If ECE = 0.15 → burn_rate = 0.15/0.10 = 1.5×
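
The same arithmetic as a runnable sketch (constants match the example above):

SLO_TARGET = 0.90
WINDOW_MIN = 30 * 24 * 60                      # 30-day window = 43,200 min

def burn_rate(error_rate):
    # how fast the budget burns relative to exactly meeting the SLO
    return error_rate / (1.0 - SLO_TARGET)

budget_min = (1.0 - SLO_TARGET) * WINDOW_MIN   # 4,320 error-budget minutes
rate = burn_rate(0.15)                         # ECE = 0.15 → 1.5×
exhausted_in = WINDOW_MIN / rate               # budget gone in 28,800 min ≈ 20 days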