Interactive simulation showing how isotonic calibration makes LLM confidence scores trustworthy, how to detect calibration drift and trigger recalibration, and how to build production SLIs/SLOs for LLM reliability.
A monotone, non-parametric method that maps raw model scores to calibrated probabilities. It fits a step-wise non-decreasing function to a held-out validation set using the Pool Adjacent Violators (PAV) algorithm.
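A minimal sketch of this fit, assuming scikit-learn's IsotonicRegression is available; the held-out validation data here is illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Held-out validation set: raw confidence scores paired with binary outcomes
# (1 = answer judged correct, 0 = incorrect). Values are illustrative.
raw_scores = np.array([0.35, 0.48, 0.55, 0.62, 0.70, 0.78, 0.85, 0.91, 0.95])
outcomes   = np.array([0,    0,    1,    0,    1,    1,    1,    1,    1])

# Fit a non-decreasing step function; clip out-of-range scores into [0, 1].
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, outcomes)

# Map a new raw confidence to a calibrated probability.
calibrated = calibrator.predict([0.88])[0]
```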
ECE bins predictions by confidence and measures the gap between average confidence and actual accuracy per bin, weighted by each bin's share of samples.
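A sketch of that computation, assuming 10 equal-width bins; the helper name is illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; for each bin take |avg confidence - accuracy|,
    weighted by the bin's share of all samples."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # First bin includes confidence exactly 0; later bins are half-open (lo, hi].
        in_bin = (confidences > lo) & (confidences <= hi) if lo > 0 else (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```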
Calibration metrics become Service Level Indicators (SLIs). We set SLOs like "ECE < 0.10 over a 30-day window" and burn-rate alerts trigger recalibration.
Guo et al. (2017) — "On Calibration of Modern Neural Networks" demonstrated that modern deep networks are poorly calibrated despite high accuracy. Temperature scaling and isotonic regression were shown as effective post-hoc fixes.
Niculescu-Mizil & Caruana (2005) — "Predicting Good Probabilities With Supervised Learning" compared post-hoc calibration methods and showed that isotonic regression outperforms Platt scaling once enough calibration data is available.
Wang et al. (2022) — "Self-Consistency Improves Chain of Thought Reasoning in Language Models" showed that sampling multiple LLM responses and measuring their agreement provides a confidence signal that tracks correctness.
Our approach — Multi-signal confidence estimation (self-consistency 60%, linguistic markers 20%, coherence 15%, specificity 5%) + isotonic calibration + SLO enforcement; the weighted combination is sketched below.
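A minimal sketch of that weighted combination. The signal extraction itself (sampling, marker detection, scoring) is out of scope here, and all names are illustrative; each signal is assumed to be scored in [0, 1].

```python
# Weights from the approach above.
WEIGHTS = {
    "self_consistency": 0.60,  # agreement rate across several sampled responses
    "linguistic":       0.20,  # certainty vs. hedging markers in the answer text
    "coherence":        0.15,  # internal consistency of the answer
    "specificity":      0.05,  # concreteness of the claims made
}

def raw_confidence(signals: dict[str, float]) -> float:
    """Weighted sum of the four signals; this raw score is what the
    isotonic calibrator later maps to a calibrated probability."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())
```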
Bars = actual accuracy per bin · Diagonal = perfect calibration · Gap = miscalibration
Expected Calibration Error tracked across the simulation window
Histogram of model confidence values and per-bin accuracy
Real-time feed of simulated LLM queries
Raw model confidence vs. actual accuracy per bin
Calibrated confidence — bars align with the diagonal
The learned monotone mapping from raw confidence → calibrated probability
Step 1: Sort all predictions by raw confidence score.
Step 2: Pair each raw score with its binary outcome (correct/incorrect).
Step 3: Walk through the sorted pairs, maintaining blocks. If a block's average outcome falls below the previous block's average (violating monotonicity), merge the two blocks and re-average.
Step 4: The result is a step-wise non-decreasing function that maps any raw score to a calibrated probability (sketched in code below).
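A compact sketch of those four steps in pure Python (not optimized; a library routine such as scikit-learn's IsotonicRegression would normally be used instead). Function names are illustrative.

```python
def pav_fit(raw_scores, outcomes):
    """Pool Adjacent Violators: returns (thresholds, values) describing a
    step-wise non-decreasing map from raw score to calibrated probability."""
    # Steps 1-2: pair each raw score with its binary outcome and sort by score.
    pairs = sorted(zip(raw_scores, outcomes))
    # Each block stores [sum of outcomes, count, highest raw score in the block].
    blocks = []
    for score, y in pairs:
        blocks.append([float(y), 1, score])
        # Step 3: while the newest block's average is below the previous block's,
        # monotonicity is violated, so merge the blocks and re-average.
        while len(blocks) > 1 and \
                blocks[-1][0] / blocks[-1][1] < blocks[-2][0] / blocks[-2][1]:
            s, n, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = hi
    # Step 4: each block's average outcome is the calibrated value on its range.
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values

def pav_predict(score, thresholds, values):
    """Look up the calibrated probability for a raw score (clamp beyond the ends)."""
    for t, v in zip(thresholds, values):
        if score <= t:
            return v
    return values[-1]
```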
Watch ECE rise during drift, then drop after recalibration
Automated drift detection and recalibration events
Measures how well confidence predicts correctness. ECE < 0.10 = well-calibrated.
Fraction of high-confidence (≥80%) predictions that are actually correct.
Combined accuracy + calibration measure. Lower Brier = better predictions (sketched in code below).
Target: 90.0% · Window: 30d
Target: 95.0% · Window: 30d
Target: 80.0% · Window: 30d
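A sketch of how the three SLIs above could be computed over one evaluation window of (calibrated confidence, outcome) pairs. It reuses the expected_calibration_error helper sketched earlier, and the 0.80 cut-off matches the high-confidence definition above; names are illustrative.

```python
import numpy as np

def sli_snapshot(calibrated_conf, correct, n_bins=10):
    """Return the three SLIs (ECE, high-confidence precision, Brier) for a window."""
    calibrated_conf = np.asarray(calibrated_conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    high = calibrated_conf >= 0.80  # high-confidence predictions
    return {
        "ece": expected_calibration_error(calibrated_conf, correct, n_bins),
        "high_conf_precision": float(correct[high].mean()) if high.any() else float("nan"),
        "brier": float(np.mean((calibrated_conf - correct) ** 2)),
    }
```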
All three SLIs over a simulated 30-day period
Remaining error budget for each SLO over the window
Page (Critical): 14.4× burn rate over 1h AND 6× over 6h
→ ECE > 0.15 sustained = immediate page
Ticket (Warning): 3× burn rate over 3d
→ ECE trending toward 0.10 = create ticket
Auto-Recalibrate: ECE crosses 0.12
→ Trigger isotonic recalibration pipeline (decision logic sketched below)
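A minimal sketch of the decision logic implied by this policy. It assumes the burn rates are computed upstream from the ECE SLI and its error budget (that computation is not defined here); the function only encodes the thresholds listed above, and all names are illustrative.

```python
def alert_actions(burn_1h, burn_6h, burn_3d, ece_now,
                  page_level=0.15, recalibrate_level=0.12):
    """Map windowed burn rates and the current ECE to page / ticket / recalibrate."""
    actions = []
    # Page: fast multi-window burn, or ECE above the critical level.
    if (burn_1h >= 14.4 and burn_6h >= 6.0) or ece_now > page_level:
        actions.append("page")
    # Ticket: slow burn over the 3-day window.
    elif burn_3d >= 3.0:
        actions.append("ticket")
    # Auto-recalibration fires independently of paging and ticketing.
    if ece_now > recalibrate_level:
        actions.append("auto_recalibrate")
    return actions
```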