Researchers and ML practitioners submitting jobs to shared GPU clusters face the same frustrating question every day: "When will my job actually run?" Cluster schedulers like Slurm and Kubernetes batch know the answer — they just don't share it in a useful way. Queue depth, node utilization, and job priority all affect wait times, but this information rarely reaches users as actionable guidance.
VGAC (Virtual GPU Allocation Controller) is the open-source observability platform I built to solve this. It transforms passive cluster monitoring into proactive scheduling intelligence through calibrated queue delay predictions. I'll be presenting VGAC at the 2026 Improving Scientific Software Conference at NCAR in Boulder, CO (April 9, 2026), where my talk was accepted with travel support.
In shared GPU clusters — whether academic HPC centers, corporate ML platforms, or cloud-based training clusters — job submission is a black box. You submit a job requesting 4x A100 GPUs, and... you wait. Maybe 2 minutes. Maybe 2 hours. The uncertainty has real costs for researchers and cluster operators alike.
VGAC consists of three layers, each solving a distinct problem:
┌─────────────────────────────────────────────────────┐
│ Visualization Layer │
│ Grafana dashboards │ REST API │ Admission Webhooks │
├─────────────────────────────────────────────────────┤
│ Prediction Service │
│ scikit-learn │ Isotonic calibration │ FastAPI │
│ Sub-10ms inference latency │
├─────────────────────────────────────────────────────┤
│ Data Collection Plane │
│ kube-state-metrics │ dcgm-exporter │ GPU telemetry │
│ ClickHouse (time-series) │ Redis (cache) │
└─────────────────────────────────────────────────────┘
The foundation is comprehensive cluster observability. VGAC collects queue and scheduler state via kube-state-metrics and per-device GPU telemetry via dcgm-exporter.
All data flows into ClickHouse for time-series storage — chosen for its compression efficiency and fast analytical queries over large time ranges. Redis provides a caching layer for real-time prediction features.
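As a concrete sketch of the ingestion path, batches of telemetry rows can be POSTed to ClickHouse over its standard HTTP interface. The table name (`gpu_telemetry`), the column layout, and the host address below are illustrative assumptions, not VGAC's actual schema:

```python
import urllib.parse
import urllib.request

def rows_to_tsv(rows):
    """Serialize telemetry rows to TSV, a format ClickHouse accepts for INSERTs."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows) + "\n"

def insert_telemetry(rows, host="http://localhost:8123", table="gpu_telemetry"):
    """POST a batch of rows into a ClickHouse table via the HTTP interface.

    Both the table name and the schema implied by the rows are hypothetical.
    """
    query = urllib.parse.urlencode({"query": f"INSERT INTO {table} FORMAT TSV"})
    req = urllib.request.Request(
        f"{host}/?{query}", data=rows_to_tsv(rows).encode(), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example batch: (timestamp, node, gpu_index, utilization_pct)
batch = [("2026-04-09 12:00:00", "node-a", 0, 87.5)]
```

Batching inserts like this plays to ClickHouse's strengths: it compresses and queries large append-only batches far more efficiently than row-at-a-time writes.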
The core innovation in VGAC is the prediction service. Rather than just telling users the current queue depth (which is what most monitoring tools do), VGAC predicts whether a submitted job will wait more than a specified threshold (default: 120 seconds).
The prediction pipeline extracts queue-state features at submission time, scores them with a scikit-learn classifier, and applies isotonic calibration so that the output probability can actually be trusted.
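The pipeline shape can be sketched in a few lines of scikit-learn. The document only specifies "scikit-learn + isotonic calibration," so the choice of gradient boosting as the base classifier and the synthetic training data here are assumptions:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stand-in training data: queue-state features observed at submission time
# (e.g. pending_ratio plus a couple of others), and a binary label for
# whether the job ended up waiting longer than the 120s threshold.
X = rng.random((582, 3))
y = (X[:, 0] + 0.1 * rng.standard_normal(582) > 0.5).astype(int)

# Base classifier wrapped in isotonic calibration, mirroring the
# classify-then-calibrate structure described in the text.
model = CalibratedClassifierCV(
    GradientBoostingClassifier(), method="isotonic", cv=3
)
model.fit(X, y)

# Calibrated probability that a hypothetical new job waits > 120s.
p_wait = model.predict_proba([[0.8, 0.6, 0.5]])[0, 1]
```

Keeping calibration as a wrapper around the classifier means the base model can be swapped out later without touching the part users depend on: trustworthy probabilities.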
This is the central insight of the VGAC project, and it connects to my broader research on LLM calibration and reliability.
A model with 90% accuracy that gives 95% confidence on every prediction is less useful than a model with 80% accuracy that gives well-calibrated probabilities. Users make decisions based on predicted probabilities — "should I wait or grab lunch?" — and those decisions are only rational if the probabilities are trustworthy.
We measure calibration with Expected Calibration Error (ECE): the average gap between predicted probability and observed frequency. Our production ECE of 0.077 means predictions are, on average, within 7.7 percentage points of the observed outcome frequency. For context, many ML systems have ECE > 0.15.
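The metric itself is simple to compute: bin predictions by confidence, then take the size-weighted average gap between each bin's mean predicted probability and its observed frequency. A minimal sketch (the 10-bin default is an assumption):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: size-weighted average gap between mean predicted probability
    and observed frequency within equal-width confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probs == 1.0 are counted.
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy example: predictions of 0.8 that come true
# 8 times out of 10, so the gap in that bin is zero.
probs = [0.8] * 10
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

An ECE of 0.077 on this scale says that, averaged over bins, the predicted probabilities and the realized frequencies disagree by under 8 points.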
We deployed VGAC on a production Amazon EKS cluster with heterogeneous GPUs (T4 and A10G), collecting 582 job lifecycle records for the initial evaluation:
| Metric | Value |
|---|---|
| AUC-ROC | 0.756 |
| Expected Calibration Error | 0.077 |
| Prediction latency (p95) | < 10ms |
| Top feature importance | pending_ratio: 0.801 |
The dominant feature — pending_ratio at submission time — accounts for 80.1% of feature importance. This is a key finding: minimal observability metrics enable reliable predictions. You don't need complex workload characterization or historical job profiling. The queue state at submission time captures most of the signal.
Predictions are useless if they don't reach users at the right time. VGAC surfaces predictions through three channels:
`GET /predict?gpu_type=a10g&gpu_count=4` returns a calibrated probability and a recommended action, with sub-10ms latency. Teams integrate this into custom CLIs and job submission scripts.

After iterating through several architectures, three principles crystallized:
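A submission script can consume the endpoint with nothing but the standard library. The response field names (`p_wait_over_120s`, `action`) and the decision threshold below are illustrative assumptions; only the query parameters come from the example above:

```python
import json
import urllib.parse
import urllib.request

def build_predict_url(gpu_type, gpu_count, base="https://vgac.cloud"):
    """Construct the /predict query from the example above."""
    qs = urllib.parse.urlencode({"gpu_type": gpu_type, "gpu_count": gpu_count})
    return f"{base}/predict?{qs}"

def should_submit_now(url, threshold=0.5):
    """Fetch a calibrated wait probability and turn it into a
    submit-now-or-come-back-later decision. The JSON field names here
    are hypothetical, not VGAC's documented response schema."""
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return body["p_wait_over_120s"] < threshold, body.get("action")

url = build_predict_url("a10g", 4)
```

Because the endpoint returns a probability rather than a yes/no, each team can pick its own threshold for how much queueing risk a submission script should tolerate.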
1. Calibration over accuracy. User trust depends on reliable probability estimates, not just correct binary predictions. We invested more engineering effort in isotonic calibration and calibration monitoring than in improving the base classifier.
2. Minimal features. The temptation is to add every available signal — job type, user history, GPU memory patterns, historical completion times. In practice, pending_ratio alone captures most of the predictive signal. More features add latency, complexity, and overfitting risk without proportional accuracy gains.
3. Seamless integration. The prediction service is only useful if it's in the user's workflow. REST endpoints for scripting, admission webhooks for automatic routing, Grafana for visual monitoring — meet users where they are.
Cluster workloads change over time: new teams onboard, GPU types are added, usage patterns shift seasonally. VGAC's calibration will degrade unless it's actively monitored and recalibrated.
We're building SLO-based calibration monitoring that treats ECE as a service-level indicator (SLI) with an error budget. When calibration drift exceeds the budget, automated recalibration triggers using the latest job data. This connects to the broader SLI/SLO framework for ML reliability that I've been developing.
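The error-budget mechanics can be sketched as a small state machine: each monitoring window where ECE exceeds its target burns a slice of the budget, and an exhausted budget triggers recalibration. The target, burn rate, and window weighting below are illustrative, not VGAC's actual configuration:

```python
def calibration_slo_check(recent_ece, budget_remaining,
                          ece_target=0.10, window_weight=0.25):
    """Treat ECE as an SLI: a window where calibration drifts past the
    target burns part of an error budget; an exhausted budget signals
    that automated recalibration should run. All thresholds here are
    hypothetical stand-ins."""
    if recent_ece > ece_target:
        budget_remaining -= window_weight
    recalibrate = budget_remaining <= 0.0
    return budget_remaining, recalibrate

# Four consecutive windows of calibration drift exhaust the budget.
budget, trigger = 1.0, False
for window_ece in [0.12, 0.15, 0.13, 0.14]:
    budget, trigger = calibration_slo_check(window_ece, budget)
```

Framing drift this way tolerates transient blips (a single bad window only burns budget) while still guaranteeing that sustained drift forces a retrain on fresh job data.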
I'm presenting VGAC at the 2026 Improving Scientific Software Conference (ISS 2026) at NCAR in Boulder, CO on April 9, 2026. The talk covers the full architecture, production deployment lessons, and calibration methodology. The conference committee accepted the talk with travel support, which I'm deeply grateful for — it's an opportunity to share this work with the scientific computing community where GPU cluster scheduling is a daily pain point.
VGAC is open source and live at vgac.cloud.