Researchers and ML practitioners submitting jobs to shared GPU clusters face the same frustrating question every day: "When will my job actually run?" Cluster schedulers like Slurm and Kubernetes batch know the answer — they just don't share it in a useful way. Queue depth, node utilization, and job priority all affect wait times, but this information rarely reaches users as actionable guidance.

VGAC (Virtual GPU Allocation Controller) is the open-source observability platform I built to solve this. It transforms passive cluster monitoring into proactive scheduling intelligence through calibrated queue delay predictions. I'll be presenting VGAC at the 2026 Improving Scientific Software Conference at NCAR in Boulder, CO (April 9, 2026), where my talk was accepted with travel support.

The Problem: Unpredictable Wait Times Kill Productivity

In shared GPU clusters — whether academic HPC centers, corporate ML platforms, or cloud-based training clusters — job submission is a black box. You submit a job requesting 4x A100 GPUs, and... you wait. Maybe 2 minutes. Maybe 2 hours. The uncertainty has real costs:

  • Wasted researcher time — Scientists plan their day around expected job completion. An unexpected 2-hour wait means 2 hours of disrupted work, not just 2 hours of idle waiting.
  • Suboptimal resource planning — Without visibility into queue dynamics, teams can't make informed decisions about whether to request spot instances, wait for off-peak hours, or reduce GPU requirements.
  • Over-provisioning — When wait times are unpredictable, organizations hedge by buying more GPUs than they need. Better predictions enable right-sizing.
  • Missed deadlines — Conference submission deadlines, grant milestones, and product launches all depend on completing training runs on time.

Architecture: Three Integrated Layers

VGAC consists of three layers, each solving a distinct problem:

┌─────────────────────────────────────────────────────┐
│  Visualization Layer                                 │
│  Grafana dashboards │ REST API │ Admission Webhooks  │
├─────────────────────────────────────────────────────┤
│  Prediction Service                                  │
│  scikit-learn │ Isotonic calibration │ FastAPI        │
│  Sub-10ms inference latency                          │
├─────────────────────────────────────────────────────┤
│  Data Collection Plane                               │
│  kube-state-metrics │ dcgm-exporter │ GPU telemetry  │
│  ClickHouse (time-series) │ Redis (cache)            │
└─────────────────────────────────────────────────────┘

Layer 1: Data Collection

The foundation is comprehensive cluster observability. VGAC collects:

  • Job lifecycle events — Via kube-state-metrics (Kubernetes) or Slurm accounting logs. Every state transition (submitted → pending → running → completed/failed) is captured with timestamps.
  • GPU telemetry — Via dcgm-exporter for NVIDIA GPUs. Utilization, memory, temperature, and power consumption per GPU. This data feeds both predictions (high utilization = longer waits) and capacity planning.
  • Queue state snapshots — Periodic snapshots of queue depth, pending job count, and available resources by GPU type. These become the primary features for the prediction model.

All data flows into ClickHouse for time-series storage — chosen for its compression efficiency and fast analytical queries over large time ranges. Redis provides a caching layer for real-time prediction features.
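As a concrete sketch, a queue-state snapshot reduces to a small feature record. The field names and schema below are illustrative, not VGAC's actual data model:

```python
from dataclasses import dataclass

@dataclass
class QueueSnapshot:
    """One periodic snapshot of queue state for a GPU type (illustrative schema)."""
    pending_jobs: int        # jobs waiting for this GPU type
    running_jobs: int        # jobs currently holding GPUs
    total_gpu_capacity: int  # schedulable GPUs of this type

    @property
    def queue_depth(self) -> int:
        return self.pending_jobs

    @property
    def pending_ratio(self) -> float:
        # Pending jobs relative to total capacity; guard against zero capacity.
        return self.pending_jobs / max(self.total_gpu_capacity, 1)

snap = QueueSnapshot(pending_jobs=6, running_jobs=10, total_gpu_capacity=16)
print(snap.pending_ratio)  # → 0.375
```

Snapshots like this are cheap to compute and cache, which is what makes them viable as real-time prediction features.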

Layer 2: Prediction Service

The core innovation in VGAC is the prediction service. Rather than just telling users the current queue depth (which is what most monitoring tools do), VGAC predicts whether a submitted job will wait more than a specified threshold (default: 120 seconds).

The prediction pipeline:

  1. Feature extraction — At submission time, compute the feature vector: queue depth, pending ratio (pending jobs / total capacity), GPU type requested, time of day, day of week, historical wait distribution for similar jobs.
  2. Classification — Logistic regression predicts P(wait > threshold). We chose logistic regression over more complex models because the feature space is small and interpretability matters (users want to know why they'll wait, not just that they will).
  3. Isotonic calibration — Raw model probabilities are passed through an isotonic regression calibrator to produce well-calibrated confidence estimates. This is the critical step: when VGAC says "70% chance of long wait," users need to trust that estimate.
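Steps 2 and 3 can be sketched with scikit-learn's `CalibratedClassifierCV`, which wraps a base classifier in cross-validated isotonic calibration. The synthetic data and three-feature layout below are illustrative stand-ins for VGAC's real training set:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for historical submissions:
# features = [queue_depth, pending_ratio, hour_of_day]; label = 1 if wait > threshold.
X = rng.random((2000, 3)) * np.array([50.0, 1.0, 24.0])
y = (X[:, 1] + 0.1 * rng.standard_normal(2000) > 0.5).astype(int)

# Logistic regression base classifier, isotonic calibration fit via cross-validation.
model = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000),
    method="isotonic",
    cv=5,
)
model.fit(X, y)

# Calibrated probability that a new job waits longer than the threshold.
p_long_wait = model.predict_proba([[12.0, 0.8, 14.0]])[0, 1]
print(f"P(wait > threshold) = {p_long_wait:.2f}")
```

Fitting the calibrator on held-out folds (rather than the training folds) is what keeps the isotonic step from simply memorizing the classifier's training-set scores.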

Why Calibration Matters More Than Accuracy

This is the central insight of the VGAC project, and it connects to my broader research on LLM calibration and reliability.

A model with 90% accuracy that gives 95% confidence on every prediction is less useful than a model with 80% accuracy that gives well-calibrated probabilities. Users make decisions based on predicted probabilities — "should I wait or grab lunch?" — and those decisions are only rational if the probabilities are trustworthy.

We measure calibration with Expected Calibration Error (ECE): the average gap between predicted probability and observed frequency. Our production ECE of 0.077 means predictions are, on average, within 7.7 percentage points of the true probability. For context, many ML systems have ECE > 0.15.
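To make the metric concrete, here is a minimal binned ECE implementation. The equal-width ten-bin scheme is one common convention, not necessarily VGAC's exact variant:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: bin-weighted average of |mean predicted prob - observed frequency|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probs == 1.0 are counted.
        in_bin = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

# Perfectly calibrated toy case: predicted 0.5 on outcomes that are 50% positive.
print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # → 0.0
```

An overconfident model (say, predicting 0.9 on outcomes that occur half the time) would score an ECE of 0.4 on this definition, which is the failure mode calibration guards against.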

Production Results

We deployed VGAC on a production Amazon EKS cluster with heterogeneous GPUs (T4 and A10G), collecting 582 job lifecycle records for the initial evaluation:

Metric                       Value
AUC-ROC                      0.756
Expected Calibration Error   0.077
Prediction latency (p95)     < 10 ms
Top feature importance       pending_ratio: 0.801

The dominant feature — pending_ratio at submission time — accounts for 80.1% of feature importance. This is a key finding: minimal observability metrics enable reliable predictions. You don't need complex workload characterization or historical job profiling. The queue state at submission time captures most of the signal.

Surfacing Predictions: Three Integration Points

Predictions are useless if they don't reach users at the right time. VGAC surfaces predictions through three channels:

  • REST API — GET /predict?gpu_type=a10g&gpu_count=4 returns a calibrated probability and a recommended action. Latency: sub-10ms. Teams integrate this into custom CLIs and job submission scripts.
  • Kubernetes Admission Webhook — For Kubernetes-native clusters, VGAC can intercept pod creation requests and annotate them with predicted wait times. This enables policy-based routing: if predicted wait > 10 minutes, suggest a different GPU type or node pool.
  • Grafana Dashboards — Real-time visualization of queue state, prediction accuracy over time, calibration drift monitoring, and historical wait time distributions. Cluster administrators use this for capacity planning; researchers use it for submission timing.
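To make the REST integration concrete, here is a sketch of the decision logic such an endpoint might sit on top of. The response schema, probability thresholds, and recommendation strings are hypothetical, not VGAC's actual API contract:

```python
def predict_handler(gpu_type: str, gpu_count: int, p_long_wait: float) -> dict:
    """Turn a calibrated probability into the kind of response a
    GET /predict?gpu_type=...&gpu_count=... endpoint could return.
    Thresholds and wording below are illustrative placeholders."""
    if p_long_wait < 0.3:
        recommendation = "submit now"
    elif p_long_wait < 0.7:
        recommendation = "submit, but expect a possible delay"
    else:
        recommendation = f"consider a GPU type other than {gpu_type}, or off-peak hours"
    return {
        "gpu_type": gpu_type,
        "gpu_count": gpu_count,
        "p_long_wait": round(p_long_wait, 3),
        "recommendation": recommendation,
    }

print(predict_handler("a10g", 4, 0.72)["recommendation"])
```

Keeping the handler a pure function of the calibrated probability is what makes sub-10ms responses straightforward: the model inference dominates, not the routing logic.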

Design Principles That Emerged

After iterating through several architectures, three principles crystallized:

1. Calibration over accuracy. User trust depends on reliable probability estimates, not just correct binary predictions. We invested more engineering effort in isotonic calibration and calibration monitoring than in improving the base classifier.

2. Minimal features. The temptation is to add every available signal — job type, user history, GPU memory patterns, historical completion times. In practice, pending_ratio alone captures most of the predictive signal. More features add latency, complexity, and overfitting risk without proportional accuracy gains.

3. Seamless integration. The prediction service is only useful if it's in the user's workflow. REST endpoints for scripting, admission webhooks for automatic routing, Grafana for visual monitoring — meet users where they are.

Ongoing Work: SLO-Based Calibration Drift Monitoring

Cluster workloads change over time: new teams onboard, GPU types are added, usage patterns shift seasonally. VGAC's calibration will degrade unless it's actively monitored and recalibrated.

We're building SLO-based calibration monitoring that treats ECE as a service-level indicator (SLI) with an error budget. When calibration drift exceeds the budget, automated recalibration triggers using the latest job data. This connects to the broader SLI/SLO framework for ML reliability that I've been developing.
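An ECE-as-SLI check might look like the sketch below; the target and budget values are illustrative placeholders, not VGAC's production SLO:

```python
def calibration_slo_check(recent_ece: float,
                          ece_target: float = 0.10,
                          budget_fraction: float = 0.5) -> str:
    """Treat recent ECE as a service-level indicator with an error budget.
    Thresholds are hypothetical examples, not production values.

    Returns one of: "ok", "warn", "recalibrate".
    """
    if recent_ece <= ece_target * budget_fraction:
        return "ok"            # well within budget
    if recent_ece <= ece_target:
        return "warn"          # budget partially burned; watch the drift
    return "recalibrate"       # SLO breached; refit calibrator on recent jobs
```

In practice this check would run against ECE computed over a sliding window of recent jobs, so that seasonal workload shifts trigger recalibration before user-facing probabilities become untrustworthy.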

ISS 2026 Presentation

I'm presenting VGAC at the 2026 Improving Scientific Software Conference (ISS 2026) at NCAR in Boulder, CO on April 9, 2026. The talk covers the full architecture, production deployment lessons, and calibration methodology. The conference committee accepted the talk with travel support, which I'm deeply grateful for — it's an opportunity to share this work with the scientific computing community where GPU cluster scheduling is a daily pain point.

VGAC is open source and live at vgac.cloud.