AI Factory Ops | SCU Hack-a-Stack

Executive Operations Summary

PERFORMANCE · rolling 15m · auto-refresh 5s

Cluster Health

DEGRADED · trending ↓

Active Incidents

SEV-1 · NODE-A2

Placement Efficiency

87%

best-fit decreasing
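
The tile above names best-fit decreasing as the placement strategy. A minimal sketch of that heuristic follows, assuming integer GPU demands per job and identical node capacity. The function names and the efficiency definition (used capacity divided by provisioned capacity) are illustrative assumptions, not the dashboard's actual implementation.

```python
from typing import List


def best_fit_decreasing(demands: List[int], capacity: int) -> List[List[int]]:
    """Pack demands onto nodes of equal capacity with best-fit decreasing."""
    nodes: List[List[int]] = []  # demands placed on each node
    free: List[int] = []         # remaining capacity of each node
    # "Decreasing": place the largest demands first.
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds node capacity {capacity}")
        # "Best fit": pick the node with the least free space that still fits.
        best = None
        for i, space in enumerate(free):
            if space >= demand and (best is None or space < free[best]):
                best = i
        if best is None:
            # No existing node fits, so open a new one.
            nodes.append([demand])
            free.append(capacity - demand)
        else:
            nodes[best].append(demand)
            free[best] -= demand
    return nodes


def placement_efficiency(nodes: List[List[int]], capacity: int) -> float:
    """Fraction of provisioned capacity in use (assumed definition)."""
    if not nodes:
        return 0.0
    used = sum(sum(node) for node in nodes)
    return used / (len(nodes) * capacity)
```

For example, `best_fit_decreasing([4, 8, 1, 4, 2, 1], 10)` fills two nodes completely, so `placement_efficiency` returns `1.0` for that input.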

Operational Risk

HIGH

interconnect r1 degraded

AI SRE Copilot · model: ops-llm-7b · 87% conf

GPU saturation on R1 is driving latency instability. Consider scaling R1 or rerouting 30% of inference cohort B to R3.

live analysis · rolling 5m window · insight 1 / 6

Live Event Stream · 2 events · streaming

[09:51:05] Cluster supervisor online · 9 nodes registered

[09:51:05] Telemetry collectors streaming · 4.2k metrics/s

TRAFFIC · 12.4%

7866 req/s

P95 LATENCY · 41.2%

230.9 ms

ERROR RATE · 58.0%

3.70%

GPU UTIL · 22.7%

85.5%

Telemetry · 6h rolling window

traffic · latency · errors · gpu

Throughput Forecast · arima(2,1,1) · next 15m

Burn Rate · 2.4× SLO

ETA to SLO · ~9m

Saturation · 86%

Headroom · 15%
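
The burn-rate and ETA tiles follow a common SLO pattern, sketched below. The dashboard does not state its SLO target, budget window, or how it derives the ETA, so the function names, parameters, and formulas here are assumptions for illustration only.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate as a multiple of the error budget.

    error_rate: fraction of failed requests, e.g. 0.02 for 2%.
    slo_target: availability objective, e.g. 0.99 (assumed; not shown on the dashboard).
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def minutes_to_exhaustion(budget_remaining: float, burn: float, window_minutes: float) -> float:
    """Minutes until the remaining budget is spent if the burn rate holds.

    budget_remaining: unspent share of the window's error budget, between 0 and 1.
    window_minutes: length of the budget window (hypothetical parameter).
    """
    if burn <= 0:
        return float("inf")
    return budget_remaining * window_minutes / burn
```

A burn rate of 1.0 spends exactly one budget per window, so a reading above 1.0 means the budget runs out early unless the error rate falls.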

Recommended Actions

3 ranked by impact · auto-refresh 30s

P0 · confidence 94%

Scale GPU pool on RACK-R1 by +4 H100 nodes

GPU saturation on R1 (94%) is the dominant driver of P95 latency drift (+41% / 1h). Autoscaler is throttled by zonal capacity quota.

Latency

-67%

Errors

-54%

Risk

HIGH→LOW

P1 · confidence 87%

Reroute inference traffic R1 → R3 (weighted 35%)

R3 capacity headroom 42%. Locality cost is acceptable for cohort B inference. Cuts queue pressure on R1 immediately.

Latency

-28%

Errors

-19%

Risk

HIGH→MED
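
One way to realise the weighted 35% reroute in the P1 action is a deterministic hash split, sketched below. The function `pick_rack`, its parameters, and the choice of hashing a request ID are assumptions. The dashboard does not say how the traffic weighting is applied.

```python
import hashlib


def pick_rack(request_id: str, reroute_weight: float = 0.35,
              primary: str = "R1", secondary: str = "R3") -> str:
    """Send roughly `reroute_weight` of requests to `secondary`, the rest to `primary`.

    Hashing the request ID keeps the choice stable, so retries of the
    same request land on the same rack.
    """
    if not 0.0 <= reroute_weight <= 1.0:
        raise ValueError("reroute_weight must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform over [0, 2**64)
    threshold = int(reroute_weight * 2**64)      # integer cut-off for the secondary rack
    return secondary if bucket < threshold else primary
```

Hashing on a session or tenant key instead of a request ID would keep a whole session on one rack, at the cost of a coarser split.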

P2 · confidence 78%

Investigate 5xx burst on inference-svc/v3

Error-rate cohort isolation indicates 71% of 5xx originate from a single deployment revision shipped 38m ago.

Latency

-8%

Errors

-62%

Risk

MED→LOW
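
The cohort isolation behind the P2 action can be approximated by grouping 5xx responses by deployment revision and reporting the largest share. This is a minimal sketch. The event shape `(revision, status_code)` and the function name are hypothetical, since the dashboard does not describe its log schema.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def dominant_5xx_revision(events: Iterable[Tuple[str, int]]) -> Optional[Tuple[str, float]]:
    """Return (revision, share of all 5xx) for the revision with the most 5xx responses.

    events: (revision, http_status) pairs. Returns None when there are no 5xx.
    """
    counts = Counter(rev for rev, status in events if 500 <= status <= 599)
    total = sum(counts.values())
    if total == 0:
        return None
    revision, n = counts.most_common(1)[0]
    return revision, n / total
```

A share well above the revision's share of overall traffic points at that revision and not at a cluster-wide fault.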

Action Simulator · simulated effects propagate in <1s

stabilization 0% · last action propagated through control plane