AI Factory Ops
SCU Hack-a-Stack · Cisco AI Cluster Management
09:51:05 UTC
Cluster: Online
Executive Operations Summary
PERFORMANCE · rolling 15m · auto-refresh 5s
Cluster Health
73
DEGRADED · trending ↓
Active Incidents
1
SEV-1 · NODE-A2
Placement Efficiency
87%
best-fit decreasing
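
The "best-fit decreasing" label names a classic bin-packing heuristic: sort jobs from largest to smallest, then place each on the node that would have the least free capacity left after placement. The dashboard does not show how the scheduler implements it, so the following is only a minimal sketch; the function name `best_fit_decreasing`, the job names, and the node names are hypothetical.

```python
from typing import Dict, Optional


def best_fit_decreasing(
    jobs: Dict[str, int], node_capacity: Dict[str, int]
) -> Dict[str, Optional[str]]:
    """Assign each job (name -> GPUs needed) to a node (name -> GPUs free).

    Hypothetical illustration only, not the cluster's actual scheduler.
    Jobs that fit nowhere are mapped to None.
    """
    free = dict(node_capacity)  # copy so the caller's dict is not mutated
    placement: Dict[str, Optional[str]] = {}

    # "Decreasing": handle the largest requests first.
    for job, need in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [node for node, cap in free.items() if cap >= need]
        if not candidates:
            placement[job] = None
            continue
        # "Best fit": pick the node with the smallest leftover capacity.
        best = min(candidates, key=lambda node: free[node] - need)
        free[best] -= need
        placement[job] = best

    return placement


if __name__ == "__main__":
    # Illustrative numbers, not taken from the dashboard.
    print(best_fit_decreasing({"a": 4, "b": 3, "c": 2, "d": 1}, {"n1": 5, "n2": 5}))
```

Placing large jobs first tends to leave fewer unusable fragments of capacity, which is one way a placement-efficiency figure like the one above could be driven up.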
Operational Risk
HIGH
interconnect r1 degraded
AI SRE Copilot · model: ops-llm-7b · 87% conf

GPU saturation on R1 is driving latency instability — consider scaling R1 or rerouting 30% of inference cohort B to R3.

live analysis · rolling 5m window · insight 1 / 6
Live Event Stream · 2 events · streaming
[09:51:05] Cluster supervisor online · 9 nodes registered
[09:51:05] Telemetry collectors streaming · 4.2k metrics/s
TRAFFIC 12.4%
7866 req/s
P95 LATENCY 41.2%
230.9 ms
ERROR RATE 58.0%
3.70%
GPU UTIL 22.7%
85.5%
Telemetry · 6h rolling window
traffic · latency · errors · gpu
Throughput Forecast · arima(2,1,1) · next 15m
Burn Rate 2.4× SLO
ETA to SLO ~9m
Saturation 86%
Headroom 15%
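
A burn rate such as the one shown above is conventionally the observed error rate divided by the error budget (1 minus the SLO target). The dashboard does not state its SLO target or budget window, so the values below are illustrative assumptions, and the helper names `burn_rate` and `minutes_to_exhaust` are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is being spent.

    error_rate and slo_target are fractions, e.g. 0.024 and 0.99.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_rate / budget


def minutes_to_exhaust(
    window_minutes: float, rate: float, budget_remaining: float = 1.0
) -> float:
    """Minutes until the remaining budget is gone at a constant burn rate.

    budget_remaining is the unspent fraction of the window's budget (0..1).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_minutes * budget_remaining / rate


if __name__ == "__main__":
    # Assumed example: 2.4% errors against a 99% SLO gives a burn rate near 2.4.
    rate = burn_rate(0.024, 0.99)
    print(round(rate, 2), round(minutes_to_exhaust(60, rate), 1))
```

A burn rate of 1.0 means the budget lasts exactly one window; anything above 1.0 exhausts it early, which is why such a tile is usually paired with a time estimate.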

Recommended Actions

3 ranked by impact · auto-refresh 30s
P0 · confidence 94%

Scale GPU pool on RACK-R1 by +4 H100 nodes

GPU saturation on R1 (94%) is the dominant driver of P95 latency drift (+41% / 1h). Autoscaler is throttled by zonal capacity quota.

Latency
-67%
Errors
-54%
Risk
HIGH→LOW
P1 · confidence 87%

Reroute inference traffic R1 → R3 (weighted 35%)

R3 capacity headroom 42%. Locality cost is acceptable for cohort B inference. Cuts queue pressure on R1 immediately.

Latency
-28%
Errors
-19%
Risk
HIGH→MED
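
A weighted reroute like the R1 → R3 action above can be sketched as a deterministic hash split: each request ID maps to a bucket from 0 to 99, and buckets below the weight go to the secondary rack. How the real control plane applies the weight is not shown on the dashboard; the function `pick_rack` and the request-ID format are hypothetical.

```python
import hashlib


def pick_rack(
    request_id: str,
    reroute_percent: int = 35,
    primary: str = "R1",
    secondary: str = "R3",
) -> str:
    """Route roughly reroute_percent% of requests to the secondary rack.

    Hashing the request ID keeps the decision stable across retries.
    Hypothetical illustration, not the dashboard's routing logic.
    """
    if not 0 <= reroute_percent <= 100:
        raise ValueError("reroute_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return secondary if bucket < reroute_percent else primary


if __name__ == "__main__":
    sample = [pick_rack(f"req-{i}") for i in range(10_000)]
    print(sample.count("R3") / len(sample))  # close to 0.35
```

A hash split avoids per-request randomness, so the same request always lands on the same rack, which keeps retries and caches consistent while the weight is in effect.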
P2 · confidence 78%

Investigate 5xx burst on inference-svc/v3

Error-rate cohort isolation indicates 71% of 5xx originate from a single deployment revision shipped 38m ago.

Latency
-8%
Errors
-62%
Risk
MED→LOW
Action Simulator · simulated effects propagate in <1s
stabilization 0% · last action propagated through control plane