Inject a chaos scenario to start incident response
29 curriculum YAML scenarios · Real GKE cluster · AMD MI300X · DAPO loss + Spaced-Rep Curriculum
| Stage | Resolution | Avg Reward | Cascade | Named Replays | vs Baseline |
|---|---|---|---|---|---|
| Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | baseline |
| AtlasOps SFT | 68% | 0.601 | 62% | 55% | +14pp |
| ⭐ AtlasOps GRPO AMD MI300X | 82% | 0.729 | 78% | 72% | +28pp |

| Scenario | Outcome | MTTR | Score |
|---|---|---|---|
| Cloudflare 2019 (CPU saturation) | ✓ resolved | 102.8s | 0.856 |
| GitHub 2018 (DB failover loop) | escalated | 35.5s | 0.548 |
| sf-001 (OOMKill crash loop) | ⚠ partial | 38.3s | 0.722 |
Reward = 15% triage + 30% diagnosis + 35% remediation + 10% comms + 10% speed. Real tool calls against a real GKE cluster.
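The weighting above can be read as a plain weighted sum. The following is a minimal sketch, not the project's actual scoring code: the function name `episode_reward`, the dictionary keys, and the assumption that each component is scored in the range 0 to 1 are all illustrative.

```python
# Weights taken from the reward formula above; they sum to 1.0.
WEIGHTS = {
    "triage": 0.15,
    "diagnosis": 0.30,
    "remediation": 0.35,
    "comms": 0.10,
    "speed": 0.10,
}


def episode_reward(scores: dict) -> float:
    """Combine per-component scores into one reward (hypothetical helper).

    Assumes every component score lies in [0, 1], so the result does too.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative inputs only; these are not measurements from the tables above.
example = {
    "triage": 1.0,
    "diagnosis": 0.8,
    "remediation": 0.9,
    "comms": 0.5,
    "speed": 0.6,
}
print(round(episode_reward(example), 3))
```

Because remediation and diagnosis together carry 65% of the weight, an episode that triages quickly but fails to fix the fault cannot score well under this scheme.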
The first multi-agent SRE platform trained end-to-end on real cloud failures.
4 specialized AI agents. 20 real SRE tools. 1 AMD MI300X.
This dashboard calls the coordinator configured for the deployment (URL, secrets, cluster access). Chaos injection and Prometheus metrics are live only when that configuration reaches a real cluster; otherwise the UI still runs with an offline local preview timeline. Opening the Space does not by itself activate the GRPO/SFT checkpoints: to use them, load the adapter or merged weights in your inference stack (for example, vLLM on ROCm) as part of the Space image or startup.
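As a rough illustration of what "loading the adapter or merged weights" might look like with vLLM, the sketch below picks between a merged checkpoint and a base model plus LoRA adapter. It makes several assumptions: `WeightPlan`, `plan_weights`, `generate`, the adapter name `"atlasops"`, and any paths or model IDs passed in are hypothetical, and the checkpoint format actually shipped with this project is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WeightPlan:
    model: str                   # what the engine loads as its base model
    adapter_path: Optional[str]  # LoRA adapter applied per request, if any


def plan_weights(base_model: str,
                 adapter_path: Optional[str] = None,
                 merged_path: Optional[str] = None) -> WeightPlan:
    """Prefer merged weights; otherwise use the base model plus an adapter."""
    if merged_path:
        return WeightPlan(model=merged_path, adapter_path=None)
    return WeightPlan(model=base_model, adapter_path=adapter_path)


def generate(plan: WeightPlan, prompt: str) -> str:
    # Imported lazily so plan_weights works on machines without vLLM.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model=plan.model, enable_lora=plan.adapter_path is not None)
    lora = (LoRARequest("atlasops", 1, plan.adapter_path)  # hypothetical name
            if plan.adapter_path else None)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=256),
                           lora_request=lora)
    return outputs[0].outputs[0].text
```

With neither path supplied, the plan falls back to the plain base model, which matches the zero-shot behaviour rather than the trained checkpoints.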