RESOLVED
Chaos Scenarios Ready
Live Cluster Metrics (Grafana)
Error Rate
5xx / total req
CPU Usage
cores (default ns)
Active Alerts
Alertmanager
Request Rate (req/s)
CPU Cores (default ns)
Service Topology
FE frontend · ADS adservice · CART cartservice · CHK checkout · CAT catalog · REC rec · RDB redis · PAY payment · CUR currency · SHP shipping · EML email
GKE cluster online
Prometheus live
AtlasOps ready
Discord webhook …
Inference …
Qwen2.5-7B · AMD MI300X

BENCHMARK RESULTS

29 curriculum YAML scenarios · Real GKE cluster · AMD MI300X · DAPO loss + Spaced-Rep Curriculum

0.856
Best Score
Cloudflare 2019
82%
Resolution Rate
GRPO · +28pp vs zero-shot
58.9s
Avg MTTR
vs ~25 min human
20
Real SRE Tools
kubectl · promql · jaeger · argocd
Training Progression (AMD MI300X · Qwen2.5-7B)
Stage | Resolution | Avg Reward | Cascade | Named Replays | vs Baseline
Qwen2.5-7B zero-shot | 54% | 0.481 | 40% | 30% | baseline
AtlasOps SFT | 68% | 0.601 | 62% | 55% | +14pp
⭐ AtlasOps GRPO AMD MI300X | 82% | 0.729 | 78% | 72% | +28pp
Quick Eval · 3 Demo Scenarios · Live GKE · May 9 2026
Scenario | Outcome | MTTR | Score
Cloudflare 2019 (CPU saturation) | ✓ resolved | 102.8s | 0.856
GitHub 2018 (DB failover loop) | escalated | 35.5s | 0.548
sf-001 (OOMKill crash loop) | ⚠ partial | 38.3s | 0.722

Reward = 15% triage + 30% diagnosis + 35% remediation + 10% comms + 10% speed. Real tool calls against real GKE cluster.
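
The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not the project's scoring code: the component names, the `episode_reward` function, and the assumption that each component score lies in [0, 1] are all hypothetical. Only the five weights come from the line above.

```python
# Weights copied from the reward line above. Everything else in this
# sketch (names, score range, validation) is an assumption.
WEIGHTS = {
    "triage": 0.15,
    "diagnosis": 0.30,
    "remediation": 0.35,
    "comms": 0.10,
    "speed": 0.10,
}


def episode_reward(scores: dict) -> float:
    """Return the weighted sum of per-component scores in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the weights sum to 100%, a run that scores 1.0 on every component gets a reward of 1.0, and remediation alone accounts for just over a third of the total.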

AMD Developer Hackathon 2026

ATLASOPS

The first multi-agent SRE platform trained end-to-end on real cloud failures.
4 specialized AI agents. 20 real SRE tools. 1 AMD MI300X.

HF Spaces & trained weights

This dashboard calls the coordinator configured for the deployment (URL, secrets, cluster access). Chaos injection and Prometheus metrics are live only when that wiring reaches a real cluster; otherwise the UI still runs against an offline local preview timeline. Opening the Space does not by itself activate the GRPO/SFT checkpoints: to use them, load the adapter or merged weights in your inference stack (for example vLLM on ROCm) as part of the Space image or startup.

⚠ The Problem
2:47 AM
Average time a P1 alert fires. Your on-call engineer is asleep.
~25 min
Average human MTTR for a cascade incident. Revenue bleeding the whole time.
$250B
Global observability + SRE market. On-call burnout is a $250B problem.
🤖 The Solution — 4-Agent Pipeline
🔴
TRIAGE
Acks alert
Assigns severity
Maps blast radius
4 tool calls max
🔍
DIAGNOSIS
PromQL queries
Jaeger traces
kubectl logs
Root cause ID
🔧
REMEDIATION
Argo CD rollback
kubectl scale
Alert silence
Prometheus verify
📣
COMMS
Slack update
Postmortem draft
Status page
Action items
☁ Real Infrastructure
GKE Standard: us-central1, 3× e2-standard-4
Online Boutique: 11 microservices (gRPC, Go, Python, Node)
Prometheus + Grafana: Real scraping, 2M+ time series
Jaeger + OTel: Distributed trace collection
Chaos Mesh: PodChaos · NetworkChaos · IOChaos · DNSChaos · TimeChaos · StressChaos
Argo CD: GitOps rollbacks (real execution)
Cloud SQL: Postgres 15, replaces SQLite in cartservice
Alertmanager: Webhook → AtlasOps coordinator
⚡ AMD MI300X Training
192 GB
HBM3 VRAM
5 Qwen models co-hosted simultaneously
Qwen2.5-7B × 4: One per agent role + QLoRA adapters (r=16)
Qwen2.5-72B: LLM judge + adversarial scenario designer
QLoRA SFT: 4-bit NF4 quantization, TRL + PEFT
Online GRPO: DAPO loss, real GKE rollouts per step
Curriculum: Spaced repetition [3, 6, 12, 24, 48 h], mastery decay = 0.85
Dense Rewards: Per-step + episode contract (70/30 blend)
ROCm 7.2: PyTorch + Hugging Face Optimum-AMD
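
One way to read the Curriculum row is as a scheduler that replays each scenario at growing intervals and shrinks a mastery estimate after a failed replay. The sketch below is a guess at that shape, not the project's implementation. Only the interval list and the 0.85 factor come from the row above. The `ScenarioState` fields, the reset-to-first-interval rule on failure, and the choice to apply the decay on failure are assumptions.

```python
from dataclasses import dataclass

INTERVALS_H = [3, 6, 12, 24, 48]  # replay intervals in hours (from the Curriculum row)
MASTERY_DECAY = 0.85              # decay factor (from the Curriculum row)


@dataclass
class ScenarioState:
    stage: int = 0         # index into INTERVALS_H (hypothetical field)
    mastery: float = 0.0   # assumed to lie in [0, 1]
    due_at_h: float = 0.0  # time of the next scheduled replay, in hours


def record_result(state: ScenarioState, now_h: float, resolved: bool, score: float) -> ScenarioState:
    """Update one scenario after a replay and schedule the next one."""
    if resolved:
        # Assumption: success keeps the best score seen and lengthens the interval.
        state.mastery = max(state.mastery, score)
        state.due_at_h = now_h + INTERVALS_H[state.stage]
        state.stage = min(state.stage + 1, len(INTERVALS_H) - 1)
    else:
        # Assumption: failure decays mastery and restarts at the shortest interval.
        state.mastery *= MASTERY_DECAY
        state.stage = 0
        state.due_at_h = now_h + INTERVALS_H[0]
    return state
```

Under these assumptions, a scenario that keeps being resolved is revisited after 3, 6, 12, 24, and then every 48 hours, while a failure pulls it back to the 3-hour slot.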
🛠 20 Real SRE Tools
kubectl: get · describe · logs · top · rollout · scale · exec
promql_query: Live Prometheus HTTP API
promql_query_range: Time-series range queries
jaeger_search: Find traces by service + duration
jaeger_get_trace: Full span chain by trace ID
argocd_rollback: GitOps rollback via REST API
argocd_list_apps: List all deployed applications
alertmanager_silence: Suppress flapping alerts via API
gcloud_logs_read: Cloud Logging structured logs
cloud_monitoring_query: GCP-native metrics API
slack_post_update: Incident channel notification
postmortem_draft: Auto-generated Cloudflare-quality postmortem
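
As an illustration of what a tool like promql_query might send, the sketch below builds a request URL for the Prometheus instant-query endpoint (`/api/v1/query`) with a 5xx / total request ratio like the dashboard's Error Rate panel. The metric name `http_requests_total`, its `code` label, the 5-minute window, and both helper names are assumptions chosen for the example. The actual metrics exposed by the cluster may differ.

```python
from urllib.parse import urlencode


def error_rate_promql(metric: str = "http_requests_total", window: str = "5m") -> str:
    """Build a PromQL ratio of 5xx requests to all requests.

    The metric and label names are hypothetical placeholders.
    """
    errors = f'sum(rate({metric}{{code=~"5.."}}[{window}]))'
    total = f"sum(rate({metric}[{window}]))"
    return f"{errors} / {total}"


def instant_query_url(base_url: str, promql: str) -> str:
    """Return the GET URL for Prometheus' instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

The URL can then be fetched with any HTTP client. Building the query string with `urlencode` keeps the braces, quotes, and brackets in the PromQL expression safely escaped.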
🛡 Safety Architecture — No Agent Can Cause an Outage
🚦
Approval Gate
P0 → human required
P1 → 60s auto window
P2/P3 → automatic
Token-based callbacks
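
The gate rules above could be encoded roughly as follows. This is a sketch under stated assumptions, not the project's code: it reads "60s auto window" as "a P1 action proceeds automatically if no human has decided within 60 seconds", and it reads "token-based callbacks" as "a human decision is accepted only with the one-time token issued for that request". `ApprovalRequest`, `human_callback`, and `may_execute` are hypothetical names.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional

AUTO_APPROVE_WINDOW_S = 60  # P1 window from the card above


@dataclass
class ApprovalRequest:
    severity: str      # "P0", "P1", "P2" or "P3"
    created_at: float  # seconds, any monotonic clock
    token: str = field(default_factory=lambda: secrets.token_urlsafe(16))
    approved: Optional[bool] = None  # None until a human decides


def human_callback(req: ApprovalRequest, token: str, approve: bool) -> bool:
    """Record a human decision; reject callbacks that carry the wrong token."""
    if not secrets.compare_digest(token, req.token):
        return False
    req.approved = approve
    return True


def may_execute(req: ApprovalRequest, now: float) -> bool:
    """Decide whether the remediation action may run at time `now`."""
    if req.approved is not None:       # an explicit human decision always wins
        return req.approved
    if req.severity in ("P2", "P3"):   # automatic
        return True
    if req.severity == "P1":           # auto-approve once the window elapses
        return now - req.created_at >= AUTO_APPROVE_WINDOW_S
    return False                       # P0 (or unknown severity) waits for a human
```

Treating unknown severities like P0 is a deliberate fail-closed choice in this sketch.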
Circuit Breaker
50 tool calls/incident
10 mutations/hour
3 consecutive failures → OPEN
Auto-reset on recovery
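
A minimal sketch of a breaker with these limits is shown below. The class and method names are hypothetical. The sketch assumes that one breaker instance exists per incident, that the mutation limit is a sliding one-hour window, and that "auto-reset on recovery" means some external health check calls `reset_on_recovery`.

```python
from collections import deque


class CircuitBreaker:
    MAX_CALLS_PER_INCIDENT = 50   # limits taken from the card above
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self) -> None:
        self.calls = 0
        self.mutation_times = deque()  # timestamps (s) of recent mutating calls
        self.consecutive_failures = 0
        self.is_open = False

    def allow(self, now: float, mutating: bool = False) -> bool:
        """Return True and count the call if it is within all limits."""
        if self.is_open or self.calls >= self.MAX_CALLS_PER_INCIDENT:
            return False
        # Drop mutations that fell out of the one-hour sliding window.
        while self.mutation_times and now - self.mutation_times[0] >= 3600:
            self.mutation_times.popleft()
        if mutating and len(self.mutation_times) >= self.MAX_MUTATIONS_PER_HOUR:
            return False
        self.calls += 1
        if mutating:
            self.mutation_times.append(now)
        return True

    def record(self, success: bool) -> None:
        """Track outcomes; open the breaker after three failures in a row."""
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.is_open = True

    def reset_on_recovery(self) -> None:
        """Close the breaker once the system is observed healthy again."""
        self.is_open = False
        self.consecutive_failures = 0
```

Denied calls are not counted against the per-incident budget here, so an agent that is briefly rate-limited does not burn its remaining calls.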
🔗
Incident Correlator
5-min dedup window
Prevents alert storms
Fingerprint matching
Active incident tracking
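
Fingerprint-based deduplication with a 5-minute window might look like the sketch below. The label set used for the fingerprint, the `INC-n` ID format, and the choice to measure the window from the moment the incident was opened are assumptions made for illustration. The card above specifies only the window length and the use of fingerprints.

```python
import hashlib

DEDUP_WINDOW_S = 300  # 5-minute window from the card above


def fingerprint(labels: dict) -> str:
    """Hash the alert labels in sorted order so key order does not matter."""
    canonical = "|".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IncidentCorrelator:
    def __init__(self) -> None:
        self._active = {}  # fingerprint -> (incident_id, opened_at)
        self._next_id = 1

    def ingest(self, labels: dict, now: float):
        """Return (incident_id, is_new) for an incoming alert."""
        fp = fingerprint(labels)
        hit = self._active.get(fp)
        if hit is not None and now - hit[1] < DEDUP_WINDOW_S:
            return hit[0], False  # duplicate: attach to the active incident
        incident_id = f"INC-{self._next_id}"
        self._next_id += 1
        self._active[fp] = (incident_id, now)
        return incident_id, True
```

With this shape, a burst of identical alerts during a storm maps to a single incident, and only the first one would trigger the agent chain.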
📋
HMAC Audit Log
Hash-chained entries
Tamper-evident
Every tool call logged
Cryptographic integrity
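
A hash-chained HMAC log can be sketched with the Python standard library alone. Each entry's MAC covers the previous entry's MAC plus the new payload, so editing, reordering, or deleting an entry breaks verification from that point on. The entry layout, the all-zero genesis value, and the function names are assumptions for this example, not the project's log format.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry (assumption)


def _mac(key: bytes, prev_mac: str, payload: dict) -> str:
    """HMAC-SHA256 over the previous MAC and a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, prev_mac.encode("ascii") + body, hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, payload: dict) -> dict:
    """Append a tool-call record that is chained to the previous entry."""
    prev = log[-1]["mac"] if log else GENESIS
    entry = {"payload": payload, "prev": prev, "mac": _mac(key, prev, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list, key: bytes) -> bool:
    """Recompute every MAC in order; any mismatch means tampering."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        expected = _mac(key, prev, entry["payload"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

Note that a chain like this detects truncation from the front or middle, but detecting removal of the most recent entries would also require storing the latest MAC somewhere outside the log.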
📖 31 Scenarios — Including 10 Named Historical Replays
🌩
Cloudflare 2019
Regex CPU storm · 85% traffic down
🔄
GitHub 2018
DB failover loop · 24h incident
Discord 2022
Redis thundering herd
🌐
Datadog 2023
systemd-resolved failure
💬
Slack 2022
HTTP/2 stream exhaustion
📦
AWS S3 2017
Typo in a capacity-removal command
🌍
Azure DNS 2019
Stale DNS cache poisoning
🔌
Fastly 2021
Bad VCL/Envoy filter
📡
Facebook BGP 2021
Control plane partition
💰
Knight Capital 2012
Partial deploy mismatch
🐙
GitHub Repository
Harikishanth/AtlasOps · MIT License
📊
Live Grafana
Real GKE metrics · us-central1
🛍
Online Boutique
Live target application · GKE
Built by Hari Kishanth · Team Da Big Three · St. Joseph's College of Engineering, Chennai
AMD Developer Hackathon 2026 · lablab.ai · ⚡ Powered by AMD MI300X 192GB HBM3