Chaos Scenarios Ready

Historical Replays

Single Fault

Cascade

Multi-fault

Adversarial

Live Cluster Metrics (Grafana)

Error Rate: 5xx / total req

CPU Usage: cores (default ns)

Active Alerts: Alertmanager

Request Rate (req/s)

CPU Cores (default ns)

Service Topology, MTTR, Agent Chain, and the #incident-response feed stay empty until a chaos scenario is injected; the comms agent then posts incident updates to that feed.

GKE cluster online

Prometheus live

AtlasOps ready

Discord webhook …

Inference …

Qwen2.5-7B · AMD MI300X

BENCHMARK RESULTS

29 curriculum YAML scenarios · Real GKE cluster · AMD MI300X · DAPO loss + Spaced-Rep Curriculum

0.856

Best Score

Cloudflare 2019

82%

Resolution Rate

GRPO · +28pp vs zero-shot

58.9s

Avg MTTR

vs ~25 min human

Real SRE Tools

kubectl · promql · jaeger · argocd

Training Progression (AMD MI300X · Qwen2.5-7B)

Stage	Resolution	Avg Reward	Cascade	Named Replays	vs Baseline
Qwen2.5-7B zero-shot	54%	0.481	40%	30%	baseline
AtlasOps SFT	68%	0.601	62%	55%	+14pp
⭐ AtlasOps GRPO AMD MI300X	82%	0.729	78%	72%	+28pp

Quick Eval · 3 Demo Scenarios · Live GKE · May 9 2026

Scenario	Outcome	MTTR	Score
Cloudflare 2019 (CPU saturation)	✓ resolved	102.8s	0.856
GitHub 2018 (DB failover loop)	escalated	35.5s	0.548
sf-001 (OOMKill crash loop)	⚠ partial	38.3s	0.722

Reward = 15% triage + 30% diagnosis + 35% remediation + 10% comms + 10% speed. Real tool calls against real GKE cluster.
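
The reward weights above can be read as a plain weighted sum. The following is a minimal sketch, assuming each component score is already normalized to the range 0 to 1; the function name, the dictionary keys, and the clamping step are illustrative assumptions, not the project's actual code.

```python
# Weights taken from the reward line above; they sum to 1.0.
WEIGHTS = {
    "triage": 0.15,
    "diagnosis": 0.30,
    "remediation": 0.35,
    "comms": 0.10,
    "speed": 0.10,
}


def episode_reward(scores: dict) -> float:
    """Weighted sum of per-component scores (hypothetical helper).

    Assumption: every component score lies in [0, 1]; values outside
    that range are clamped before weighting.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(
        weight * min(max(scores[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
```

Under this reading, a run that scores perfectly on triage and comms but fails everything else would earn 0.25, which shows how heavily the blend leans on diagnosis and remediation.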

AMD Developer Hackathon 2026

ATLASOPS

The first multi-agent SRE platform trained end-to-end on real cloud failures.
4 specialized AI agents. 20 real SRE tools. 1 AMD MI300X.

HF Spaces & trained weights

This dashboard calls the coordinator configured for the deployment (URL, secrets, cluster access). Chaos injection and Prometheus metrics are live only when that wiring reaches a real cluster; otherwise the UI falls back to a local, offline preview timeline. Opening the Space does not by itself activate the GRPO/SFT checkpoints: to use them, load the adapter or merged weights in your inference stack (for example, vLLM on ROCm) as part of the Space image or at startup.

⚠ The Problem

2:47 AM

Average time a P1 alert fires. Your on-call engineer is asleep.

~25 min

Average human MTTR for a cascade incident. Revenue bleeds the whole time.

$250B

Global observability + SRE market. On-call burnout is a $250B problem.

🤖 The Solution — 4-Agent Pipeline

🔴

TRIAGE

Acks alert
Assigns severity
Maps blast radius
4 tool calls max

🔍

DIAGNOSIS

PromQL queries
Jaeger traces
kubectl logs
Root cause ID

🔧

REMEDIATION

Argo CD rollback
kubectl scale
Alert silence
Prometheus verify

📣

COMMS

Slack update
Postmortem draft
Status page
Action items

☁ Real Infrastructure

▸ GKE Standard: us-central1, 3× e2-standard-4

▸ Online Boutique: 11 microservices (gRPC, Go, Python, Node)

▸ Prometheus + Grafana: Real scraping, 2M+ time series

▸ Jaeger + OTel: Distributed trace collection

▸ Chaos Mesh: PodChaos · NetworkChaos · IOChaos · DNSChaos · TimeChaos · StressChaos

▸ Argo CD: GitOps rollbacks, real execution

▸ Cloud SQL: Postgres 15, replaces SQLite in cartservice

▸ Alertmanager: Webhook → AtlasOps coordinator

⚡ AMD MI300X Training

192 GB

HBM3 VRAM

5 Qwen models co-hosted

▸ Qwen2.5-7B × 4: One per agent role, each with a QLoRA adapter (LoRA rank r=16)

▸ Qwen2.5-72B: LLM judge + adversarial scenario designer

▸ QLoRA SFT: 4-bit NF4 quantization, TRL + PEFT

▸ Online GRPO: DAPO loss, real GKE rollouts per step

▸ Curriculum: Spaced repetition [3, 6, 12, 24, 48 h], mastery decay = 0.85

▸ Dense Rewards: Per-step + episode contract (70/30 blend)

▸ ROCm 7.2: PyTorch + Hugging Face Optimum-AMD
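
The curriculum entry above gives only the review intervals (3, 6, 12, 24, 48 hours) and a decay factor of 0.85; how the two combine is not stated. The sketch below shows one plausible spaced-repetition scheduler under loudly assumed rules: mastery is tracked as an exponential moving average of rewards with 0.85 as the retained weight, and a reward of 0.5 or more counts as a pass. `ScenarioState`, `after_rollout`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass

INTERVALS_H = (3, 6, 12, 24, 48)  # review ladder from the page, in hours
MASTERY_DECAY = 0.85              # decay factor from the page
PASS_THRESHOLD = 0.5              # assumption: not stated in the source


@dataclass
class ScenarioState:
    level: int = 0         # index into INTERVALS_H
    mastery: float = 0.0   # 0.0 (unseen) .. 1.0 (mastered)
    due_at_h: float = 0.0  # when the scenario should be replayed next


def after_rollout(state: ScenarioState, now_h: float, reward: float) -> ScenarioState:
    """Reschedule one scenario after a rollout (assumed rule, not the authors')."""
    # Assumption: mastery is an exponential moving average of rewards,
    # keeping MASTERY_DECAY of the previous value on every update.
    mastery = MASTERY_DECAY * state.mastery + (1 - MASTERY_DECAY) * reward
    if reward >= PASS_THRESHOLD:
        # Pass: wait the current interval, then climb one rung of the ladder.
        interval = INTERVALS_H[state.level]
        level = min(state.level + 1, len(INTERVALS_H) - 1)
    else:
        # Fail: drop back to the shortest interval.
        interval = INTERVALS_H[0]
        level = 0
    return ScenarioState(level=level, mastery=mastery, due_at_h=now_h + interval)
```

With these rules a scenario that keeps passing is replayed after 3, 6, 12, 24, and then every 48 hours, while a single failure sends it back to the 3-hour rung.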

🛠 20 Real SRE Tools

⬡ kubectl: get · describe · logs · top · rollout · scale · exec

⬡ promql_query: Live Prometheus HTTP API

⬡ promql_query_range: Time-series range queries

⬡ jaeger_search: Find traces by service + duration

⬡ jaeger_get_trace: Full span chain by trace ID

⬡ argocd_rollback: GitOps rollback via REST API

⬡ argocd_list_apps: List all deployed applications

⬡ alertmanager_silence: Suppress flapping alerts via API

⬡ gcloud_logs_read: Cloud Logging structured logs

⬡ cloud_monitoring_query: GCP-native metrics API

⬡ slack_post_update: Incident channel notification

⬡ postmortem_draft: Auto-generated Cloudflare-quality postmortem
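
To make the promql_query tool more concrete, here is a minimal sketch of how an instant query against the Prometheus HTTP API (`/api/v1/query`) could be built and its JSON response parsed. It performs no network I/O. The metric name `http_requests_total`, its `code` label, and both helper names are assumptions for illustration; the real tool's signature is not shown on this page.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Hypothetical error-rate expression (5xx over total requests).
# The metric and label names are assumptions; use whatever the cluster exports.
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_sample(response_body: str) -> Optional[float]:
    """Return the first sample of an instant-vector response, or None if empty."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    result = payload["data"]["result"]
    if not result:
        return None
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])
```

The URL can be fetched with any HTTP client; keeping URL construction and response parsing separate from the transport makes both easy to test offline.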

🛡 Safety Architecture — No Agent Can Cause an Outage

🚦

Approval Gate

P0 → human required
P1 → 60s auto window
P2/P3 → automatic
Token-based callbacks
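
The approval rules above map naturally onto a small policy function. This is a sketch under stated assumptions: the "60s auto window" for P1 is read as "auto-approve after 60 seconds unless a human responds first", and the callback token is a random one-time string. `ApprovalRequest`, `request_approval`, and `callback_is_valid` are hypothetical names.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApprovalRequest:
    severity: str
    needs_human: bool
    auto_approve_after_s: Optional[int]  # None means never auto-approve
    token: str                           # one-time token the callback must echo


def request_approval(severity: str) -> ApprovalRequest:
    """Translate an incident severity into an approval requirement."""
    token = secrets.token_urlsafe(16)
    if severity == "P0":
        return ApprovalRequest(severity, True, None, token)
    if severity == "P1":
        # Assumed reading: proceed after 60 s unless a human intervenes.
        return ApprovalRequest(severity, False, 60, token)
    if severity in ("P2", "P3"):
        return ApprovalRequest(severity, False, 0, token)
    raise ValueError(f"unknown severity: {severity!r}")


def callback_is_valid(request: ApprovalRequest, presented_token: str) -> bool:
    """Constant-time comparison of the token presented by an approval callback."""
    return secrets.compare_digest(request.token, presented_token)
```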

⚡

Circuit Breaker

50 tool calls/incident
10 mutations/hour
3 consecutive failures → OPEN
Auto-reset on recovery
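
The limits above can be sketched as a small per-incident breaker. The numbers come from the list; everything else is an assumption, including the class and method names, the choice to track the hourly mutation budget on the same object, and the idea that "recovery" is reported by an external health check through `record_recovery`.

```python
import time
from collections import deque


class CircuitBreaker:
    MAX_CALLS_PER_INCIDENT = 50
    MAX_MUTATIONS_PER_HOUR = 10
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable for tests
        self._calls = 0
        self._mutations = deque()    # timestamps of recent mutating calls
        self._failures = 0
        self.is_open = False

    def allow(self, mutating: bool) -> bool:
        """Return True and consume budget if the tool call may proceed."""
        if self.is_open or self._calls >= self.MAX_CALLS_PER_INCIDENT:
            return False
        now = self._clock()
        # Forget mutations that are an hour old or more.
        while self._mutations and now - self._mutations[0] >= 3600:
            self._mutations.popleft()
        if mutating and len(self._mutations) >= self.MAX_MUTATIONS_PER_HOUR:
            return False
        self._calls += 1
        if mutating:
            self._mutations.append(now)
        return True

    def record(self, success: bool) -> None:
        """Record a tool-call outcome; three failures in a row open the breaker."""
        self._failures = 0 if success else self._failures + 1
        if self._failures >= self.MAX_CONSECUTIVE_FAILURES:
            self.is_open = True

    def record_recovery(self) -> None:
        """Assumed hook: an external health check reports recovery."""
        self._failures = 0
        self.is_open = False
```

Denied calls do not consume budget in this sketch, so a blocked agent cannot exhaust the incident's allowance simply by retrying.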

🔗

Incident Correlator

5-min dedup window
Prevents alert storms
Fingerprint matching
Active incident tracking
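
A deduplication window with fingerprint matching could look like the following sketch. It assumes the fingerprint is a SHA-256 hash over the alert's sorted labels and that the 5-minute window slides forward each time a duplicate arrives; the page does not specify either detail, and `fingerprint` and `Correlator` are hypothetical names.

```python
import hashlib
import json

DEDUP_WINDOW_S = 5 * 60  # 5-minute window from the list above


def fingerprint(labels: dict) -> str:
    """Stable hash of an alert's labels (assumed fingerprint scheme)."""
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Correlator:
    def __init__(self):
        self._last_seen = {}  # fingerprint -> time the alert was last seen

    def is_duplicate(self, labels: dict, now_s: float) -> bool:
        """True if the same fingerprint was seen within the dedup window."""
        fp = fingerprint(labels)
        last = self._last_seen.get(fp)
        # Assumption: refreshing on every sighting makes the window sliding,
        # so a continuous alert storm maps onto a single incident.
        self._last_seen[fp] = now_s
        return last is not None and now_s - last < DEDUP_WINDOW_S
```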

📋

HMAC Audit Log

Hash-chained entries
Tamper-evident
Every tool call logged
Cryptographic integrity
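
A hash-chained HMAC log makes tampering evident because each entry's MAC covers the previous entry's MAC. The sketch below illustrates the general technique with Python's standard `hmac` and `hashlib` modules; the entry layout, the genesis value, and the function names are assumptions rather than the project's actual format.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder for the entry before the first one


def _mac(key: bytes, prev_mac: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, (prev_mac + payload).encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, event: dict) -> dict:
    """Append an event whose MAC also covers the previous entry's MAC."""
    prev_mac = log[-1]["mac"] if log else GENESIS
    entry = {"prev": prev_mac, "event": event, "mac": _mac(key, prev_mac, event)}
    log.append(entry)
    return entry


def verify_chain(log: list, key: bytes) -> bool:
    """Recompute every MAC in order; any edit, insert, or removal breaks the chain."""
    prev_mac = GENESIS
    for entry in log:
        expected = _mac(key, prev_mac, entry["event"])
        if entry["prev"] != prev_mac or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev_mac = entry["mac"]
    return True
```

Because verification needs the secret key, an attacker who can write to the log but does not hold the key cannot forge a consistent chain.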

📖 31 Scenarios — Including 10 Named Historical Replays

🌩

Cloudflare 2019

Regex CPU storm · 85% traffic down

🔄

GitHub 2018

DB failover loop · 24h incident

⚡

Discord 2022

Redis thundering herd

🌐

Datadog 2023

systemd-resolved failure

💬

Slack 2022

HTTP/2 stream exhaustion

📦

AWS S3 2017

Typo capacity command

🌍

Azure DNS 2019

Stale DNS cache poisoning

🔌

Fastly 2021

Bad VCL/Envoy filter

📡

Facebook BGP 2021

Control plane partition

💰

Knight Capital 2012

Partial deploy mismatch

🐙

GitHub Repository

Harikishanth/AtlasOps · MIT License

📊

Live Grafana

Real GKE metrics · us-central1

🛍

Online Boutique

Live target application · GKE

Built by Hari Kishanth · Team Da Big Three · St. Joseph's College of Engineering, Chennai

AMD Developer Hackathon 2026 · lablab.ai · ⚡ Powered by AMD MI300X 192GB HBM3