Autonomous incident response on AWS

The SRE that investigates, validates, and calls you on the phone.

8 AI agents on AWS Strands Agents SDK. Reasoning by Amazon Nova. Voice calls via Amazon Connect + Nova Sonic. Storage on DynamoDB + S3. No LangChain. No OpenAI. No ElevenLabs. Pure AWS, end to end.

The problem

Incident at 3 AM.
45 minutes just to understand what broke.

Pull logs. Check Kubernetes. Query Prometheus. Read recent commits. Page someone. Repeat. By the time a human understands the root cause, users have already noticed.

NovaOps

8 AI agents investigate.
A jury validates.
Your phone rings.

Under 60 seconds from alert to proposed fix. The AI never auto-executes critical actions — it explains the situation on a phone call and waits for your verbal "go ahead."

Why it matters

Everyone else stitches 6+ vendors together.
NovaOps uses one cloud.

Typical AI agent stack

Framework: LangChain / LangGraph / CrewAI / Letta
LLMs: OpenAI / Claude / Kimi / Minimax
Voice AI: ElevenLabs / Sarvam AI / Deepgram
Telephony: Twilio / Vonage / Plivo
Storage: MongoDB / Postgres / Pinecone
Compute: Vercel / Railway / Heroku

6+ vendors. 6+ API keys. 6+ bills. Fragile glue code everywhere.

NovaOps — 100% AWS

Framework: AWS Strands Agents SDK
LLMs: Amazon Nova 2 Lite via Amazon Bedrock
Voice AI: Amazon Nova Sonic via Amazon Bedrock
Telephony: Amazon Connect + Amazon Lex V2 + AWS Lambda
Storage: Amazon DynamoDB + Amazon S3
Compute: AWS Lambda + Amazon ECS / EC2

One cloud. One IAM. One bill. Zero vendor stitching.

Architecture

Three stages. Two independent pipelines. One governance gate.

The Jury never sees the War Room's reasoning. If they disagree, execution is blocked.

01. War Room

Deep sequential investigation

  • Triage Agent classifies domain & severity
  • 4 analysts run in parallel — logs, metrics, K8s, GitHub
  • Root Cause Reasoner synthesizes findings
  • Critic Agent adversarially reviews (max 3 loops)
  • Remediation Planner proposes action
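
The War Room flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's Strands Agents graph: every function here (`run_analyst`, `reason`, `critic_accepts`, `war_room`) is a hypothetical stand-in for an LLM-backed agent, and only the four analyst sources and the three-loop critic cap come from the list above.

```python
import asyncio
from dataclasses import dataclass

MAX_CRITIC_LOOPS = 3  # from the text: "max 3 loops"
ANALYSTS = ("logs", "metrics", "k8s", "github")  # the four analyst sources


@dataclass
class Hypothesis:
    root_cause: str
    revision: int = 0


async def run_analyst(name: str, alert: dict) -> dict:
    # Hypothetical stand-in: a real analyst would query its data source via an LLM tool.
    await asyncio.sleep(0)
    return {"analyst": name, "finding": f"{name} evidence for {alert['service']}"}


def reason(findings: list, revision: int) -> Hypothesis:
    # Hypothetical stand-in for the Root Cause Reasoner.
    return Hypothesis(root_cause=f"derived from {len(findings)} findings", revision=revision)


def critic_accepts(hypothesis: Hypothesis) -> bool:
    # Hypothetical stand-in for the Critic: rejects the first draft, accepts any revision.
    return hypothesis.revision >= 1


async def war_room(alert: dict) -> dict:
    triage = {"domain": "oom", "severity": "P1"}  # placeholder triage result
    # The four analysts run concurrently.
    findings = list(await asyncio.gather(*(run_analyst(a, alert) for a in ANALYSTS)))
    hypothesis = reason(findings, revision=0)
    loops = 0
    for loops in range(1, MAX_CRITIC_LOOPS + 1):
        if critic_accepts(hypothesis):
            break
        hypothesis = reason(findings, revision=hypothesis.revision + 1)
    return {
        "triage": triage,
        "findings": findings,
        "hypothesis": hypothesis,
        "critic_loops": loops,
        "action": "restart_pods",  # placeholder remediation proposal
    }
```

Running `asyncio.run(war_room({"service": "checkout-service"}))` returns the triage result, the four findings, the revised hypothesis, and the number of critic passes used.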

02. Blind Jury

Independent validation panel

  • 4 specialist jurors receive only raw context
  • Parallel deliberation with timeout isolation
  • Judge LLM synthesizes verdicts
  • 6-check Escalation Gate catches edge cases
  • No War Room influence — prevents groupthink
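
One way to picture "parallel deliberation with timeout isolation" is the sketch below. It assumes each juror is an independent coroutine and that a slow juror simply abstains. The juror names, the timeout value, and the majority-vote judge are illustrative assumptions: the page describes a Judge LLM, which a simple vote count replaces here.

```python
import asyncio
from collections import Counter

JUROR_TIMEOUT_S = 0.05  # hypothetical per-juror budget


async def juror(name: str, raw_context: dict, delay: float, vote: str) -> dict:
    # Hypothetical juror: sees only raw context, never the War Room's reasoning.
    await asyncio.sleep(delay)
    return {"juror": name, "action": vote}


async def ask_with_timeout(coro, name: str) -> dict:
    # Timeout isolation: one slow juror cannot stall the rest of the panel.
    try:
        return await asyncio.wait_for(coro, timeout=JUROR_TIMEOUT_S)
    except asyncio.TimeoutError:
        return {"juror": name, "action": None}  # counted as an abstention


async def blind_jury(raw_context: dict) -> dict:
    # (name, simulated latency in seconds, simulated vote); all values are made up.
    panel = [
        ("juror_a", 0.0, "restart_pods"),
        ("juror_b", 0.0, "restart_pods"),
        ("juror_c", 0.0, "scale_up"),
        ("juror_d", 1.0, "rollback"),  # exceeds the timeout and abstains
    ]
    verdicts = await asyncio.gather(
        *(ask_with_timeout(juror(n, raw_context, d, v), n) for n, d, v in panel)
    )
    votes = Counter(v["action"] for v in verdicts if v["action"] is not None)
    abstained = [v["juror"] for v in verdicts if v["action"] is None]
    if not votes:
        return {"action": None, "votes": 0, "abstained": abstained}
    action, count = votes.most_common(1)[0]  # majority vote stands in for the Judge LLM
    return {"action": action, "votes": count, "abstained": abstained}
```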

03. Convergence & Governance

Final decision layer

  • Compares War Room vs Jury proposed actions
  • Agreement: confidence +0.15
  • Disagreement: confidence −0.30, human approval is forced
  • Policy engine scores risk 0–100
  • Critical incidents trigger outbound voice call
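
The confidence adjustment above amounts to a small piece of arithmetic. In this sketch the +0.15 and −0.30 deltas come from the list; the function name `converge` and the clamping to the range 0 to 1 are assumptions.

```python
def converge(war_room_action: str, jury_action: str, confidence: float) -> dict:
    """Compare the two independent proposals and adjust confidence."""
    agree = war_room_action == jury_action
    delta = 0.15 if agree else -0.30  # deltas as stated in the text
    # Assumption: confidence is kept within [0, 1].
    adjusted = min(1.0, max(0.0, confidence + delta))
    return {
        "agree": agree,
        "confidence": round(adjusted, 2),
        "forced_approval": not agree,  # disagreement always forces human approval
    }
```

For example, a War Room confidence of 0.70 becomes 0.85 when the Jury agrees and 0.40 when it does not.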
New feature

Your phone rings.
Nova explains the incident.
You say "go ahead."

When a P1 incident hits with a risk score of 85 or higher, NovaOps doesn't just send a Slack message. It places a real outbound phone call via Amazon Connect, powered by a real-time Nova AI conversation.

1. Escalation policy triggers: Severity P1 + risk ≥ 85
2. Amazon Connect dials on-call: Real outbound phone call
3. Nova AI explains the situation: Real-time conversation via Lex + Lambda + Bedrock
4. "Yes, go ahead": Verbal approval triggers remediation through governance gate

If the call fails? Automatic Slack critical escalation. No incident is ever dropped.
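
The escalation rule and its Slack fallback can be sketched as follows. The threshold (P1 and risk ≥ 85) is taken from the steps above; `place_call` and `notify_slack` are hypothetical injected callables. In a real deployment `place_call` would presumably wrap the Amazon Connect outbound call, and how below-threshold incidents are routed is not described here, so the sketch simply reports them as "standard".

```python
from typing import Callable

RISK_THRESHOLD = 85  # from the text: Severity P1 + risk >= 85


def should_call(severity: str, risk: int) -> bool:
    return severity == "P1" and risk >= RISK_THRESHOLD


def escalate(
    incident: dict,
    place_call: Callable[[dict], bool],     # hypothetical: returns True if the call connected
    notify_slack: Callable[[dict], None],   # hypothetical: posts a critical Slack escalation
) -> str:
    if not should_call(incident["severity"], incident["risk"]):
        return "standard"  # assumption: routing below the threshold is out of scope here
    try:
        if place_call(incident):
            return "voice"
    except Exception:
        pass  # any call failure falls through to Slack
    notify_slack(incident)  # fallback so the incident is not dropped
    return "slack_critical"
```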

"Hey, this is NovaOps. Checkout-service just hit a P1 OOM event. Heap usage is at 97%. We're recommending an immediate pod restart. Want me to go ahead?"

"Yes, do it."

"Approved. Restarting pods now. I'll notify the team. Goodbye."

Capabilities

Everything an SRE team needs. Fully autonomous.

Ghost Mode

AI proposes, humans approve. Every remediation requires explicit consent — verbal on a call, a Slack button, or a dashboard click. The AI never auto-executes critical actions. Full append-only audit trail for compliance.

6 Failure Domains

OOM, traffic surge, deadlock, config drift, dependency failure, cascading failure. Domain-specific playbooks with targeted investigation strategies.

Risk Scoring

Policy engine computes risk 0–100 from action weight + severity + confidence. Rollbacks always need approval. P1 always needs approval.
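
As a rough illustration of how such a score could be assembled, the sketch below sums an action weight, a severity weight, and an uncertainty penalty, then clamps the result to 0–100. All weights, the 30-point uncertainty penalty, and the approval threshold are made-up values for illustration; only the three inputs, the 0–100 range, and the two "always needs approval" rules come from the text.

```python
# Hypothetical weights; the real policy engine's values are not published here.
ACTION_WEIGHT = {"scale_up": 20, "restart_pods": 35, "rollback": 50}
SEVERITY_WEIGHT = {"P3": 10, "P2": 25, "P1": 45}
UNCERTAINTY_POINTS = 30   # assumption: low confidence adds up to 30 points
APPROVAL_THRESHOLD = 50   # assumption: scores at or above this need approval


def risk_score(action: str, severity: str, confidence: float) -> int:
    """Combine action weight, severity, and confidence into a 0-100 score."""
    uncertainty = (1.0 - confidence) * UNCERTAINTY_POINTS
    raw = ACTION_WEIGHT[action] + SEVERITY_WEIGHT[severity] + uncertainty
    return max(0, min(100, round(raw)))


def needs_approval(action: str, severity: str, confidence: float) -> bool:
    # Hard rules from the text: rollbacks and P1 incidents always need approval.
    if action == "rollback" or severity == "P1":
        return True
    return risk_score(action, severity, confidence) >= APPROVAL_THRESHOLD
```

With these illustrative weights, a P1 pod restart at 0.8 confidence scores 86, which would cross the voice-call threshold described earlier.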

Post-Incident Reports

Auto-generated PIR with root cause analysis, timeline, and remediation steps. PDF exported to S3. Nobody writes post-mortems manually.

123 Tests. Full Coverage.

Unit tests for every module. Integration tests for the full critical path. E2E tests simulating real P1 OOM incidents through escalation, voice call, and Slack fallback. Mock mode works without any AWS credentials.

Built entirely on AWS

Every layer is AWS. Here's the full stack.

AI & Agents

  • AWS Strands Agents SDK: multi-agent orchestration framework (War Room graph, tool use, parallel dispatch)
  • Amazon Nova 2 Lite: LLM inference for all 8 agents, jury, critic, and governance reasoning
  • Amazon Nova Sonic: real-time voice AI for conversational phone calls with on-call engineers
  • Amazon Bedrock: managed foundation model platform, a single API for all Nova models

Voice & Telephony

  • Amazon Connect: outbound phone calls to on-call engineers via StartOutboundVoiceContact
  • Amazon Lex V2: speech-to-text transcription and conversation turn management
  • Amazon Polly: text-to-speech for incident briefing playback in Contact Flow
  • AWS Lambda: serverless Lex fulfillment that bridges speech to Nova real-time conversation
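
To make the Lex fulfillment step more concrete, here is a minimal sketch of a Lambda handler that follows the Lex V2 event and response shape (`inputTranscript`, `sessionState`, `dialogAction`, `messages`). The approval phrases, the `approval` session attribute, and the canned replies are assumptions; in NovaOps the reply would presumably come from a Nova model through Bedrock rather than a fixed string.

```python
# Naive phrase matching for illustration only; a real handler would need
# more careful intent detection than substring checks.
APPROVAL_PHRASES = ("yes", "go ahead", "do it", "approved")


def lambda_handler(event, context):
    session = event.get("sessionState") or {}
    attrs = dict(session.get("sessionAttributes") or {})
    intent = dict(session.get("intent") or {"name": "FallbackIntent"})
    transcript = (event.get("inputTranscript") or "").lower()

    if any(phrase in transcript for phrase in APPROVAL_PHRASES):
        attrs["approval"] = "approved"  # hypothetical attribute read by the caller
        intent["state"] = "Fulfilled"
        return {
            "sessionState": {
                "sessionAttributes": attrs,
                "dialogAction": {"type": "Close"},  # end the conversation
                "intent": intent,
            },
            "messages": [
                {"contentType": "PlainText", "content": "Approved. Restarting pods now."}
            ],
        }

    # No approval yet: keep the conversation open for another turn.
    return {
        "sessionState": {
            "sessionAttributes": attrs,
            "dialogAction": {"type": "ElicitIntent"},
        },
        "messages": [
            {"contentType": "PlainText", "content": "Want me to go ahead with the fix?"}
        ],
    }
```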

Data & Infrastructure

  • Amazon DynamoDB: incident history, status tracking, PIR storage
  • Amazon S3: Post-Incident Report PDFs with presigned download URLs
  • Bedrock Knowledge Bases: RAG retrieval from runbooks and past incident learnings
  • LocalStack: local dev emulation of DynamoDB + S3, with zero AWS spend for testing

Evaluate the system in 3 steps.

  git clone https://github.com/sujeetmadihalli/NovaOps.git
  bash start_war_room.sh
  bash trigger_live_outage.sh