Autonomous incident response on AWS

The SRE that investigates, validates, and calls you on the phone.

8 AI agents on AWS Strands Agents SDK. Reasoning by Amazon Nova. Voice calls via Amazon Connect + Nova Sonic. Storage on DynamoDB + S3. No LangChain. No OpenAI. No ElevenLabs. Pure AWS, end to end.

100% AWS

The problem

Incident at 3 AM.
45 minutes just to understand what broke.

Pull logs. Check Kubernetes. Query Prometheus. Read recent commits. Page someone. Repeat. By the time a human understands the root cause, users have already noticed.

NovaOps

8 AI agents investigate.
A jury validates.
Your phone rings.

Under 60 seconds from alert to proposed fix. The AI never auto-executes critical actions — it explains the situation on a phone call and waits for your verbal "go ahead."

Why it matters

Everyone else stitches six or more vendors together.
NovaOps uses one cloud.

Typical AI agent stack

Framework: LangChain / LangGraph / CrewAI / Letta
LLMs: OpenAI / Claude / Kimi / MiniMax
Voice AI: ElevenLabs / Sarvam AI / Deepgram
Telephony: Twilio / Vonage / Plivo
Storage: MongoDB / Postgres / Pinecone
Compute: Vercel / Railway / Heroku

6+ vendors. 6+ API keys. 6+ bills. Fragile glue code everywhere.

NovaOps — 100% AWS

Framework: AWS Strands Agents SDK
LLMs: Amazon Nova 2 Lite via Amazon Bedrock
Voice AI: Amazon Nova Sonic via Amazon Bedrock
Telephony: Amazon Connect + Amazon Lex V2 + AWS Lambda
Storage: Amazon DynamoDB + Amazon S3
Compute: AWS Lambda + Amazon ECS / EC2

One cloud. One IAM. One bill. Zero vendor stitching.

Architecture

Three stages. Two independent pipelines. One governance gate.

The Jury never sees the War Room's reasoning. If they disagree, execution is blocked.

War Room

Deep sequential investigation

Triage Agent classifies domain & severity
4 analysts run in parallel — logs, metrics, K8s, GitHub
Root Cause Reasoner synthesizes findings
Critic Agent adversarially reviews (max 3 loops)
Remediation Planner proposes action
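
The War Room flow above can be sketched in plain Python. This is a minimal illustration, not the NovaOps implementation: every function here is a hypothetical stub standing in for an LLM-backed agent, and only the shape comes from the text (triage first, four analysts in parallel, a critic loop capped at 3 rounds, then a remediation proposal).

```python
import asyncio
from dataclasses import dataclass, field

MAX_CRITIC_LOOPS = 3  # from the text: the critic reviews at most 3 times


@dataclass
class Investigation:
    alert: str
    domain: str = ""
    severity: str = ""
    findings: dict = field(default_factory=dict)
    root_cause: str = ""
    critic_rounds: int = 0
    proposed_action: str = ""


async def triage(alert):
    # Hypothetical stub: a real triage agent would call an LLM here.
    return ("oom", "P1") if "OOM" in alert else ("unknown", "P3")


async def analyst(source, alert):
    # Hypothetical stub for one of the logs / metrics / K8s / GitHub analysts.
    await asyncio.sleep(0)
    return source, f"{source} evidence for: {alert}"


def reason(findings, feedback):
    # Hypothetical stub: folds any critic feedback into the hypothesis.
    suffix = f" (revised x{len(feedback)})" if feedback else ""
    return f"root cause from {len(findings)} sources{suffix}"


def critique(root_cause, round_no):
    # Hypothetical critic: objects to the first draft, accepts the revision.
    return None if round_no >= 1 else "needs supporting metric"


async def run_war_room(alert):
    inv = Investigation(alert=alert)
    inv.domain, inv.severity = await triage(alert)

    # The four analysts run concurrently; results come back in input order.
    sources = ["logs", "metrics", "k8s", "github"]
    results = await asyncio.gather(*(analyst(s, alert) for s in sources))
    inv.findings = dict(results)

    # Reasoner and critic alternate until the critic accepts or the cap is hit.
    feedback = []
    for round_no in range(MAX_CRITIC_LOOPS):
        inv.root_cause = reason(inv.findings, feedback)
        inv.critic_rounds = round_no + 1
        objection = critique(inv.root_cause, round_no)
        if objection is None:
            break
        feedback.append(objection)

    inv.proposed_action = "restart_pods" if inv.domain == "oom" else "escalate"
    return inv


inv = asyncio.run(run_war_room("checkout-service OOM"))
print(inv.severity, inv.critic_rounds, inv.proposed_action)
```

The cap on critic rounds bounds investigation time, which matters when the target is under 60 seconds from alert to proposed fix.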

Blind Jury

Independent validation panel

4 specialist jurors receive only raw context
Parallel deliberation with timeout isolation
Judge LLM synthesizes verdicts
6-check Escalation Gate catches edge cases
No War Room influence — prevents groupthink
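
One way to picture "parallel deliberation with timeout isolation" is a per-juror timeout, so a slow or hung juror cannot stall the rest of the panel. The sketch below is an assumption-heavy illustration: the juror names, the timeout value, and the majority-vote `judge` are all hypothetical (the text says a Judge LLM synthesizes the verdicts; a vote count is only a stand-in).

```python
import asyncio

JUROR_TIMEOUT_S = 0.1  # assumed value; the text does not state one


async def juror(name, raw_context, delay):
    # Hypothetical juror: sees only raw context, never War Room reasoning.
    await asyncio.sleep(delay)
    return {"juror": name, "action": "restart_pods"}


async def isolated(coro, name):
    # Each juror gets its own timeout, so one failure stays contained.
    try:
        return await asyncio.wait_for(coro, timeout=JUROR_TIMEOUT_S)
    except asyncio.TimeoutError:
        return {"juror": name, "action": None, "error": "timeout"}


async def deliberate(raw_context):
    # Simulated latencies: juror_c is too slow and will time out.
    panel = {"juror_a": 0.0, "juror_b": 0.0, "juror_c": 1.0, "juror_d": 0.0}
    return await asyncio.gather(
        *(isolated(juror(name, raw_context, delay), name)
          for name, delay in panel.items())
    )


def judge(verdicts):
    # Stand-in for the Judge LLM: most common action among valid verdicts.
    votes = [v["action"] for v in verdicts if v["action"]]
    if not votes:
        return None
    return max(set(votes), key=votes.count)


verdicts = asyncio.run(deliberate({"alert": "checkout-service OOM"}))
print(judge(verdicts), [v.get("error") for v in verdicts])
```

Because the timed-out juror returns a marked verdict instead of raising, the panel still produces a result from the remaining three.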

Convergence & Governance

Final decision layer

Compares War Room vs Jury proposed actions
Agreement: confidence +0.15
Disagreement: confidence −0.30, forced approval
Policy engine scores risk 0–100
Critical incidents trigger outbound voice call
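
The convergence rule above is simple arithmetic, sketched here under a few assumptions: confidence is a float clamped to the range 0 to 1 (the clamp is not stated in the text), and the `Decision` type and `converge` function are hypothetical names. Only the +0.15 and −0.30 adjustments and the forced approval on disagreement come from the list.

```python
from dataclasses import dataclass

AGREE_BONUS = 0.15       # from the text
DISAGREE_PENALTY = 0.30  # from the text


@dataclass
class Decision:
    agreed: bool
    confidence: float
    force_approval: bool  # True means a human must approve before execution


def converge(war_room_action, jury_action, base_confidence):
    """Compare the two independently proposed actions and adjust confidence."""
    if war_room_action == jury_action:
        confidence = min(1.0, base_confidence + AGREE_BONUS)
        return Decision(True, round(confidence, 2), False)
    # Disagreement: lower confidence and force human approval.
    confidence = max(0.0, base_confidence - DISAGREE_PENALTY)
    return Decision(False, round(confidence, 2), True)


print(converge("restart_pods", "restart_pods", 0.70))
print(converge("restart_pods", "rollback", 0.70))
```

Note that `force_approval=False` on agreement only means convergence itself did not force a review; the policy engine can still require approval, for example for any P1.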

New feature

Your phone rings.
Nova explains the incident.
You say "go ahead."

When a P1 incident hits with a risk score of 85 or higher, NovaOps doesn't just send a Slack message. It places a real outbound phone call via Amazon Connect, powered by a real-time Nova AI conversation.

Escalation policy triggers: severity P1 + risk ≥ 85
Amazon Connect dials on-call: real outbound phone call
Nova AI explains the situation: real-time conversation via Lex + Lambda + Bedrock
"Yes, go ahead": verbal approval triggers remediation through the governance gate

If the call fails? Automatic Slack critical escalation. No incident is ever dropped.
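
The trigger and the Slack fallback can be sketched as one small function. This is an illustration under stated assumptions, not the project's code: `escalate`, `should_call`, the `cfg` keys, and the `FakeConnect` test double are hypothetical. The keyword arguments passed to `start_outbound_voice_contact` follow the boto3 Amazon Connect client, so a real `boto3.client("connect")` could be passed in place of the fake.

```python
def should_call(severity, risk_score):
    """Trigger from the text: severity P1 and a risk score of at least 85."""
    return severity == "P1" and risk_score >= 85


def escalate(incident, connect_client, notify_slack, cfg):
    """Place a voice call for critical incidents; fall back to Slack on failure."""
    if not should_call(incident["severity"], incident["risk_score"]):
        return "no_call"
    try:
        connect_client.start_outbound_voice_contact(
            DestinationPhoneNumber=cfg["on_call_number"],
            ContactFlowId=cfg["contact_flow_id"],
            InstanceId=cfg["instance_id"],
            SourcePhoneNumber=cfg["source_number"],
            # Contact attributes must be strings; the contact flow can read them.
            Attributes={"incident_id": incident["id"],
                        "summary": incident["summary"]},
        )
        return "call_placed"
    except Exception as exc:
        # Never drop the incident: any call failure becomes a Slack escalation.
        notify_slack(f"CRITICAL {incident['id']}: voice call failed ({exc})")
        return "slack_fallback"


class FakeConnect:
    """Test double so the sketch runs without AWS credentials."""

    def __init__(self, fail=False):
        self.fail = fail
        self.calls = []

    def start_outbound_voice_contact(self, **kwargs):
        if self.fail:
            raise RuntimeError("line busy")
        self.calls.append(kwargs)
        return {"ContactId": "hypothetical-contact-id"}


# All values below are placeholders, not real identifiers.
cfg = {"on_call_number": "+15555550100", "source_number": "+15555550101",
       "contact_flow_id": "flow-id-placeholder",
       "instance_id": "instance-id-placeholder"}
incident = {"id": "INC-1", "severity": "P1", "risk_score": 92,
            "summary": "checkout-service OOM"}
slack_messages = []

print(escalate(incident, FakeConnect(), slack_messages.append, cfg))
print(escalate(incident, FakeConnect(fail=True), slack_messages.append, cfg))
```

Passing the client and the Slack notifier in as arguments keeps the policy testable in mock mode, with no AWS credentials required.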

NovaOps SRE

"Hey, this is NovaOps. Checkout-service just hit a P1 OOM event. Heap usage is at 97%. We're recommending an immediate pod restart. Want me to go ahead?"

"Yes, do it."

"Approved. Restarting pods now. I'll notify the team. Goodbye."

Capabilities

Everything an SRE team needs. Autonomous investigation, human-approved remediation.

Ghost Mode

AI proposes, humans approve. Every remediation requires explicit consent — verbal on a call, a Slack button, or a dashboard click. The AI never auto-executes critical actions. Full append-only audit trail for compliance.
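
A consent gate with an append-only log might look like the sketch below. The `AuditLog` class, the `execute_with_consent` function, and the shape of the `approval` dict are hypothetical; only the three approval channels and the rule that nothing runs without explicit consent come from the text. An in-memory list stands in for whatever durable store a real audit trail would use.

```python
import json
import time

APPROVAL_CHANNELS = {"voice", "slack", "dashboard"}  # channels named in the text


class AuditLog:
    """Append-only by interface: entries can be added and read, never edited."""

    def __init__(self):
        self._entries = []

    def append(self, event, **details):
        record = {"ts": time.time(), "event": event, **details}
        self._entries.append(json.dumps(record, sort_keys=True))

    def entries(self):
        return tuple(self._entries)  # immutable view for readers


def execute_with_consent(action, approval, run, audit):
    """Run `action` only if a human approved it on a known channel."""
    audit.append("proposed", action=action)
    if approval is None or approval.get("channel") not in APPROVAL_CHANNELS:
        audit.append("blocked", action=action, reason="no explicit consent")
        return False
    audit.append("approved", action=action,
                 channel=approval["channel"], by=approval["by"])
    run(action)
    audit.append("executed", action=action)
    return True


audit = AuditLog()
executed = []
execute_with_consent("restart_pods", None, executed.append, audit)
execute_with_consent("restart_pods", {"channel": "voice", "by": "on-call"},
                     executed.append, audit)
print(executed, len(audit.entries()))
```

Logging the proposal before the consent check means blocked attempts appear in the trail too, which is usually what a compliance review wants to see.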

6 Failure Domains

OOM, traffic surge, deadlock, config drift, dependency failure, cascading failure. Domain-specific playbooks with targeted investigation strategies.

Risk Scoring

Policy engine computes risk 0–100 from action weight + severity + confidence. Rollbacks always need approval. P1 always needs approval.
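
A toy version of such a policy engine is shown below. The text gives the inputs (action weight, severity, confidence), the 0 to 100 range, and the two always-approve rules; the specific weights, the uncertainty term, and the default threshold of 50 are assumptions made up for this sketch.

```python
# All weights are illustrative assumptions, not values from NovaOps.
ACTION_WEIGHT = {"restart_pods": 20, "scale_up": 15, "rollback": 40}
SEVERITY_WEIGHT = {"P1": 40, "P2": 25, "P3": 10}
UNCERTAINTY_MAX = 20  # points added when confidence is 0


def risk_score(action, severity, confidence):
    """Combine action weight, severity, and confidence into a 0-100 score."""
    uncertainty = (1.0 - confidence) * UNCERTAINTY_MAX
    raw = (ACTION_WEIGHT.get(action, 30)        # unknown actions score high
           + SEVERITY_WEIGHT.get(severity, 10)
           + uncertainty)
    return max(0, min(100, round(raw)))


def needs_approval(action, severity, score, threshold=50):
    """Rollbacks and P1 incidents always need approval, per the text."""
    return action == "rollback" or severity == "P1" or score >= threshold


score = risk_score("restart_pods", "P1", 0.9)
print(score, needs_approval("restart_pods", "P1", score))
```

Lower confidence raises the score in this sketch, so a shaky diagnosis pushes the same action toward human review.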

Post-Incident Reports

Auto-generated PIR with root cause analysis, timeline, and remediation steps. PDF exported to S3. Nobody writes post-mortems manually.

123 Tests. Full Coverage.

Unit tests for every module. Integration tests for the full critical path. E2E tests simulating real P1 OOM incidents through escalation, voice call, and Slack fallback. Mock mode works without any AWS credentials.

Built entirely on AWS

Every layer is AWS. Here's the full stack.

AI & Agents

AWS Strands Agents SDK
Multi-agent orchestration framework — War Room graph, tool use, parallel dispatch

Amazon Nova 2 Lite
LLM inference for all 8 agents, jury, critic, and governance reasoning

Amazon Nova Sonic
Real-time voice AI for conversational phone calls with on-call engineers

Amazon Bedrock
Managed foundation model platform — single API for all Nova models

Voice & Telephony

Amazon Connect
Outbound phone calls to on-call engineers via StartOutboundVoiceContact

Amazon Lex V2
Speech-to-text transcription and conversation turn management

Amazon Polly
Text-to-speech for incident briefing playback in Contact Flow

AWS Lambda
Serverless Lex fulfillment — bridges speech to Nova real-time conversation

Data & Infrastructure

Amazon DynamoDB
Incident history, status tracking, PIR storage

Amazon S3
Post-Incident Report PDFs with presigned download URLs

Bedrock Knowledge Bases
RAG retrieval from runbooks and past incident learnings

LocalStack
Local dev emulation of DynamoDB + S3 — zero AWS spend for testing

Evaluate the system in 3 steps.

git clone https://github.com/sujeetmadihalli/NovaOps.git
bash start_war_room.sh
bash trigger_live_outage.sh
