Autonomous incident response on AWS
Eight AI agents built on the AWS Strands Agents SDK. Reasoning by Amazon Nova. Voice calls via Amazon Connect + Nova Sonic. Storage on DynamoDB + S3. No LangChain. No OpenAI. No ElevenLabs. Pure AWS, end to end.
Pull logs. Check Kubernetes. Query Prometheus. Read recent commits. Page someone. Repeat. By the time a human understands the root cause, users have already noticed.
Under 60 seconds from alert to proposed fix. The AI never auto-executes critical actions — it explains the situation on a phone call and waits for your verbal "go ahead."
6+ vendors. 6+ API keys. 6+ bills. Fragile glue code everywhere.
One cloud. One IAM. One bill. Zero vendor stitching.
The Jury never sees the War Room's reasoning. If the two disagree, execution is blocked.
Deep sequential investigation
Independent validation panel
Final decision layer
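The blind-validation idea above can be sketched in a few lines of Python. This is a minimal illustration, not NovaOps code: `Proposal`, `Juror`, and `jury_gate` are hypothetical names, and requiring unanimous agreement is an assumption, since the page only says that disagreement blocks execution.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Proposal:
    action: str     # e.g. "restart_pods"
    reasoning: str  # War Room's reasoning; never passed to the jurors


# Hypothetical juror type: sees only raw evidence, returns the action it would take.
Juror = Callable[[dict], str]


def jury_gate(proposal: Proposal, evidence: dict, jurors: Sequence[Juror]) -> bool:
    """Return True only if every juror independently picks the proposed action."""
    if not jurors:
        return False  # no validation panel means no execution
    # Each juror gets its own copy of the evidence and nothing from the proposal.
    verdicts = [juror(dict(evidence)) for juror in jurors]
    # Assumption: unanimity is required; any dissent blocks execution.
    return all(verdict == proposal.action for verdict in verdicts)
```

The point of the design is that `proposal.reasoning` never reaches a juror, so agreement cannot come from the jurors simply echoing the War Room.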
When a P1 incident hits with a risk score above 85, NovaOps doesn't just send a Slack message. It places a real outbound phone call via Amazon Connect and holds a real-time conversation powered by Nova Sonic.
If the call fails? Automatic Slack critical escalation. No incident is ever dropped.
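A rough sketch of that escalation rule in Python, under stated assumptions: `place_call` and `post_slack` are hypothetical stand-ins for the Amazon Connect and Slack integrations, and sending an ordinary Slack message for incidents below the call threshold is a guess at the default path, not something the page confirms.

```python
from typing import Callable

RISK_CALL_THRESHOLD = 85  # from the text: P1 with risk score above 85 triggers a call


def escalate(
    severity: str,
    risk: int,
    place_call: Callable[[], bool],      # hypothetical: returns True if the call connected
    post_slack: Callable[[str], None],   # hypothetical: posts a message at the given level
) -> str:
    """Return the channel that ended up carrying the escalation."""
    if severity == "P1" and risk > RISK_CALL_THRESHOLD:
        try:
            if place_call():
                return "voice"
        except Exception:
            pass  # treat any call error as a failed call and fall through
        # Fallback so the incident is never dropped.
        post_slack("critical")
        return "slack_critical"
    # Assumption: everything else goes out as a regular Slack message.
    post_slack("normal")
    return "slack"
```

Catching every exception around the call is deliberate here: whatever goes wrong on the voice path, the Slack fallback still fires.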
"Hey, this is NovaOps. Checkout-service just hit a P1 OOM event. Heap usage is at 97%. We're recommending an immediate pod restart. Want me to go ahead?"
"Yes, do it."
"Approved. Restarting pods now. I'll notify the team. Goodbye."
AI proposes, humans approve. Every remediation requires explicit consent — verbal on a call, a Slack button, or a dashboard click. The AI never auto-executes critical actions. Full append-only audit trail for compliance.
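One way to picture the approval gate and the audit trail together is the Python sketch below. All names (`AuditLog`, `Approval`, `remediate`) are hypothetical, and chaining each entry to the previous one with a SHA-256 hash is an assumption about how an append-only log could be made tamper-evident, not a description of the actual storage.

```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional

APPROVAL_CHANNELS = {"voice", "slack", "dashboard"}  # the three channels named in the text


@dataclass(frozen=True)
class Approval:
    approver: str
    channel: str


class AuditLog:
    """Append-only log: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list = []

    def append(self, event: str, **details: str) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else ""
        entry = {"ts": time.time(), "event": event, "details": details, "prev": prev}
        # Assumption: hash-chain entries so later tampering is detectable.
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple:
        return tuple(dict(e) for e in self._entries)  # copies; callers cannot mutate the log


def remediate(action: str, approval: Optional[Approval],
              execute: Callable[[str], None], log: AuditLog) -> bool:
    """Run the action only with explicit approval; record every step."""
    log.append("proposed", action=action)
    if approval is None or approval.channel not in APPROVAL_CHANNELS:
        log.append("blocked", action=action, reason="no explicit approval")
        return False
    log.append("approved", action=action, approver=approval.approver, channel=approval.channel)
    execute(action)
    log.append("executed", action=action)
    return True
```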
OOM, traffic surge, deadlock, config drift, dependency failure, cascading failure. Domain-specific playbooks with targeted investigation strategies.
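As an illustration only, the six incident types could be mapped to playbooks roughly like this. The step names are hypothetical labels for the investigation actions mentioned earlier on this page (logs, Kubernetes, Prometheus, recent commits), and the ordering inside each playbook is an assumption.

```python
from enum import Enum


class IncidentType(Enum):
    OOM = "oom"
    TRAFFIC_SURGE = "traffic_surge"
    DEADLOCK = "deadlock"
    CONFIG_DRIFT = "config_drift"
    DEPENDENCY_FAILURE = "dependency_failure"
    CASCADING_FAILURE = "cascading_failure"


# Hypothetical generic plan used when no targeted playbook is registered.
DEFAULT_PLAN = ("pull_logs", "check_kubernetes", "query_prometheus", "read_recent_commits")

# Hypothetical targeted playbooks: each one puts the most telling signal first.
PLAYBOOKS = {
    IncidentType.OOM: ("query_prometheus", "check_kubernetes", "pull_logs"),
    IncidentType.CONFIG_DRIFT: ("read_recent_commits", "check_kubernetes"),
}


def investigation_plan(kind: IncidentType) -> tuple:
    """Return the ordered investigation steps for an incident type."""
    return PLAYBOOKS.get(kind, DEFAULT_PLAN)
```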
Policy engine computes risk 0–100 from action weight + severity + confidence. Rollbacks always need approval. P1 always needs approval.
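A minimal sketch of such a policy, assuming made-up numbers: the weight tables, the uncertainty term, and the approval threshold below are all hypothetical, chosen only to show the shape of the calculation. Only the 0 to 100 range, the three inputs, and the two "always needs approval" rules come from the text; whether any other action may skip approval is likewise an assumption.

```python
# Hypothetical weights; the real values are not given in the source.
ACTION_WEIGHT = {"scale_up": 20, "restart_pods": 30, "rollback": 50}
SEVERITY_WEIGHT = {"P3": 10, "P2": 25, "P1": 40}
MAX_UNCERTAINTY_POINTS = 20  # hypothetical: low confidence adds up to 20 points
APPROVAL_THRESHOLD = 50      # hypothetical cut-off for everything else


def risk_score(action: str, severity: str, confidence: float) -> int:
    """Combine action weight, severity, and confidence into a 0-100 score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    uncertainty = round((1.0 - confidence) * MAX_UNCERTAINTY_POINTS)
    raw = ACTION_WEIGHT[action] + SEVERITY_WEIGHT[severity] + uncertainty
    return max(0, min(100, raw))  # clamp to the 0-100 range


def needs_approval(action: str, severity: str, confidence: float) -> bool:
    """Hard rules first, then the score threshold."""
    if action == "rollback" or severity == "P1":
        return True  # from the text: rollbacks and P1 always need approval
    return risk_score(action, severity, confidence) >= APPROVAL_THRESHOLD
```

Putting the hard rules ahead of the score means no tuning of the weights can ever let a rollback or a P1 action through without a human.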
Auto-generated post-incident review (PIR) with root cause analysis, timeline, and remediation steps. PDF exported to S3. Nobody writes post-mortems manually.
Unit tests for every module. Integration tests for the full critical path. E2E tests simulating real P1 OOM incidents through escalation, voice call, and Slack fallback. Mock mode works without any AWS credentials.
git clone https://github.com/sujeetmadihalli/NovaOps.git
cd NovaOps
bash start_war_room.sh
bash trigger_live_outage.sh