DevopsAgent watches your Jenkins, GitHub Actions, ArgoCD, Kubernetes and cloud stack — diagnoses failures, decides what to do, executes safe fixes, and learns from every incident. Open source. Self-hosted. Provider-agnostic LLMs.
# 2025-11-15 14:02:11 jenkins-job: deploy-prod #482
[detect] pipeline failed: build step exited 137 (OOMKilled)
[plan] similar incident #475 resolved by raising heap → confidence 0.83
[decide] threshold 0.75 met → AUTO_REMEDIATE
[execute] patched Jenkinsfile: -Xmx 2g → 4g
[verify] re-run #483 ... SUCCESS in 4m 12s
[learn] stored embedding + outcome in memory (sqlite)
DevopsAgent doesn't replace your CI/CD. It sits beside it, watches outcomes, and acts only when a fix clears the confidence threshold.
Pipeline failures, OOMKills, deployment rollbacks, security findings and synthetic alerts all flow through structured failure contexts.
A safety layer scores each suggested fix against a confidence threshold and per-agent policy before anything mutates production.
Specialised executors retry Jenkins jobs, restart pods, replay GitHub Actions, re-deploy Argo apps, run Ansible playbooks or open a ticket.
Every outcome is embedded and stored, so the next similar failure is matched semantically — faster, cheaper, more confident.
Slack, Teams, Email or PagerDuty get a clean, explainable timeline of what happened, why, and what changed.
Every decision, prompt, response and side-effect is logged — fully replayable and reviewable from the bundled web UI.
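As a rough illustration of how semantic matching of past outcomes could work, the sketch below stores incident summaries in SQLite next to a vector and returns the closest earlier incident by cosine similarity. It is a minimal sketch under stated assumptions: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, and the `incidents` table and the `remember`/`recall` helpers are hypothetical names, not DevopsAgent's actual schema or API.

```python
import json
import math
import re
import sqlite3
import zlib

DIM = 64  # toy vector size (assumption); a real embedding model defines its own


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) incidents table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incidents ("
        "id INTEGER PRIMARY KEY, summary TEXT, fix TEXT, outcome TEXT, vector TEXT)"
    )
    return conn


def remember(conn: sqlite3.Connection, summary: str, fix: str, outcome: str) -> None:
    """Store an incident together with the vector of its summary."""
    conn.execute(
        "INSERT INTO incidents (summary, fix, outcome, vector) VALUES (?, ?, ?, ?)",
        (summary, fix, outcome, json.dumps(embed(summary))),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, summary: str):
    """Return (similarity, fix, outcome) for the closest stored incident, or None."""
    query = embed(summary)
    best = None
    rows = conn.execute("SELECT fix, outcome, vector FROM incidents")
    for fix, outcome, raw in rows:
        # Both vectors are unit length, so the dot product is the cosine similarity.
        score = sum(a * b for a, b in zip(query, json.loads(raw)))
        if best is None or score > best[0]:
            best = (score, fix, outcome)
    return best


conn = open_memory()
remember(conn, "build step exited 137 OOMKilled", "raise heap -Xmx 2g to 4g", "success")
remember(conn, "deployment rollback after failed health check", "re-deploy app", "success")
match = recall(conn, "pipeline failed: build step exited 137 (OOMKilled)")
print(match)
```

A linear scan is enough for a small store. A larger one would probably want a vector index, which this sketch leaves out.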
Every plan lands in one of three buckets — driven by per-agent confidence thresholds, allow/deny lists, and rate limits. No silent magic.
Plan score ≥ threshold and the action matches an allowed action class. The agent executes, verifies, then notifies. Zero human latency.
Score is below the auto-fix threshold but still meaningful: the agent posts a one-click suggestion to your inbox with full reasoning attached.
Unknown failure class, sensitive scope (prod secrets, IAM, DB migrations) or a tripped rate limit: the incident goes straight to on-call with a context bundle.
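The three buckets can be sketched as a single function. This is a minimal sketch, not the project's actual safety layer: only `AUTO_REMEDIATE` and the 0.75 threshold come from the sample log, while the `SUGGEST` and `ESCALATE` names, the 0.50 suggestion floor, the action and scope names, and the hourly limit are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    AUTO_REMEDIATE = "auto_remediate"  # name taken from the sample log
    SUGGEST = "suggest"                # hypothetical name
    ESCALATE = "escalate"              # hypothetical name


@dataclass(frozen=True)
class Policy:
    auto_threshold: float = 0.75       # threshold shown in the sample log
    suggest_threshold: float = 0.50    # assumed floor for a "meaningful" score
    allowed_actions: frozenset = frozenset({"retry_job", "restart_pod", "patch_heap"})
    sensitive_scopes: frozenset = frozenset({"prod_secrets", "iam", "db_migration"})
    max_actions_per_hour: int = 5      # assumed rate limit


@dataclass(frozen=True)
class Plan:
    action: str
    scope: str
    confidence: float
    known_failure_class: bool = True


def triage(plan: Plan, policy: Policy, actions_last_hour: int) -> Verdict:
    """Map a plan onto one of the three buckets."""
    # Escalation conditions take priority over any score.
    if (
        not plan.known_failure_class
        or plan.scope in policy.sensitive_scopes
        or actions_last_hour >= policy.max_actions_per_hour
    ):
        return Verdict.ESCALATE
    if plan.confidence >= policy.auto_threshold and plan.action in policy.allowed_actions:
        return Verdict.AUTO_REMEDIATE
    if plan.confidence >= policy.suggest_threshold:
        return Verdict.SUGGEST
    # Assumption: scores too low to be useful also go to a human.
    return Verdict.ESCALATE


policy = Policy()
print(triage(Plan("patch_heap", "ci_build", 0.83), policy, actions_last_hour=0))
print(triage(Plan("patch_heap", "ci_build", 0.60), policy, actions_last_hour=0))
print(triage(Plan("rotate_key", "iam", 0.95), policy, actions_last_hour=0))
```

Checking the escalation conditions first means a high score can never override a sensitive scope or a tripped rate limit.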
A modular agent framework with pluggable LLMs, structured memory, and executors for the tools your team already uses.
Ollama, OpenAI, Anthropic, Gemini and any OpenAI-compatible endpoint — one config switch, no vendor lock-in.
Drop a Python file in agents/plugins/ and the registry auto-loads it. Build your own integration in minutes.
SQLite + embeddings recall similar past failures so the agent gets faster, cheaper, and more confident over time.
Per-agent confidence thresholds, allow/deny lists and rate limits gate every action before anything mutates real infra.
Jenkins retry, GitHub Actions replay, Kubernetes pod restart, Docker restart, Ansible, Terraform, Git revert and more.
PHP dashboard with run history, metrics, costs, ROI, webhook log, super-admin and pipeline timeline — out of the box.
Slack, Teams, Email and PagerDuty channels for suggestions, auto-fixes and escalations — with full reasoning attached.
Prometheus-ready exporter, anomaly detection and a built-in ROI calculator so you can prove value to your CFO.
Every prompt, response, decision and side-effect is logged in a replayable timeline. SOC-friendly by design.
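To make the plugin claim above concrete, here is one way a registry could auto-load Python files from a directory such as agents/plugins/. It is a sketch under assumptions: the `register()` hook, the `load_plugins` helper and the sample plugin are hypothetical, and the real DevopsAgent plugin contract may look different.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path


def load_plugins(plugin_dir: str) -> dict:
    """Import every .py file in plugin_dir and merge what each one registers."""
    registry = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        register = getattr(module, "register", None)  # hypothetical plugin hook
        if callable(register):
            registry.update(register())
    return registry


# A throwaway plugin written to a temp directory, standing in for agents/plugins/.
PLUGIN_SOURCE = textwrap.dedent(
    '''
    def restart_service(context):
        return f"restarted {context['service']}"


    def register():
        return {"restart_service": restart_service}
    '''
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "restart_service.py").write_text(PLUGIN_SOURCE)
    executors = load_plugins(tmp)

print(executors["restart_service"]({"service": "nginx"}))
```

Loading by file path keeps a plugin a single drop-in file with no packaging step, at the cost of running whatever code sits in that directory, so the directory itself has to be trusted.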
Sources feed structured failure contexts into a pluggable agent framework that emits decisions through a safety layer to specialised executors.
┌──────────────────────────────────────────────────────────────────────┐
│                               SOURCES                                │
│  Jenkins · GitHub Actions · ArgoCD · Kubernetes · Cloud              │
│  Webhooks · Log files · XLS scan · Synthetic alerts                  │
└────────────────────────────┬─────────────────────────────────────────┘
                             ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           AGENT FRAMEWORK                            │
│   ┌─────────────┐   ┌─────────────┐   ┌──────────────┐               │
│   │  Extractor  │ → │  Decision   │ → │   Executor   │               │
│   │ (LLM call)  │   │  (thresh.)  │   │  (per tool)  │               │
│   └─────────────┘   └─────────────┘   └──────────────┘               │
│          ▲                 │                 │                       │
│          │                 ▼                 ▼                       │
│   ┌─────────────────────────────────────────────────┐                │
│   │    Memory · Audit · Metrics · Notifications     │                │
│   │   (SQLite + embeddings + Prometheus + Slack)    │                │
│   └─────────────────────────────────────────────────┘                │
└──────────────────────────────────────────────────────────────────────┘
                             │
                             ▼
                  Web UI · CLI · Grafana
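The Extractor → Decision → Executor flow in the diagram could be wired together roughly as follows. This is only a sketch: `extract` uses a hard-coded rule where the real system makes an LLM call, and `FailureContext`, `Plan`, `EXECUTORS` and `handle` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureContext:
    source: str
    job: str
    message: str


@dataclass
class Plan:
    action: str
    confidence: float


def extract(ctx: FailureContext) -> Plan:
    """Placeholder for the LLM call: a hard-coded rule stands in for the model."""
    if "OOMKilled" in ctx.message:
        return Plan(action="raise_heap", confidence=0.83)
    return Plan(action="open_ticket", confidence=0.20)


def decide(plan: Plan, threshold: float = 0.75) -> bool:
    """Threshold gate: True means the plan may run automatically."""
    return plan.confidence >= threshold


# One executor per tool; these lambdas only describe what a real one would do.
EXECUTORS: dict[str, Callable[[FailureContext], str]] = {
    "raise_heap": lambda ctx: f"patched heap for {ctx.job}",
    "open_ticket": lambda ctx: f"ticket opened for {ctx.job}",
}


def handle(ctx: FailureContext, audit: list) -> str:
    """Run one failure through extract, decide and execute, recording each step."""
    plan = extract(ctx)
    audit.append(("plan", plan.action, plan.confidence))
    # Below the threshold, fall back to the non-mutating executor.
    action = plan.action if decide(plan) else "open_ticket"
    result = EXECUTORS[action](ctx)
    audit.append(("result", result))
    return result


audit_log: list = []
ctx = FailureContext("jenkins", "deploy-prod", "build step exited 137 (OOMKilled)")
print(handle(ctx, audit_log))
print(audit_log)
```

The shared `audit_log` list plays the role of the Memory · Audit · Metrics · Notifications box: every stage appends to it, so a run can be reviewed step by step afterwards.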
Production-ready executors today, more shipping on the roadmap below.
Run the published image from GHCR, point it at your LLM provider, and tail your first remediation.
# 1. pull the published image
docker pull your-registry/your-image:latest

# 2. run with your preferred LLM provider
docker run -d --name devopsagent \
  -p 8080:8080 \
  -e OLLAMA_HOST=http://host.docker.internal:11434 \
  -v devopsagent-data:/app/store \
  your-registry/your-image:latest

# 3. open the web UI
open http://localhost:8080

# or run a one-shot CLI loop
docker exec -it devopsagent python run_agents.py --once
Every roadmap item is also tracked in docs/req1.md with implementation notes.
Open source, MIT licensed, runs in one container. Start with shadow mode and let the agent earn its trust.