Architecture
Kure Monitor ships five components that together provide failure detection, security scanning, and AI-powered troubleshooting.
System overview

                                  ┌─────────────────────────┐
                                  │     LLM Providers       │
                                  │  OpenAI / Anthropic /   │
                                  │  Groq / Gemini /        │
                                  │  Copilot / Ollama       │
                                  └───────────▲─────────────┘
                                              │
    ┌─────────────────┐                       │
    │   Kure Agent    │    ┌──────────────────────────────────────┐
    │   (DaemonSet)   │───>│           Kure Backend               │
    │                 │    │            (FastAPI)                 │
    │  - Pod Monitor  │    │                                      │
    │                 │    │  - REST API                          │
    └─────────────────┘    │  - WebSocket Server                  │
             │             │  - Solution Engine                   │
             │             │  - LLM Integration                   │
             │             └──────────────────────────────────────┘
             │                     │                   │
             v                     v                   v
    ┌─────────────────┐    ┌──────────────┐    ┌──────────────────┐
    │   Kubernetes    │    │  PostgreSQL  │    │  Kure Frontend   │
    │   API Server    │    │   Database   │    │     (React)      │
    └─────────────────┘    └──────────────┘    └──────────────────┘
             ^                                         │
             │                                         │
    ┌─────────────────┐                                │
    │ Security Scanner│                                │
    │  (Deployment)   │────────────────────────────────┘
    └─────────────────┘          WebSocket Updates

Components
Agent (DaemonSet)
Watches the Kubernetes API for pod failures and reports them to the backend.
- Stack: Python 3.11, Kubernetes Python client, asyncio, aiohttp
- Runs: one pod per node
- Auth: X-Service-Token to backend
Sample report:

    {
      "name": "failing-pod",
      "namespace": "default",
      "reason": "CrashLoopBackOff",
      "message": "Back-off restarting failed container",
      "events": [...],
      "logs": "Error: Cannot connect to database...",
      "manifest": {...},
      "container_statuses": [...]
    }

Backend (Deployment)
The brain. Receives reports, generates AI solutions, stores results, broadcasts to the frontend.
- Stack: Python 3.11, FastAPI, Pydantic, asyncpg, WebSockets
- REST API surface (abridged):

      /api
      ├── /auth            # status, login, signup, me
      ├── /pods            # failures, history, ignored, logs, mirror
      ├── /mirror          # preview, deploy, status, delete, active
      ├── /security        # findings, fixes, trusted registries
      ├── /diagram         # namespaces, namespace graph, workload graph, manifest
      ├── /admin
      │   ├── /llm             # LLM config + test
      │   ├── /excluded-*      # suppressions
      │   ├── /notifications   # Slack / Teams
      │   └── /settings/*      # retention, mirror TTL
      └── /ws              # WebSocket

See the full API Reference.
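
The backend's job (receive a report, apply exclusions, produce a solution, store it, broadcast it) can be sketched in a few functions. This is a minimal illustration, not Kure's actual code: `handle_report`, `is_excluded`, `rule_based_solution`, the glob-style exclusion patterns, and the `store`, `broadcast`, and `llm` callables are hypothetical names chosen for the example. The broadcast payload follows the `pod_failure` message shape documented for the WebSocket channel.

```python
import fnmatch
import json
import uuid
from typing import Callable, Iterable, Optional


def is_excluded(report: dict, pod_patterns: Iterable[str], namespaces: Iterable[str]) -> bool:
    """Return True when the namespace or pod name matches an exclusion rule.

    Assumption: pod exclusions are glob patterns such as "batch-*".
    """
    if report["namespace"] in set(namespaces):
        return True
    return any(fnmatch.fnmatchcase(report["name"], pattern) for pattern in pod_patterns)


def rule_based_solution(report: dict) -> str:
    """Hypothetical fallback used when no LLM answer is available."""
    hints = {
        "CrashLoopBackOff": "Inspect the container logs and the startup command.",
        "ImagePullBackOff": "Check the image name, tag, and registry credentials.",
    }
    return hints.get(report["reason"], "Review the pod events and logs.")


def handle_report(
    report: dict,
    store: Callable[[dict], None],
    broadcast: Callable[[str], None],
    llm: Optional[Callable[[dict], str]] = None,
    pod_patterns: Iterable[str] = (),
    namespaces: Iterable[str] = (),
) -> Optional[dict]:
    """Process one failure report; return the broadcast message, or None if excluded."""
    if is_excluded(report, pod_patterns, namespaces):
        return None

    solution = None
    if llm is not None:
        try:
            solution = llm(report)
        except Exception:
            solution = None  # fall through to the rule-based path
    if not solution:
        solution = rule_based_solution(report)

    record = {"id": str(uuid.uuid4()), **report, "solution": solution}
    store(record)

    message = {
        "type": "pod_failure",
        "data": {key: record[key] for key in ("id", "name", "namespace", "reason", "solution")},
    }
    broadcast(json.dumps(message))
    return message
```

Passing the storage and broadcast steps in as callables keeps the sketch free of database and WebSocket dependencies; in a real service they would be backed by PostgreSQL and the connected dashboard clients.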
Frontend (Deployment)
React 18 + Tailwind dashboard. Real-time updates via WebSocket.
| Route | Purpose |
|---|---|
| /login | Sign-in / initial admin setup |
| / | Pod failures dashboard |
| /security | Security findings |
| /diagram | Topology graph (2.3.2+; RBAC mode added in 2.3.3) |
| /admin | Admin panel |
Security Scanner (Deployment)
Audits all pods for security misconfigurations on a schedule and on demand. Reports findings to the backend with X-Service-Token.
PostgreSQL (StatefulSet)
Persistent storage for failures, security findings, admin settings, exclusions.
| Table | Description |
|---|---|
| failed_pods | Pod failure records with solutions |
| security_issues | Security scan results |
| admin_config | LLM and notification settings |
| pod_exclusions | Excluded pod patterns |
| namespace_exclusions | Excluded namespaces |
| users | Dashboard accounts |
Data flows
Pod failure detection
1. Pod enters failure state
2. Agent detects via watch API
3. Agent collects events, logs, manifest, container statuses
4. Agent POSTs /api/pods/failed (X-Service-Token)
5. Backend checks exclusion rules
6. Backend generates solution (LLM or rule-based fallback)
7. Backend stores in PostgreSQL
8. Backend broadcasts via WebSocket
9. Frontend renders the failure card

Security scanning
1. Scanner lists all pods
2. Runs security checks per pod
3. POSTs findings to backend
4. Backend stores in PostgreSQL
5. Frontend displays in Security tab

Mirror pod testing
1. User clicks "Test Fix" on a failing pod
2. Frontend → POST /api/mirror/preview/{pod_id}
3. Backend fetches failure context, calls LLM, returns fixed manifest + explanation
4. (Optional) user edits manifest
5. User clicks Deploy → POST /api/mirror/deploy/{pod_id}
6. Backend creates mirror pod via K8s API:
   - Renames with "-mirror-{hash}" suffix
   - Adds labels excluding it from monitoring/scanning
   - Tracks TTL for auto-cleanup
7. Frontend polls GET /api/mirror/status/{mirror_id}
8. Mirror auto-deletes after TTL expires

Communication protocols
The REST API handles CRUD on failures and findings, configuration management, and one-shot data retrieval.
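
As an illustration of a REST call, here is a sketch of how a service such as the agent might build its failure report request. It uses only the standard library so it stays self-contained (the agent's listed stack uses aiohttp). The endpoint path and the `X-Service-Token` name come from this page; sending the token as an HTTP header, the `build_failure_request` helper, the base URL, and the port are assumptions made for the example.

```python
import json
import urllib.request


def build_failure_request(base_url: str, service_token: str, report: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/pods/failed.

    Assumption: the service token travels in an X-Service-Token HTTP header.
    """
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pods/failed",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Service-Token": service_token,
        },
    )


# Hypothetical in-cluster address; the real service port is not stated here.
request = build_failure_request(
    "http://kure-backend:8000",
    "example-token",
    {
        "name": "failing-pod",
        "namespace": "default",
        "reason": "CrashLoopBackOff",
        "message": "Back-off restarting failed container",
    },
)
```

Passing the result to `urllib.request.urlopen(request, timeout=10)` would send it; building the request separately makes the URL, headers, and body easy to inspect in tests.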
WebSocket
Real-time push:

    {
      "type": "pod_failure",
      "data": {
        "id": "uuid",
        "name": "pod-name",
        "namespace": "default",
        "reason": "CrashLoopBackOff",
        "solution": "..."
      }
    }

Internal traffic matrix
| From | To | Protocol | Purpose |
|---|---|---|---|
| Agent | Backend | HTTP/REST | Report failures |
| Agent | K8s API | HTTPS | Watch pods |
| Scanner | Backend | HTTP/REST | Report issues |
| Scanner | K8s API | HTTPS | List pods |
| Backend | PostgreSQL | TCP | Data storage |
| Backend | LLM | HTTPS | Generate solutions |
| Frontend | Backend | HTTP/WS | UI data |
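
The matrix above can also be written down as data, which is handy when checking that a proposed connection is one of the documented flows. This is only a sketch: the `ALLOWED_FLOWS` mapping mirrors the table rows, while the `is_allowed` and `protocol_for` helpers are hypothetical names introduced for the example.

```python
from typing import Optional

# One entry per row of the traffic matrix: (from, to) -> protocol.
ALLOWED_FLOWS = {
    ("Agent", "Backend"): "HTTP/REST",
    ("Agent", "K8s API"): "HTTPS",
    ("Scanner", "Backend"): "HTTP/REST",
    ("Scanner", "K8s API"): "HTTPS",
    ("Backend", "PostgreSQL"): "TCP",
    ("Backend", "LLM"): "HTTPS",
    ("Frontend", "Backend"): "HTTP/WS",
}


def is_allowed(source: str, destination: str) -> bool:
    """Return True when the flow appears in the documented matrix."""
    return (source, destination) in ALLOWED_FLOWS


def protocol_for(source: str, destination: str) -> Optional[str]:
    """Return the documented protocol for a flow, or None if it is not listed."""
    return ALLOWED_FLOWS.get((source, destination))
```

For instance, `is_allowed("Frontend", "PostgreSQL")` is false because the frontend only talks to the backend.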
Deployment topology

    Namespace: kure-system
    ├── DaemonSet: kure-agent (one pod per node)
    ├── Deployment: kure-backend (replicas: 1-3)
    ├── Deployment: kure-frontend (replicas: 1-3)
    ├── Deployment: kure-security-scanner (replicas: 1)
    ├── StatefulSet: postgresql (replicas: 1)
    ├── Services: kure-backend, kure-frontend, postgresql
    ├── ConfigMap: kure-config
    ├── Secrets: <release>-bootstrap, kure-secrets
    ├── ServiceAccounts + ClusterRoles + ClusterRoleBindings
    └── NetworkPolicy: kure-network-policy

Agent ServiceAccount

    rules:
      - apiGroups: [""]
        resources: ["pods", "pods/log", "events", "nodes"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["metrics.k8s.io"]
        resources: ["pods", "nodes"]
        verbs: ["get", "list"]

Backend ServiceAccount

    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "create", "delete"]
      - apiGroups: [""]
        resources: ["pods/log"]
        verbs: ["get"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["list"]
      # plus diagram-related verbs (see /features/diagram/)

Note: the backend ServiceAccount is intentionally not granted access to Secrets. See Topology Diagram → Security model.
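
A property like "no access to Secrets" is easy to check mechanically. The sketch below walks a list of RBAC rules and reports whether any of them grants a given resource, counting the `*` wildcard as a match. `grants_resource` is a hypothetical helper, and `BACKEND_RULES` contains only the three rules shown above, not the diagram-related ones this page omits.

```python
def grants_resource(rules: list[dict], resource: str, api_group: str = "") -> bool:
    """Return True if any rule covers the resource in the given API group.

    The "*" wildcard in apiGroups or resources is treated as matching everything.
    """
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        group_match = api_group in groups or "*" in groups
        resource_match = resource in resources or "*" in resources
        if group_match and resource_match:
            return True
    return False


# Only the rules listed on this page; the diagram-related rules are not included.
BACKEND_RULES = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "create", "delete"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["events"], "verbs": ["list"]},
]

backend_can_touch_secrets = grants_resource(BACKEND_RULES, "secrets")
```

A check of this kind could run in CI against the rendered ClusterRole so that a later change does not widen the backend's permissions unnoticed.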
Security Scanner ServiceAccount

    rules:
      - apiGroups: [""]
        resources: ["pods", "namespaces"]
        verbs: ["get", "list", "watch"]

High availability
| Component | HA strategy |
|---|---|
| Backend | 2-3 replicas |
| Frontend | 2-3 replicas |
| PostgreSQL | External managed DB (RDS, Cloud SQL) for production |
| Agent | DaemonSet (inherently HA) |
| Scanner | Single replica (stateless) |
For multi-replica backends, the bootstrap Secret keeps session-secret consistent across replicas — see Authentication.
Network isolation

    NetworkPolicy:
      - Allow Agent → Backend
      - Allow Scanner → Backend
      - Allow Frontend → Backend
      - Allow Backend → PostgreSQL
      - Allow Backend → External (LLM APIs)
      - Deny all other traffic

Container hardening
All containers run with:
- Non-root user (UID 1001)
- Read-only root filesystem
- No privilege escalation
- Dropped capabilities
- Seccomp RuntimeDefault
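
The list above maps onto standard Kubernetes `securityContext` fields, so a container spec can be checked against it. The following is a minimal sketch under two assumptions: that "dropped capabilities" means dropping `ALL`, and that all settings live in one merged security context dictionary. `hardening_violations` is a hypothetical helper, not part of Kure.

```python
def hardening_violations(security_context: dict) -> list[str]:
    """Return a human-readable list of hardening settings that are missing."""
    problems = []
    if security_context.get("runAsUser") != 1001:
        problems.append("runAsUser is not 1001")
    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot is not true")
    if security_context.get("readOnlyRootFilesystem") is not True:
        problems.append("root filesystem is writable")
    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("privilege escalation is allowed")
    # Assumption: hardening drops every capability rather than a subset.
    if "ALL" not in security_context.get("capabilities", {}).get("drop", []):
        problems.append("capabilities are not dropped")
    if security_context.get("seccompProfile", {}).get("type") != "RuntimeDefault":
        problems.append("seccomp profile is not RuntimeDefault")
    return problems


hardened = {
    "runAsUser": 1001,
    "runAsNonRoot": True,
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}
```

Calling `hardening_violations(hardened)` returns an empty list, while an empty security context reports every item.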