Architecture

Kure Monitor ships five components that together provide failure detection, security scanning, and AI-powered troubleshooting.

                               ┌─────────────────────────┐
                               │      LLM Providers      │
                               │  OpenAI / Anthropic /   │
                               │  Groq / Gemini /        │
                               │  Copilot / Ollama       │
                               └───────────▲─────────────┘
┌─────────────────┐                        │
│   Kure Agent    │    ┌──────────────────────────────────────┐
│   (DaemonSet)   │───>│             Kure Backend             │
│                 │    │              (FastAPI)               │
│  - Pod Monitor  │    │                                      │
│                 │    │  - REST API                          │
└─────────────────┘    │  - WebSocket Server                  │
         │             │  - Solution Engine                   │
         │             │  - LLM Integration                   │
         │             └──────────────────────────────────────┘
         │                    │                    │
         v                    v                    v
┌─────────────────┐    ┌──────────────┐    ┌──────────────────┐
│   Kubernetes    │    │  PostgreSQL  │    │  Kure Frontend   │
│   API Server    │    │   Database   │    │     (React)      │
└─────────────────┘    └──────────────┘    └──────────────────┘
         ^                                         │
         │                                         │
┌─────────────────┐                                │
│ Security Scanner│                                │
│  (Deployment)   │────────────────────────────────┘
└─────────────────┘        WebSocket Updates

The Kure Agent watches the Kubernetes API for pod failures and reports them to the backend.

  • Stack: Python 3.11, Kubernetes Python client, asyncio, aiohttp
  • Runs: one pod per node
  • Auth: X-Service-Token to backend

Sample report:

{
  "name": "failing-pod",
  "namespace": "default",
  "reason": "CrashLoopBackOff",
  "message": "Back-off restarting failed container",
  "events": [...],
  "logs": "Error: Cannot connect to database...",
  "manifest": {...},
  "container_statuses": [...]
}
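
As a rough illustration of how a report like the sample could be submitted, the sketch below builds (but does not send) the HTTP request using only the standard library. The /api/pods/failed path and the X-Service-Token header come from this page; the helper name build_report_request, the base URL with port 8000, and the token value are assumptions, and the real agent uses aiohttp rather than urllib.

```python
import json
import urllib.request


def build_report_request(base_url: str, token: str, report: dict) -> urllib.request.Request:
    """Build a POST request carrying one pod failure report (hypothetical helper)."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pods/failed",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Service-Token": token,  # shared service token; the value here is made up
        },
    )


# Placeholder payload; the field names follow the sample report.
report = {
    "name": "failing-pod",
    "namespace": "default",
    "reason": "CrashLoopBackOff",
    "message": "Back-off restarting failed container",
    "events": [],
    "logs": "Error: Cannot connect to database...",
    "manifest": {},
    "container_statuses": [],
}

# Assumed in-cluster address; nothing is sent until urlopen() is called.
request = build_report_request("http://kure-backend:8000", "example-token", report)
```

Passing the returned object to urllib.request.urlopen would perform the actual POST.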

The Kure Backend is the brain of the system. It receives reports, generates AI solutions, stores results, and broadcasts them to the frontend.

  • Stack: Python 3.11, FastAPI, Pydantic, asyncpg, WebSockets
  • REST API surface (abridged):
/api
├── /auth               # status, login, signup, me
├── /pods               # failures, history, ignored, logs, mirror
├── /mirror             # preview, deploy, status, delete, active
├── /security           # findings, fixes, trusted registries
├── /diagram            # namespaces, namespace graph, workload graph, manifest
├── /admin
│   ├── /llm            # LLM config + test
│   ├── /excluded-*     # suppressions
│   ├── /notifications  # Slack / Teams
│   └── /settings/*     # retention, mirror TTL
└── /ws                 # WebSocket

See the full API Reference.

The Kure Frontend is a React 18 + Tailwind dashboard that receives real-time updates via WebSocket.

Route      Purpose
/login     Sign-in / initial admin setup
/          Pod failures dashboard
/security  Security findings
/diagram   Topology graph (2.3.2+; RBAC mode added in 2.3.3)
/admin     Admin panel

The Security Scanner audits all pods for security misconfigurations, both on a schedule and on demand, and reports findings to the backend with X-Service-Token.

PostgreSQL provides persistent storage for failures, security findings, admin settings, and exclusions.

Table                 Description
failed_pods           Pod failure records with solutions
security_issues       Security scan results
admin_config          LLM and notification settings
pod_exclusions        Excluded pod patterns
namespace_exclusions  Excluded namespaces
users                 Dashboard accounts

Pod failure flow:

1. Pod enters failure state
2. Agent detects via watch API
3. Agent collects events, logs, manifest, container statuses
4. Agent POSTs /api/pods/failed (X-Service-Token)
5. Backend checks exclusion rules
6. Backend generates solution (LLM or rule-based fallback)
7. Backend stores in PostgreSQL
8. Backend broadcasts via WebSocket
9. Frontend renders the failure card
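
Steps 5 through 8 can be sketched as a small pipeline. This is a minimal illustration, not the backend's actual code: glob-style pod patterns, the fallback texts, and the in-memory store and broadcast stand-ins for PostgreSQL and the WebSocket server are all assumptions.

```python
from fnmatch import fnmatch
from typing import Callable, Optional

# Hypothetical rule-based fallbacks keyed by failure reason.
FALLBACK_SOLUTIONS = {
    "CrashLoopBackOff": "Inspect container logs and verify the startup command and dependencies.",
    "ImagePullBackOff": "Check the image name, tag, and registry credentials.",
}


def is_excluded(report: dict, pod_patterns: list, namespaces: set) -> bool:
    """Step 5: skip reports from an excluded namespace or matching a pod pattern."""
    if report["namespace"] in namespaces:
        return True
    return any(fnmatch(report["name"], pattern) for pattern in pod_patterns)


def generate_solution(report: dict, llm: Optional[Callable[[dict], str]]) -> str:
    """Step 6: try the LLM first; fall back to rules if it is absent or fails."""
    if llm is not None:
        try:
            return llm(report)
        except Exception:
            pass  # fall through to the rule-based answer
    return FALLBACK_SOLUTIONS.get(report["reason"], "No automated solution available.")


def handle_report(report, pod_patterns, namespaces, llm, store, broadcast) -> bool:
    """Run steps 5-8; returns False when the report was excluded."""
    if is_excluded(report, pod_patterns, namespaces):
        return False
    record = {**report, "solution": generate_solution(report, llm)}
    store.append(record)  # step 7: list stands in for PostgreSQL
    broadcast({"type": "pod_failure", "data": record})  # step 8: stands in for WebSocket
    return True
```

Catching the LLM error inside generate_solution means a provider outage degrades to the rule-based answer instead of dropping the report.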

Security scan flow:

1. Scanner lists all pods
2. Runs security checks per pod
3. POSTs findings to backend
4. Backend stores in PostgreSQL
5. Frontend displays in Security tab
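
Steps 1 and 2 of the scan could look roughly like the following. The page does not enumerate the scanner's checks, so the three checks here (privileged mode, privilege escalation, writable root filesystem) and the shape of a finding are illustrative assumptions; pods are plain dicts shaped like Kubernetes manifests.

```python
def check_pod(pod: dict) -> list:
    """Return findings for one pod manifest (illustrative checks only)."""
    findings = []
    meta = pod.get("metadata", {})
    for container in pod.get("spec", {}).get("containers", []):
        ctx = container.get("securityContext") or {}
        problems = []
        if ctx.get("privileged"):
            problems.append("container runs privileged")
        # Kubernetes allows escalation unless it is explicitly disabled.
        if ctx.get("allowPrivilegeEscalation", True):
            problems.append("privilege escalation is not disabled")
        if not ctx.get("readOnlyRootFilesystem", False):
            problems.append("root filesystem is writable")
        for problem in problems:
            findings.append({
                "namespace": meta.get("namespace", "default"),
                "pod": meta.get("name"),
                "container": container.get("name"),
                "issue": problem,
            })
    return findings


def scan(pods: list) -> list:
    """Iterate over the listed pods and collect every finding."""
    return [finding for pod in pods for finding in check_pod(pod)]
```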

Pod mirror flow:

1. User clicks "Test Fix" on a failing pod
2. Frontend → POST /api/mirror/preview/{pod_id}
3. Backend fetches failure context, calls LLM, returns fixed manifest + explanation
4. (Optional) user edits manifest
5. User clicks Deploy → POST /api/mirror/deploy/{pod_id}
6. Backend creates mirror pod via K8s API:
- Renames with "-mirror-{hash}" suffix
- Adds labels excluding it from monitoring/scanning
- Tracks TTL for auto-cleanup
7. Frontend polls GET /api/mirror/status/{mirror_id}
8. Mirror auto-deletes after TTL expires
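
The manifest preparation in step 6 and the TTL check in step 8 might be sketched as below. Only the "-mirror-{hash}" suffix comes from this page. The label and annotation keys (kure.example/...), the way the hash is derived, and storing the expiry as an annotation are hypothetical choices made for illustration.

```python
import copy
import hashlib
import time


def build_mirror_manifest(manifest: dict, ttl_seconds: int, now=None) -> dict:
    """Return a copy of a pod manifest prepared for deployment as a mirror."""
    now = time.time() if now is None else now
    mirror = copy.deepcopy(manifest)  # never mutate the stored failure context
    meta = mirror.setdefault("metadata", {})
    original = meta.get("name", "pod")
    digest = hashlib.sha256(f"{original}:{now}".encode("utf-8")).hexdigest()[:8]
    meta["name"] = f"{original}-mirror-{digest}"

    labels = dict(meta.get("labels") or {})
    labels["kure.example/mirror"] = "true"  # hypothetical exclusion label
    meta["labels"] = labels

    annotations = dict(meta.get("annotations") or {})
    annotations["kure.example/expires-at"] = str(int(now + ttl_seconds))  # hypothetical
    meta["annotations"] = annotations

    # Server-populated fields must not be sent when creating a new object.
    for field in ("uid", "resourceVersion", "creationTimestamp"):
        meta.pop(field, None)
    mirror.pop("status", None)
    return mirror


def is_expired(mirror: dict, now: float) -> bool:
    """True once the mirror's TTL has elapsed and it should be deleted."""
    return now >= int(mirror["metadata"]["annotations"]["kure.example/expires-at"])
```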

The REST API handles CRUD on failures and findings, configuration management, and one-shot data retrieval.

Real-time push over WebSocket:

{
  "type": "pod_failure",
  "data": {
    "id": "uuid",
    "name": "pod-name",
    "namespace": "default",
    "reason": "CrashLoopBackOff",
    "solution": "..."
  }
}
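
A consumer of these messages typically routes on the "type" field. The sketch below shows one way to do that; it is written in Python for consistency with the other examples on this page (the real frontend is React), the MessageRouter class is hypothetical, and "pod_failure" is the only message type this page documents.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


class MessageRouter:
    """Route decoded push messages to handlers by their "type" field."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, message_type: str, handler: Handler) -> None:
        """Register the handler for one message type."""
        self._handlers[message_type] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one text frame; return True if a handler consumed it."""
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            return False
        if not isinstance(message, dict):
            return False
        message_type = message.get("type")
        if not isinstance(message_type, str):
            return False
        handler = self._handlers.get(message_type)
        if handler is None:
            return False  # unknown types are ignored rather than raising
        handler(message.get("data", {}))
        return True
```

Ignoring unknown types keeps an older client working if the server later adds new message kinds.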

From      To          Protocol   Purpose
Agent     Backend     HTTP/REST  Report failures
Agent     K8s API     HTTPS      Watch pods
Scanner   Backend     HTTP/REST  Report issues
Scanner   K8s API     HTTPS      List pods
Backend   PostgreSQL  TCP        Data storage
Backend   LLM         HTTPS      Generate solutions
Frontend  Backend     HTTP/WS    UI data

Namespace: kure-system
├── DaemonSet: kure-agent (one pod per node)
├── Deployment: kure-backend (replicas: 1-3)
├── Deployment: kure-frontend (replicas: 1-3)
├── Deployment: kure-security-scanner (replicas: 1)
├── StatefulSet: postgresql (replicas: 1)
├── Services: kure-backend, kure-frontend, postgresql
├── ConfigMap: kure-config
├── Secrets: <release>-bootstrap, kure-secrets
├── ServiceAccounts + ClusterRoles + ClusterRoleBindings
└── NetworkPolicy: kure-network-policy

Agent RBAC:

rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events", "nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]

Backend RBAC:

rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list"]
  # plus diagram-related verbs (see /features/diagram/)

Note: the backend ServiceAccount is intentionally not granted access to Secrets. See Topology Diagram → Security model.
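
To make the effect of the backend rules concrete, here is a deliberately simplified rule matcher. It is a sketch of how RBAC rules are read, not the Kubernetes authorizer: it supports "*" wildcards but ignores resourceNames and nonResourceURLs, and it only evaluates the rules shown on this page (the diagram-related verbs are omitted).

```python
# The backend rules as listed on this page, without the diagram-related verbs.
BACKEND_RULES = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "create", "delete"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["events"], "verbs": ["list"]},
]


def _covers(values: list, wanted: str) -> bool:
    """A rule field matches if it names the value or uses the '*' wildcard."""
    return "*" in values or wanted in values


def is_allowed(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """RBAC is additive: a request passes if any single rule covers it."""
    return any(
        _covers(rule.get("apiGroups", []), api_group)
        and _covers(rule.get("resources", []), resource)
        and _covers(rule.get("verbs", []), verb)
        for rule in rules
    )


can_create_pods = is_allowed(BACKEND_RULES, "", "pods", "create")
can_read_secrets = is_allowed(BACKEND_RULES, "", "secrets", "get")
```

Because no rule names secrets, the second check fails, which is the property the note describes.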

Scanner RBAC:

rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]

Component   HA strategy
Backend     2-3 replicas
Frontend    2-3 replicas
PostgreSQL  External managed DB (RDS, Cloud SQL) for production
Agent       DaemonSet (inherently HA)
Scanner     Single replica (stateless)

For multi-replica backends, the bootstrap Secret keeps session-secret consistent across replicas; see Authentication.
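
Why a shared secret matters can be shown with a generic HMAC example. This page does not describe how Kure signs sessions, so the token format and both function names below are assumptions; the point is only that a token signed by one replica verifies on another when, and only when, both hold the same secret.

```python
import hashlib
import hmac


def sign_session(session_id: str, secret: bytes) -> str:
    """Return "<session_id>.<hex mac>" (hypothetical token format)."""
    mac = hmac.new(secret, session_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{session_id}.{mac}"


def verify_session(token: str, secret: bytes) -> bool:
    """Recompute the MAC with this replica's secret and compare in constant time."""
    session_id, separator, mac = token.rpartition(".")
    if not separator:
        return False
    expected = hmac.new(secret, session_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8"))


# Replica A issues the token; replica B accepts it only with the same secret.
token = sign_session("session-123", b"shared-secret")
accepted_with_shared_secret = verify_session(token, b"shared-secret")
accepted_with_other_secret = verify_session(token, b"per-replica-secret")
```

If each replica generated its own secret at startup, a login handled by one replica would be rejected by the next request that lands on another.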

NetworkPolicy:
- Allow Agent → Backend
- Allow Scanner → Backend
- Allow Frontend → Backend
- Allow Backend → PostgreSQL
- Allow Backend → External (LLM APIs)
- Deny all other traffic
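
The policy amounts to an allow-list with a default deny. A toy model of that logic, using made-up component identifiers as strings, makes the direction of each rule explicit; it is not a substitute for the actual NetworkPolicy manifest.

```python
# (source, destination) pairs taken from the list above.
ALLOWED_FLOWS = {
    ("agent", "backend"),
    ("scanner", "backend"),
    ("frontend", "backend"),
    ("backend", "postgresql"),
    ("backend", "external"),  # outbound calls to LLM APIs
}


def is_flow_allowed(source: str, destination: str) -> bool:
    """Default deny: only explicitly listed (source, destination) pairs pass."""
    return (source, destination) in ALLOWED_FLOWS
```

Note that the pairs are directional: the frontend may reach the backend, but nothing here lets the frontend reach PostgreSQL directly.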

All containers run with:

  • Non-root user (UID 1001)
  • Read-only root filesystem
  • No privilege escalation
  • Dropped capabilities
  • Seccomp RuntimeDefault
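
Expressed as a container-level securityContext, those settings would look roughly like the dict below. The field names are standard Kubernetes fields; dropping ALL capabilities is an assumption, since this page only says capabilities are dropped, and the helper name is hypothetical.

```python
def hardened_security_context(uid: int = 1001) -> dict:
    """Build a securityContext dict mirroring the listed hardening settings."""
    return {
        "runAsNonRoot": True,
        "runAsUser": uid,
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},  # assumption: the exact set is not stated
        "seccompProfile": {"type": "RuntimeDefault"},
    }


security_context = hardened_security_context()
```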