Pod Monitoring
Kure continuously watches all pods across your cluster and detects failures the moment they happen.
Detected failure types
| Failure reason | Description |
|---|---|
| CrashLoopBackOff | Container keeps crashing and restarting |
| ImagePullBackOff | Unable to pull container image |
| ErrImagePull | Error during image pull |
| CreateContainerError | Error creating container |
| RunContainerError | Error running container |
| OOMKilled | Container killed due to out of memory |
| Pending | Pod stuck in pending state (after `agent.pendingGracePeriod`) |
| FailedScheduling | Cannot schedule pod to any node |
| FailedMount | Volume mount failure |
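
As an illustration of where several of these reasons surface, the sketch below reads them from a pod's status as returned by the Kubernetes API in JSON form. It is not Kure's actual detection code: the function name `failure_reason` is hypothetical, and it only covers the reasons that Kubernetes normally reports on container statuses. `Pending` is a pod phase, and `FailedScheduling` and `FailedMount` are normally reported as events, so they are left out here.

```python
from typing import Optional

# Reasons that Kubernetes normally reports under state.waiting.reason.
WAITING_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerError",
    "RunContainerError",
}


def failure_reason(pod: dict) -> Optional[str]:
    """Return the first known failure reason found in the pod status, or None."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_REASONS:
            return waiting["reason"]
        # OOMKilled appears on a terminated state, current or previous.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"
    return None
```

Note that a waiting reason takes precedence in this sketch, so a container that is in `CrashLoopBackOff` after an out-of-memory kill is reported as `CrashLoopBackOff`.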
How AI solutions are generated
- Agent detects pod failure via the Kubernetes watch API
- Agent collects context: events, logs (last 100 lines), manifest, container statuses
- Agent POSTs to backend (`/api/pods/failed`)
- Backend checks exclusion rules
- Backend generates a solution:
  - If LLM configured → send context to the AI provider
  - Otherwise → use rule-based solutions
- Backend stores in PostgreSQL
- Backend broadcasts via WebSocket to the frontend
- Frontend renders the failure card
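
The context-collection and solution-selection steps above can be sketched as follows. This is a minimal illustration, not the backend's real code: the function names, the context field names, and the rule texts in `RULE_BASED_SOLUTIONS` are all assumptions made for the example, and the `llm` argument stands in for whatever AI provider is configured.

```python
from typing import Callable, Optional

# Placeholder rule texts, invented for illustration only.
RULE_BASED_SOLUTIONS = {
    "OOMKilled": "Check memory limits and usage for the container.",
    "ImagePullBackOff": "Verify the image name, tag, and registry credentials.",
}


def build_context(reason: str, events: list, logs: str,
                  manifest: dict, container_statuses: list) -> dict:
    """Bundle the failure context, keeping only the last 100 log lines."""
    return {
        "reason": reason,
        "events": events,
        "logs": logs.splitlines()[-100:],
        "manifest": manifest,
        "containerStatuses": container_statuses,
    }


def generate_solution(context: dict,
                      llm: Optional[Callable[[dict], str]] = None) -> str:
    """Use the LLM when one is configured, otherwise fall back to rules."""
    if llm is not None:
        return llm(context)
    return RULE_BASED_SOLUTIONS.get(
        context["reason"], "No rule-based solution for this failure reason."
    )
```

Passing the provider as a plain callable keeps the fallback path easy to exercise without any network access.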
The solution always includes:
- Diagnosis — what’s wrong
- Step-by-step fix
- Prevention tips
- Useful kubectl commands
Pod details
Click a failed pod to see:
- Pod metadata — node, phase, creation time
- Error details — failure reason and message
- Container statuses — state, image, restart count
- Recent events — Kubernetes events with timestamps
- AI solution — the generated guide
- Pod manifest — with lines highlighted that the AI referenced
Live logs
- Expand the pod row
- Click Logs
- Pick a container (if multi-container)
- Pick a tail length (50 / 100 / 500 / 1000 / 2000 lines)
- Toggle Previous container to see logs from a crashed previous instance
You can:
- Download logs as a text file
- Refresh / scroll to bottom
- Switch container without leaving the panel
Logs stream over Server-Sent Events at /api/pods/{ns}/{name}/logs/stream.
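
For readers who want to consume the stream outside the UI, here is a rough sketch of building the stream URL and decoding Server-Sent Events framing. The base URL, the function names, and the idea that each event's `data` field carries log text are assumptions; the payload format is not documented here. The parser handles only `data:` lines and comment lines, which is a subset of the SSE format.

```python
from typing import Iterable, Iterator
from urllib.parse import quote


def stream_url(base: str, namespace: str, name: str) -> str:
    """Fill in the /api/pods/{ns}/{name}/logs/stream path on a base URL."""
    ns = quote(namespace, safe="")
    pod = quote(name, safe="")
    return f"{base.rstrip('/')}/api/pods/{ns}/{pod}/logs/stream"


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event from raw SSE lines."""
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buf:
                yield "\n".join(buf)
                buf = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # one leading space is part of the framing
            buf.append(value)
```

`iter_sse_data` accepts any iterable of text lines, so it can be fed from an HTTP response opened in streaming mode or from a saved capture.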
Pod lifecycle actions
| Action | Effect |
|---|---|
| Mark as Investigating (eye icon) | Show the rest of the team that someone is on it |
| Mark as Resolved (checkmark) | Move to history |
| Ignore | Hide from active view |
| Restore | Move back from history or ignored |
| Retry AI | Regenerate the AI solution with fresh context |
| Dismiss | Drop the failure |
Configure auto-cleanup of resolved and ignored pods from Admin → Settings.
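
One way to picture these actions is as transitions between a handful of states. The sketch below is only a mental model: the state names (`active`, `investigating`, `resolved`, `ignored`), the action keys, and the choice to represent a dismissed failure as `None` are all hypothetical and are not taken from Kure's data model.

```python
from typing import Optional

# Hypothetical states and action keys, chosen for illustration.
TRANSITIONS = {
    ("active", "investigate"): "investigating",
    ("active", "resolve"): "resolved",
    ("investigating", "resolve"): "resolved",
    ("active", "ignore"): "ignored",
    ("investigating", "ignore"): "ignored",
    ("resolved", "restore"): "active",
    ("ignored", "restore"): "active",
}


def apply_action(state: str, action: str) -> Optional[str]:
    """Return the next state, or None when the failure is dismissed."""
    if action == "dismiss":
        return None  # the failure is dropped
    if action == "retry_ai":
        return state  # regenerating the solution does not move the pod
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"action {action!r} is not valid in state {state!r}")
```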
Exclusions
Stop monitoring noisy namespaces or short-lived pods from Admin → Suppressions:
- Namespace exclusions, e.g. `kube-system`, `kube-public`, `kure-system`
- Pod patterns, using wildcards: `test-*`, `*-job-*`, `debug-*`
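
To see how such rules would apply to a given pod, the sketch below uses Python's `fnmatch` module as a stand-in for the wildcard matching. The exact matching semantics Kure uses are not specified here, so treat this as an approximation; `is_excluded` is a hypothetical helper, and `fnmatchcase` also understands `?` and `[...]`, which the product may not.

```python
from fnmatch import fnmatchcase

EXCLUDED_NAMESPACES = {"kube-system", "kube-public", "kure-system"}
POD_PATTERNS = ["test-*", "*-job-*", "debug-*"]


def is_excluded(namespace: str, pod_name: str) -> bool:
    """True when the namespace is excluded or the pod name matches a pattern."""
    if namespace in EXCLUDED_NAMESPACES:
        return True
    return any(fnmatchcase(pod_name, pattern) for pattern in POD_PATTERNS)
```

With these example rules, `nightly-job-123` in `default` is excluded by the `*-job-*` pattern, while `web-7d9f` in `default` is still monitored.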
Pending grace period
Pods that enter Pending are flagged only after `agent.pendingGracePeriod` seconds have elapsed (default 120). This avoids paging on transient pod creation latency.
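
The grace-period check amounts to a simple time comparison. In this sketch the function name and the `pending_since` timestamp are assumptions, and how the agent actually tracks the start of the Pending phase is not described here.

```python
from datetime import datetime, timedelta, timezone


def should_flag_pending(pending_since: datetime, now: datetime,
                        grace_seconds: int = 120) -> bool:
    """True once the pod has been Pending for at least the grace period."""
    return (now - pending_since) >= timedelta(seconds=grace_seconds)


# Example: a pod that has been Pending for 30 seconds is not flagged yet.
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(should_flag_pending(start, start + timedelta(seconds=30)))
```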
    agent:
      pendingGracePeriod: 120 # seconds

Mirror pod testing
For any failing pod with an AI solution, you can deploy a mirror pod, a temporary copy with the AI fix applied, to verify that the fix works before committing to git.