
Pod Monitoring

Kure continuously watches all pods across your cluster and flags failures as soon as the agent observes them.

| Failure reason | Description |
| --- | --- |
| CrashLoopBackOff | Container keeps crashing and restarting |
| ImagePullBackOff | Unable to pull container image |
| ErrImagePull | Error during image pull |
| CreateContainerError | Error creating container |
| RunContainerError | Error running container |
| OOMKilled | Container killed due to out of memory |
| Pending | Pod stuck in pending state (after agent.pendingGracePeriod) |
| FailedScheduling | Cannot schedule pod to any node |
| FailedMount | Volume mount failure |
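
To illustrate how most of these reasons can be read from a pod's status, the sketch below inspects a dictionary shaped like the `status` field of a Kubernetes Pod object. The function name `failure_reason` and the lookup order are assumptions made for illustration; this is not Kure's actual detection logic. FailedScheduling and FailedMount are surfaced by Kubernetes as events rather than container states, so they are left out here.

```python
from typing import Optional

# Waiting-state reasons taken from the list of failure reasons above.
WAITING_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerError",
    "RunContainerError",
}


def failure_reason(status: dict) -> Optional[str]:
    """Return a failure reason for a pod status dict, or None if none is found.

    Hypothetical helper: the precedence between reasons is an assumption.
    """
    for cs in status.get("containerStatuses", []):
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_REASONS:
            return waiting["reason"]
        # OOMKilled appears on the current or the last terminated state.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"
    if status.get("phase") == "Pending":
        # The grace period is not applied here; this only reports the phase.
        return "Pending"
    return None
```

With this ordering, a container that is waiting in CrashLoopBackOff after an out-of-memory kill is reported as CrashLoopBackOff; a real implementation might prefer the more specific reason.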

When a pod fails, the report moves through the following pipeline:

  1. Agent detects pod failure via the Kubernetes watch API
  2. Agent collects context: events, logs (last 100 lines), manifest, container statuses
  3. Agent POSTs to backend (/api/pods/failed)
  4. Backend checks exclusion rules
  5. Backend generates a solution:
    • If LLM configured → send context to the AI provider
    • Otherwise → use rule-based solutions
  6. Backend stores in PostgreSQL
  7. Backend broadcasts via WebSocket to the frontend
  8. Frontend renders the failure card
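
Steps 4 to 7 of the pipeline above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the backend's real code: `handle_failed_pod`, the report fields (`namespace`, `reason`), and the injected `llm`, `rules`, `store`, and `broadcast` arguments are all hypothetical, a plain list stands in for PostgreSQL, and a callback stands in for the WebSocket broadcast. Only namespace exclusions are checked, to keep the sketch short.

```python
from typing import Callable, Optional


def handle_failed_pod(
    report: dict,
    excluded_namespaces: set,
    llm: Optional[Callable[[dict], str]],
    rules: dict,
    store: list,
    broadcast: Callable[[dict], None],
) -> Optional[dict]:
    """Process one failure report; return the stored record or None if excluded."""
    # Step 4: drop reports that match an exclusion rule.
    if report["namespace"] in excluded_namespaces:
        return None
    # Step 5: use the LLM when one is configured, otherwise fall back to rules.
    if llm is not None:
        solution = llm(report)
    else:
        solution = rules.get(report["reason"], "No rule-based solution available.")
    record = {**report, "solution": solution}
    # Step 6: persist the record (a list stands in for the database).
    store.append(record)
    # Step 7: notify connected clients (stands in for the WebSocket broadcast).
    broadcast(record)
    return record
```

Passing the LLM as an optional callable keeps the "LLM configured or not" decision in one place, which mirrors the branch in step 5.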

The solution always includes:

  • Diagnosis — what’s wrong
  • Step-by-step fix
  • Prevention tips
  • Useful kubectl commands

Click a failed pod to see:

  • Pod metadata — node, phase, creation time
  • Error details — failure reason and message
  • Container statuses — state, image, restart count
  • Recent events — Kubernetes events with timestamps
  • AI solution — the generated guide
  • Pod manifest — with lines highlighted that the AI referenced

To view a pod's logs:

  1. Expand the pod row
  2. Click Logs
  3. Pick a container (if multi-container)
  4. Pick a tail length (50 / 100 / 500 / 1000 / 2000 lines)
  5. Toggle Previous container to see logs from a crashed previous instance

You can:

  • Download logs as a text file
  • Refresh / scroll to bottom
  • Switch container without leaving the panel

Logs stream over Server-Sent Events at /api/pods/{ns}/{name}/logs/stream.
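
For readers who want to consume the stream outside the UI, the sketch below parses the standard Server-Sent Events wire format: `data:` fields, comment lines starting with `:`, and a blank line terminating each event. The helper name `sse_data` is hypothetical, and how Kure frames log lines into events (for example, one log line per event) is an assumption this sketch does not depend on.

```python
from typing import Iterable, Iterator


def sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE stream."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format strips a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.
```

Against a live server, the input could be the decoded lines of an HTTP response from the endpoint above; the host name and any authentication are deployment-specific and not covered here. An event that is still incomplete when the stream ends is discarded, as the SSE format specifies.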

| Action | Effect |
| --- | --- |
| Mark as Investigating (eye icon) | Show the rest of the team that someone is on it |
| Mark as Resolved (checkmark) | Move to history |
| Ignore | Hide from active view |
| Restore | Move back from history or ignored |
| Retry AI | Regenerate the AI solution with fresh context |
| Dismiss | Drop the failure |

Configure auto-cleanup of resolved and ignored pods from Admin → Settings.

Stop monitoring noisy namespaces or short-lived pods from Admin → Suppressions:

  • Namespace exclusions — e.g. kube-system, kube-public, kure-system
  • Pod patterns — wildcards: test-*, *-job-*, debug-*
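
The two suppression types above can be approximated with a namespace lookup and shell-style wildcard matching. This is a sketch under assumptions: `is_suppressed` is a hypothetical name, and Python's `fnmatchcase` also understands `?` and `[...]`, which Kure's pattern syntax may or may not support.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Set


def is_suppressed(
    namespace: str,
    pod_name: str,
    excluded_namespaces: Set[str],
    pod_patterns: Iterable[str],
) -> bool:
    """Return True if the pod matches a namespace exclusion or a pod pattern."""
    if namespace in excluded_namespaces:
        return True
    # Case-sensitive wildcard match against each configured pattern.
    return any(fnmatchcase(pod_name, pattern) for pattern in pod_patterns)
```

For example, with the patterns `test-*`, `*-job-*`, and `debug-*`, a pod named `nightly-job-12345` would be suppressed, while `api-7d9f` would not.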

Pods that enter Pending are flagged only after agent.pendingGracePeriod seconds have elapsed (default 120). This avoids alerts for pods that are only briefly Pending during normal scheduling and startup.

agent:
  pendingGracePeriod: 120 # seconds
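
The effect of this setting can be illustrated with a small check. The helper `pending_too_long` is hypothetical; which timestamp the agent measures from, and whether the boundary is inclusive, are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def pending_too_long(
    pending_since: datetime,
    now: datetime,
    grace_seconds: int = 120,  # mirrors the documented default
) -> bool:
    """Return True once a pod has been Pending for at least the grace period."""
    return now - pending_since >= timedelta(seconds=grace_seconds)
```

With the default of 120 seconds, a pod that has been Pending for 30 seconds is not flagged, and one that has been Pending for 120 seconds or more is.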

For any failing pod with an AI solution, you can deploy a mirror pod — a temporary copy with the AI fix applied — to verify the fix works before committing to git.