Troubleshooting

Overview

Kubegrade troubleshooting turns Kubernetes troubleshooting from a reactive, manual process into a guided, context-rich workflow with AI, automation, and human control.

Guided investigations (what Kubegrade inspects)

Kubegrade agents investigate issues using correlated signals across Kubernetes layers.

Common inspection domains:
- Pod and container status
- Events and recent changes
- Resource pressure (CPU/memory)
- Scheduling constraints
- Service/Endpoints/Ingress routing
- DNS/network indicators
- ConfigMaps/Secrets references and misconfigurations
- Dependency relationships across workloads/services

Common incident playbooks

CrashLoopBackOff

Kubegrade checks:
- Container logs
- Exit codes
- Liveness/readiness probe failures
- Missing env/config/secret references
- Recent config changes

Typical outputs:
- Root cause candidates
- Impacted dependent services
- Proposed config or probe adjustments via PR

OOMKilled / resource exhaustion

Kubegrade checks:
- Memory limits/requests
- Actual usage patterns
- Recent release changes
- Node pressure and eviction signals

Typical outputs:
- Right-sizing recommendations
- HPA/resource tuning suggestions
- PR-ready config updates (if enabled)

5xx / Ingress/Service routing issues

Kubegrade checks:
- Ingress/controller status
- Service selectors/endpoints
- Pod readiness
- Upstream dependency health
- DNS/network path issues

DNS / network connectivity issues

Kubegrade checks:
- CoreDNS status
- Service DNS resolution
- NetworkPolicy constraints
- Endpoint availability
- Namespace/service name mismatches

Scheduling failures

Kubegrade checks:
- Resource requests vs. node capacity
- Taints/tolerations
- Node selectors / affinity
- PDB interactions

Evidence collection + timelines

Evidence types:
- Events
- Status transitions
- Metrics snapshots
- Dependency graph changes
- Config diffs / Git changes
- Alert triggers (if integrated)

Timelines

A useful incident timeline should show:
- First observed symptom
- Related alerts and state transitions
- Changes preceding the incident
- Actions taken (manual or Kubegrade-assisted)
- Recovery confirmation

Suggested remediations → PR workflow

Kubegrade can convert troubleshooting findings into reviewable changes.

Flow:
1. Investigation identifies root cause candidates
2. Kubegrade proposes remediation options
3. User selects/edits preferred option
4. Kubegrade generates PR against Git/IaC source
5. User reviews and merges
6. GitOps applies
7. Kubegrade verifies recovery signals

Post-incident review / export

After resolution, the Kubegrade assistant agent can support review and reporting.

Post-incident review contents:
- Incident summary
- Affected clusters/namespaces/services
- Timeline of events and changes
- Root cause (confirmed/probable)
- Actions taken
- PR links / approvals
- Prevention follow-ups
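
To make the review contents listed above more concrete, here is a minimal Python sketch of a record that holds those fields and exports them as JSON, with the timeline sorted chronologically. This is not Kubegrade's actual export format or API. The class names, field names, and sample values (`PostIncidentReview`, `TimelineEntry`, `root_cause_status`, the cluster and service names) are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class TimelineEntry:
    """One timeline item (hypothetical shape, not a Kubegrade type)."""
    timestamp: datetime
    kind: str          # e.g. "symptom", "alert", "change", "action", "recovery"
    description: str


@dataclass
class PostIncidentReview:
    """Holds the review fields listed in this section (illustrative only)."""
    summary: str
    affected: List[str]                 # clusters/namespaces/services
    root_cause: str
    root_cause_status: str              # "confirmed" or "probable"
    actions_taken: List[str] = field(default_factory=list)
    pr_links: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)
    timeline: List[TimelineEntry] = field(default_factory=list)

    def to_json(self) -> str:
        data = asdict(self)
        # Sort entries by time so changes preceding the incident appear first.
        ordered = sorted(self.timeline, key=lambda e: e.timestamp)
        data["timeline"] = [
            {
                "timestamp": e.timestamp.isoformat(),
                "kind": e.kind,
                "description": e.description,
            }
            for e in ordered
        ]
        return json.dumps(data, indent=2)


# Usage with made-up sample data.
review = PostIncidentReview(
    summary="checkout pods in CrashLoopBackOff",
    affected=["example-cluster/shop/checkout"],
    root_cause="missing Secret reference after a config change",
    root_cause_status="probable",
    actions_taken=["restored Secret reference via reviewed PR"],
    timeline=[
        TimelineEntry(datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
                      "symptom", "first container restart observed"),
        TimelineEntry(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
                      "change", "config change merged"),
    ],
)
print(review.to_json())
```

Sorting the timeline at export time rather than at insertion lets evidence be appended in whatever order it is collected while still giving the review a chronological view.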