Upgrade Management
Overview

Kubegrade Upgrade Management helps teams reduce manual effort and upgrade risk by turning Kubernetes upgrades into a guided, auditable workflow.

Core outcomes:

- Faster upgrade planning
- Preflight validation before changes
- Safer execution with approvals
- PR-based, GitOps-compatible remediation where config changes are required
- Better auditability and rollback readiness

Supported Kubernetes Versions/Providers

Providers:

- Amazon / AWS / EKS
- Google Cloud / GKE
- Microsoft Azure / AKS
- OpenShift
- DigitalOcean
- Alibaba Cloud
- Rancher/RKE (if supported)
- Self-managed upstream Kubernetes

Version support format (recommended). Provide:

- Minimum supported version per provider
- Tested upgrade paths (n → n+1, etc.)
- Unsupported/skipped paths
- Provider-specific notes (managed add-ons, control plane sequencing, nodegroup specifics)

Preflight Checks (API Deprecations, Add-ons, Constraints)

Preflight checks are run before upgrade execution to identify blockers and risk.

Typical checks:

- Kubernetes API deprecations in manifests
- Add-on compatibility (CNI, CSI, ingress controllers, metrics stack)
- Version skew constraints
- Pod disruption budget / capacity constraints
- Node image/runtime prerequisites
- Admission controller and policy conflicts
- Deprecated Helm chart APIs / CRD compatibility
- Cluster health baseline (unhealthy nodes, failing workloads)

Output

Preflight results should be categorized as:

- Blockers (must fix)
- Warnings (recommended fixes)
- Informational notes

Upgrade Plans (Single / Multi-Cluster)

A single-cluster upgrade plan includes:

- Cluster
- Target version
- Preflight status
- Required remediations
- Maintenance window
- Rollback strategy
- Approval requirements

A multi-cluster upgrade plan adds:

- Sequencing (development → canary → staging → production)
- Cluster grouping by environment/provider/team
- Parallelism constraints
- Success gates between waves
- Standardized runbook application

Recommended pattern:

- Development cluster
- Canary cluster
- Staging cluster(s)
- Limited production wave
- Broad production rollout

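
To make the preflight checks and result categories above concrete, here is an added example (not Kubegrade's own code) that scans manifests for Kubernetes APIs deprecated or removed at the target version and sorts the findings into blockers, warnings and informational notes. The function names, the `REMOVALS` subset and the sample manifests are assumptions made for illustration; treating a skipped minor version as a blocker is likewise an assumption based on the n → n+1 path the page lists.

```python
import re

# (apiVersion, kind) -> (deprecated_in, removed_in, replacement) as 1.x minor numbers.
# Small subset of the upstream Kubernetes API removals, enough for the sample below.
REMOVALS = {
    ("networking.k8s.io/v1beta1", "Ingress"): (19, 22, "networking.k8s.io/v1"),
    ("batch/v1beta1", "CronJob"): (21, 25, "batch/v1"),
    ("policy/v1beta1", "PodDisruptionBudget"): (21, 25, "policy/v1"),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (23, 26, "autoscaling/v2"),
}

# Constructed sample manifests (hypothetical workload named "web").
SAMPLE = """\
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata: {name: web-pdb}
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata: {name: web-hpa}
---
apiVersion: apps/v1
kind: Deployment
metadata: {name: web}
"""

def minor(version):
    """'1.25' or 'v1.25.3' -> 25"""
    return int(version.lstrip("v").split(".")[1])

def preflight(manifests, current, target):
    """Return (category, message) findings; category is blocker, warning or info."""
    cur, tgt = minor(current), minor(target)
    findings = []
    if tgt - cur > 1:  # assumption: only the n -> n+1 path is supported
        findings.append(("blocker", f"{current} -> {target} skips a minor version"))
    for doc in re.split(r"^---\s*$", manifests, flags=re.M):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, re.M)
        kind = re.search(r"^kind:\s*(\S+)", doc, re.M)
        key = (api.group(1), kind.group(1)) if api and kind else None
        if key not in REMOVALS:
            continue
        deprecated_in, removed_in, repl = REMOVALS[key]
        category = "blocker" if tgt >= removed_in else "warning" if tgt >= deprecated_in else "info"
        findings.append((category, f"{key[1]} {key[0]}: removed in 1.{removed_in}, use {repl}"))
    return findings or [("info", "no deprecated or removed APIs in scanned manifests")]

if __name__ == "__main__":
    for category, message in preflight(SAMPLE, "1.24", "1.25"):
        print(f"{category.upper():8} {message}")
```

Run against rendered manifests (for example the output of a chart render) rather than templates, since the scan only reads top-level `apiVersion` and `kind` lines.
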
Scheduling + Maintenance Windows

Kubegrade supports scheduling upgrades and workflows within controlled windows via the agent's scheduling capabilities:

- One-time scheduled execution
- Recurring maintenance windows
- Timezone-aware windows
- Environment-specific restrictions
- Freeze windows / blackout periods

Operational controls:

- Preflight refresh before execution
- Re-approval if conditions changed
- Auto-cancel on critical health regressions (if configured)

What Gets Written Where (PR Contents, Provenance)

Typical PR contents:

- Manifest or values updates
- Version pin changes
- Compatibility fixes
- Config changes required by target version
- Explanatory summary of why changes are required
- References to preflight findings

Provenance metadata (recommended):

- Workflow run ID
- Cluster / environment
- Policy checks passed/failed
- Agent/version used
- Timestamp
- Approver(s) (linked in audit trail)

Runbooks

Common upgrade failure modes

1) Node upgrade stalls

Possible causes:

- Pod disruption budgets too strict
- Capacity shortage
- DaemonSet rollout blocking drain
- Misconfigured eviction settings

Response:

- Review draining events
- Temporarily scale capacity / relax constraints (approved)
- Retry wave

2) Workloads fail after upgrade

Possible causes:

- API deprecations missed
- Admission/policy incompatibility
- Ingress/controller changes
- CRD/version mismatch

Response:

- Review failing workloads and dependency graph impact
- Generate remediation PR
- Roll back workload config or cluster wave if needed

3) Add-on incompatibility

Possible causes:

- Unsupported add-on version for target K8s version
- CSI/CNI version skew
- Metrics/monitoring stack incompatibility

Response:

- Pin compatible add-on version
- Upgrade add-ons in required order
- Re-run preflight

4) Control plane/node version skew issues

Possible causes:

- Validate provider sequencing rules
- Reconcile managed nodegroups
- Apply supported path only

Response:

- Validate provider sequencing rules
- Reconcile managed nodegroups
- Apply supported path only