# Overview
Welcome to Kubegrade. Kubegrade is an operational layer for Kubernetes teams that need to keep clusters reliable, auditable, and maintainable as environments grow. It helps platform teams and developers:

- understand what is running across clusters
- detect risk (drift, misconfigurations, upgrade issues)
- investigate incidents faster
- propose safe fixes as pull requests
- execute changes through existing GitOps workflows

Kubegrade is built for real production environments and supports cloud, on-prem, and hybrid deployments ([kubegrade.com](https://kubegrade.com/)).

## What Kubegrade Is (and Isn't)

### What Kubegrade is

Kubegrade is a Kubernetes operations platform focused on day-2 operations, including:

- upgrade management
- troubleshooting
- drift detection
- visualization and dependency mapping
- GitOps-based remediation
- guided/agentic workflows with human approval

It combines cluster state, configuration context, and workflow automation to help teams move from detection to action faster (source: kubegrade.com).

### What Kubegrade isn't

Kubegrade is not:

- a replacement for your Kubernetes provider (EKS/GKE/AKS/OpenShift)
- a cluster provisioner by default
- a generic CI/CD platform
- a pure observability tool
- a "black box" auto-remediator that bypasses human review

Kubegrade is designed to work with your existing tooling and processes, not replace them. The platform emphasizes human-in-the-loop approvals and GitOps execution.

## Architecture at a Glance

Kubegrade typically includes the following components.

1) **Kubegrade control plane**: the central service/UI where users:
   - view cluster posture
   - run investigations
   - review plans and recommendations
   - approve remediations
   - manage policies, roles, and workflows

2) **In-cluster agent(s)**: lightweight agents/controllers deployed in target clusters that:
   - collect telemetry/metadata
   - execute approved workflows
   - respect cluster RBAC and network boundaries
   - support upgrade planning and operational actions

   The current docs draft already describes Kubegrade agents as lightweight controllers that execute plans and collect telemetry securely while respecting RBAC/network isolation (source: Archbee docs draft, app.archbee.com preview).

3) **IaC / Git / GitOps integrations**: Kubegrade connects to:
   - IaC/config sources (Terraform, Helm, Kustomize)
   - Git repositories
   - GitOps systems (Argo CD / Flux)

   in order to generate auditable PRs and align runtime state with declared state.

4) **Observability / notification integrations**: Kubegrade can connect to monitoring, alerting, and collaboration systems to enrich context and route actions.

## Core Workflow: Scan → Visualize → Propose PR → Approve → Execute (GitOps)

This is Kubegrade's core operating loop and mirrors the current site workflow (source: kubegrade.com).

1) **Scan.** Kubegrade analyzes:
   - cluster resources and workload state
   - events and signals
   - configuration and policy posture
   - IaC / Git definitions (when connected)

   Purpose: identify drift, risk, failures, or upgrade blockers, and build context for troubleshooting and change recommendations.

2) **Visualize.** Kubegrade presents interactive views of:
   - cluster objects and dependencies
   - service relationships
   - warning states and impacted components
   - relevant context for investigations

   This aligns with the current positioning around interactive dependency graphs and detailed insights.

3) **Propose PR.** Kubegrade drafts changes as pull requests (e.g., YAML, Helm values, Terraform changes) so teams can review proposed remediations in their existing workflow. The homepage explicitly references AI-drafted PRs and GitOps execution.

4) **Approve.** Teams review:
   - the proposed changes
   - safety checks and policy outcomes
   - the scope of impact
   - the execution plan / timing

   Kubegrade is designed to keep humans in the loop.

5) **Execute (GitOps).** Changes are applied through your GitOps process (e.g., Argo CD or Flux), preserving:
   - auditability
   - review controls
   - provenance of change
   - rollback paths

## Glossary

- **Agent**: a lightweight Kubegrade component deployed in a cluster that collects metadata/telemetry and executes approved workflows within the permissions granted to it (source: Archbee docs draft).
- **Audit trail**: a record of user actions, system recommendations, approvals, and executed changes.
- **Cluster**: a Kubernetes environment managed by a provider or self-hosted control plane.
- **Drift**: a mismatch between desired configuration (Git/IaC) and actual runtime cluster state.
- **Environment**: a logical stage of delivery, such as dev, staging, or production.
- **GitOps remediation**: a workflow where Kubegrade proposes infrastructure/application changes as pull requests, and changes are applied by GitOps tooling after merge.
- **Human in the loop**: a control model where Kubegrade can automate analysis and recommendations, but users review/approve before execution.
- **Policy / guardrail**: a rule that constrains or validates what changes are allowed, who can run actions, and when.
- **Project**: a logical grouping of clusters, workflows, and policies tied to a team or application area.
- **Workspace**: a higher-level boundary used to organize projects, clusters, users, and policies (especially for larger teams). The draft docs already mention workspaces grouping clusters, policies, and automation workflows.

# Getting Started

## Quickstart (First Cluster in 5 Minutes)

This quickstart gets you from account access to a connected cluster and a first safe change proposal.

**Step 1 — Sign in to Kubegrade**
- Create or log in to your Kubegrade account.
- Select or create an organization/workspace.

**Step 2 — Add a cluster**
- Choose your provider (EKS / GKE / AKS / OpenShift / self-managed).
- Install the Kubegrade agent into the target cluster.
- Confirm cluster registration in the UI.

**Step 3 — Connect GitOps and/or IaC (recommended)**
- Link your Git provider/repository.
- Optionally connect Terraform/Helm/Kustomize sources.
- Map the cluster/workspace to repo(s).

**Step 4 — Run an initial scan**
- Trigger a cluster scan.
- Review health, warnings, drift indicators, and the object graph.

**Step 5 — Run a safe workflow example**
- Select a non-production namespace.
- Trigger a validation/fix recommendation workflow.
- Review the generated PR.
- Approve and merge via your normal Git process.
- Confirm rollout status in GitOps.

## Requirements

Minimum platform requirements:
- a Kubernetes cluster you can administer (or delegated access to install an agent)
- permission to create Kubernetes resources (agent install)
- outbound network access from the cluster to Kubegrade endpoints (for SaaS/hybrid modes)
- access to your Git/IaC repositories for PR workflows (optional but strongly recommended)

Recommended operational prerequisites:
- one GitOps tool (Argo CD or Flux)
- namespace/environment conventions (dev/stage/prod)
- a basic RBAC model for platform/dev/auditor roles
- a staging environment for testing before production rollout

Supported targets (high level):
- managed Kubernetes providers (EKS, GKE, AKS)
- enterprise/self-managed clusters (OpenShift, Rancher/RKE, upstream Kubernetes)
- hybrid/multi-cluster environments

## Deployment Options (SaaS / On-Prem / Hybrid)

**SaaS.** The Kubegrade control plane is hosted by Kubegrade; you install agents in your clusters and connect supported integrations. Best for: faster setup, lower operational overhead, centralized management.

**On-prem.** Kubegrade control-plane and data-plane components run in your environment. Best for: regulated workloads, strict data-residency or network constraints, enterprises requiring private deployment models.

**Hybrid.** A mix of hosted and private components (e.g., a hosted control plane with private agents/connectivity and controlled data paths). Best for: enterprises that want a managed UX plus controlled execution/data boundaries.

The homepage explicitly states cloud, on-prem, and hybrid support (source: kubegrade.com).

## Connect Your First Cluster

1) **Choose provider and cluster.** From the Kubegrade UI:
   - select **Add Cluster**
   - choose the provider type
   - name the cluster and assign an environment (dev/stage/prod)
   - assign a workspace/project

2) **Install the Kubegrade agent.** Deploy the provided manifest/Helm chart to the cluster. A typical installation includes:
   - a namespace for Kubegrade agent components
   - service account and RBAC resources
   - deployment(s) for the agent/controller
   - optional secret/config resources (depending on integration mode)

3) **Register and validate.** Kubegrade validates:
   - agent heartbeat
   - API connectivity
   - permission scope
   - basic cluster inventory access

4) **Assign scope.** Optionally restrict the cluster integration to:
   - specific namespaces
   - specific actions/workflows
   - read-only mode for initial rollout

## Connect IaC

Kubegrade becomes more powerful when it can compare live state against declared configuration and generate changes in the same systems your team already uses.

Supported IaC/config types: Terraform, Helm, Kustomize.

What connecting IaC enables:
- runtime vs. IaC comparison
- drift detection with source context
- PR generation against your actual config repo
- safer remediation with reviewable changes

Typical setup flow:
1. Connect the Git provider/repository.
2. Select repo path(s).
3. Identify the config format (Terraform/Helm/Kustomize).
4. Map the repo to a cluster/workspace/environment.
5. Validate access and test diff parsing.

The site and docs drafts already emphasize cluster + IaC context, PR suggestions, and support for Helm/Terraform/Argo CD integrations.

## Verify Access + Health Checks

After cluster and IaC connection, confirm the following.

Cluster connectivity:
- agent is online
- cluster API is reachable
- inventory sync is successful

Permissions:
- agent can read the required resource types
- write/execute permissions align with the intended mode (read-only vs. execution-enabled)
- namespace restrictions are correctly enforced

Integrations:
- Git repository access verified
- GitOps system connection valid (if configured)
- observability integrations reachable (if configured)
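As an illustration only, the read-only initial rollout and namespace scoping described above might look like the following agent install values file. This is a hypothetical sketch: the key names and structure are assumptions for explanation, not the actual chart schema.

```yaml
# Hypothetical Helm values for a read-only initial rollout of the agent.
# All keys are illustrative; consult the real chart for the actual schema.
agent:
  mode: read-only          # no execution until explicitly elevated
  clusterName: dev-us1     # example cluster name
  environment: dev
scope:
  namespaces:              # restrict collection/actions to these namespaces
    - team-a-dev
    - team-b-dev
rbac:
  create: true             # install least-privilege Role/RoleBinding resources
```

Starting in `read-only` mode with a narrow namespace scope matches the recommended rollout pattern: validate inventory and health checks first, then elevate to execution-enabled mode per namespace.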
Health checks (recommended):
- run a metadata-only scan
- open the dependency graph view
- test a dry-run recommendation (no execution)
- review audit-log entries for setup actions

## First Workflow Run (Example: Safe Change via PR)

Use a low-risk, non-production example to validate the end-to-end flow.

Example workflow: fix configuration drift in a dev namespace.
1. Select the cluster and namespace (dev).
2. Run a drift-detection scan.
3. Review the identified drift.
4. Ask Kubegrade to propose remediation.
5. Review the generated PR (files, diffs, rationale).
6. Approve PR generation.
7. Merge the PR in your Git provider.
8. Let Argo CD / Flux apply the change.
9. Confirm the drift is resolved in Kubegrade.

What success looks like:
- drift warning cleared
- PR linked in the audit trail
- GitOps rollout status visible
- no unapproved direct mutation in the cluster

# Platform Concepts

## Organizations, Workspaces, Projects

**Organization**: the top-level account boundary for:
- users
- billing
- security settings
- shared policies and integrations

**Workspace**: an operational grouping for teams or business units. The draft docs describe workspaces as grouping clusters, policies, and automation workflows (source: Archbee docs draft). Use workspaces to separate:
- teams
- business domains
- regions
- compliance zones

**Project**: a scoped unit within a workspace used to group related:
- clusters
- applications/services
- policies
- workflow defaults

Recommended pattern:
- org = company
- workspace = platform domain / BU / region
- project = product/service/application group

## Clusters and Environments (dev/stage/prod)

Kubegrade supports managing multiple clusters and mapping them to lifecycle environments.

- **Cluster**: a Kubernetes environment (EKS/GKE/AKS/OpenShift/self-managed) connected to Kubegrade.
- **Environment**: a logical delivery stage:
  - **dev**: experimentation, low-risk validation
  - **staging**: production-like testing and upgrade rehearsals
  - **prod**: business-critical execution with stricter controls

Best practices:
- tag clusters with environment metadata
- apply stricter guardrails to production
- rehearse upgrades/remediations in staging before prod
- use separate approval policies per environment

The draft docs already define clusters as logical units representing environments such as dev/staging/prod.

## Namespaces and Scope Controls

Namespaces are the primary unit for multi-tenant control within a cluster. Kubegrade supports scope control at multiple layers:
- cluster-level scope
- namespace-level scope
- workflow/action-level scope
- role-based access scope

Common use cases:
- developers can view and run approved workflows only in assigned namespaces
- platform admins can manage upgrades cluster-wide
- auditors get read-only access to dashboards and logs

Recommended rollout:
1. Start with read-only access across the cluster.
2. Enable execution for non-prod namespaces first.
3. Expand gradually with explicit policies.

## Policies and Guardrails

Policies and guardrails define what Kubegrade can recommend or execute, and under what conditions.

Policy types (typical):
- approval requirements (who must approve)
- time windows (maintenance windows only)
- scope restrictions (namespace/cluster/workspace)
- change restrictions (e.g., no prod write actions without multi-approval)
- execution mode (read-only, suggest-only, execute via GitOps)
- severity-based routing (critical issues escalate)

Guardrail goals: prevent unsafe actions, maintain consistency, enforce compliance requirements, and preserve human oversight.

## Human-in-the-Loop Approvals

Kubegrade is designed to support automation without removing operator control. The site and the troubleshooting page both emphasize human-in-the-loop operation (source: kubegrade.com).

What human in the loop means in Kubegrade:
- Kubegrade can analyze and propose actions.
- Users review generated plans/PRs.
- Execution requires approval based on policy.
- GitOps remains the system of execution when enabled.

Approval stages (common):
1. recommendation review
2. PR generation approval
3. merge approval (in the Git provider)
4. post-change verification review (optional)

Why this matters: reduced risk, better auditability, easier adoption by platform teams, and compliance alignment.

## Audit Trail Model

Kubegrade maintains a traceable history of operational activity.

Typical audit events:
- user sign-in / role changes
- cluster connection updates
- policy changes
- workflow runs
- recommendations generated
- PRs created/linked
- approvals granted/denied
- execution outcomes
- rollback actions

Audit trail goals:
- explain what changed
- show who approved it
- link recommendations to actual diffs/PRs
- support incident reviews and compliance checks

The draft docs mention audit logs and historical audit timelines/execution trails.

## Roles & Permissions (Owner / Admin / Developer / Viewer)

The draft docs already describe a role model including organization owner, cluster administrator/admin, developer, and viewer/auditor (source: Archbee docs draft).

**Owner**: full access to the organization. Can typically:
- manage billing and org-wide settings
- manage all users and roles
- configure identity integration
- view all projects/clusters/workspaces
- override policy defaults (if permitted)

**Admin**: administrative access within assigned workspace/project/cluster scopes. Can typically:
- connect clusters and integrations
- manage policies and team assignments
- run and approve operational workflows (within scope)
- configure execution settings

**Developer**: operational access for application teams within assigned scopes. Can typically:
- view cluster/workload state
- run approved workflows in assigned namespaces
- review recommendations and open PRs
- access dashboards and troubleshooting views

**Viewer**: read-only access. Can typically:
- view dashboards, graphs, and audit logs
- review compliance/audit posture
- export reports (if enabled)

Viewers cannot execute or approve changes.

# Security & Infrastructure

## Security Overview

Kubegrade is designed for enterprise Kubernetes operations, with strong emphasis on:
- least-privilege access
- human-in-the-loop approvals
- GitOps-based change execution
- auditability
- support for private/on-prem/hybrid deployments

The public messaging already emphasizes secure, auditable workflows and human oversight.

## Data Handling & What Leaves the Cluster

Kubegrade should be documented as using a minimum-necessary data model, with deployment-mode differences.

Typical data used by Kubegrade:
- Kubernetes object metadata and configuration state
- events / status conditions
- health and dependency context
- workflow execution metadata
- optional integration metadata (Git/IaC references, alert identifiers)

What may leave the cluster (SaaS / hybrid), depending on configuration:
- telemetry metadata
- resource state summaries
- policy/drift findings
- workflow request/response metadata
- optional logs/diagnostic snippets (if enabled)

What should be explicitly documented by deployment mode:
- whether payloads include object specs vs. metadata only
- whether secrets are collected (recommended: no secret values)
- retention periods
- encryption in transit / at rest
- regional hosting options (if any)

The draft docs note agents collecting telemetry securely and not exfiltrating beyond telemetry metadata; formalize the exact behavior here.

## Network Requirements (Egress, Proxies)

Outbound connectivity (typical). Agents usually require outbound access to:
- Kubegrade control-plane endpoints
- identity endpoints (if SSO is enabled)
- Git provider APIs (if PR generation is enabled)
- optional observability APIs/integrations

Inbound connectivity: prefer no inbound internet exposure to clusters; for SaaS mode, use an outbound-only agent pattern where possible.

Proxy support. Document support for:
- HTTP/HTTPS proxies
- authenticated proxies
- custom CA bundles
- no-proxy exclusions for internal services

Recommended documentation items: exact FQDNs/ports, TLS requirements, timeout/retry behavior, and firewall allowlist guidance.

## Private Connectivity (VPC/VNet Peering / PrivateLink Equivalents, If Applicable)

If supported, document private connectivity options by cloud:
- AWS: PrivateLink / VPC peering / Transit Gateway patterns
- Azure: Private Endpoint / VNet peering
- GCP: Private Service Connect / VPC peering
- enterprise private routing for on-prem hybrid deployments

If not yet GA:
- mark as "available by deployment model / enterprise request"
- document supported reference architectures
- clarify what remains public (if anything)

## IP Allowlisting

For environments enforcing outbound restrictions:
- publish Kubegrade egress destination IPs/CIDRs (if static), or publish endpoint domains with certificate-pinning guidance
- document the IP rotation/change notification policy
- provide separate lists per SaaS region, if applicable

For inbound (if any webhooks/callbacks are used): publish Kubegrade source IPs for webhook delivery.

## TLS / Certificates

In transit:
- TLS for agent ↔ control-plane communication
- TLS for web UI/API access
- TLS for integration calls (Git, observability, identity providers)

Enterprise requirements. Document support for:
- custom CA trust bundles
- internal PKI certificates (on-prem)
- TLS termination models (if self-hosted)
- certificate rotation practices

Best practices: never disable certificate verification in production; rotate credentials/certs on a schedule; separate certs per environment where possible.

## Secrets Management

Kubegrade should avoid collecting or storing secret values unless strictly required.

Recommended approach:
- use Kubernetes Secrets only for agent/integration credentials, where necessary
- prefer external secret managers and short-lived credentials
- scope credentials per workspace/project/integration
- encrypt secrets at rest
- redact secrets from logs/UI/audit exports

Documentation should specify:
- where secrets live (SaaS/on-prem/hybrid differences)
- rotation procedures
- revocation procedures
- how to re-authenticate integrations without downtime

## RBAC Model + Least Privilege

The draft docs already emphasize access controls and agent respect for RBAC.

RBAC principles:
- least privilege by default
- read-only initial onboarding mode
- explicit elevation for execution-enabled workflows
- namespace scoping where practical
- separate permissions for analysis vs. action

Kubernetes-side RBAC. Document:
- required API groups/resources for read operations
- additional permissions required for execution workflows
- optional privileges by module (upgrades, drift remediation, troubleshooting actions)

Kubegrade app RBAC. Document:
- owner/admin/developer/viewer permissions
- scope inheritance (org → workspace → project → cluster)
- separation of approval authority vs. execution authority

## Compliance Notes (SOC 2 / ISO)

Use this page as a factual capability page (not a marketing claim page); include only confirmed claims.

Document:
- compliance frameworks supported by platform controls (e.g., audit logs, RBAC, approval workflows, least privilege)
- security practices relevant to customer audits
- deployment models for regulated environments (on-prem/hybrid)
- data handling controls

If certifications are in progress, state clearly:
- the SOC 2 / ISO 27001 status (in progress / planned / certified)
- the scope of certification (company vs. product vs. hosting environment)

Do not overstate certification or attestation status unless already formalized.

# Product Modules (Mirrors the Website Product Pages)

## Upgrade Management

### Overview

Kubegrade upgrade management helps teams reduce manual effort and upgrade risk by turning Kubernetes upgrades into a guided, auditable workflow.

Core outcomes:
- faster upgrade planning
- preflight validation before changes
- safer execution with approvals
- PR-based, GitOps-compatible remediation where config changes are required
- better auditability and rollback readiness

The public messaging strongly emphasizes upgrade automation, safety checks, and reduced manual toil (source: kubegrade.com).

### Supported Kubernetes Versions/Providers

Document support as a compatibility matrix.

Providers:
- EKS
- GKE
- AKS
- OpenShift
- Rancher/RKE (if supported)
- self-managed upstream Kubernetes

Version support format (recommended):
- minimum supported version per provider
- tested upgrade paths (n → n+1, etc.)
- unsupported/skipped paths
- provider-specific notes (managed add-ons, control-plane sequencing, nodegroup specifics)

### Preflight Checks (API Deprecations, Add-ons, Constraints)

Preflight checks run before upgrade execution to identify blockers and risk.

Typical checks:
- Kubernetes API deprecations in manifests
- add-on compatibility (CNI, CSI, ingress controllers, metrics stack)
- version-skew constraints
- PodDisruptionBudget / capacity constraints
- node image/runtime prerequisites
- admission controller and policy conflicts
- deprecated Helm chart APIs / CRD compatibility
- cluster health baseline (unhealthy nodes, failing workloads)

Output: preflight results should be categorized as:
- **blockers** (must fix)
- **warnings** (recommended fixes)
- **informational** notes

### Upgrade Plans (Single / Multi-Cluster)

A single-cluster upgrade plan includes:
- cluster
- target version
- preflight status
- required remediations
- maintenance window
- rollback strategy
- approval requirements

A multi-cluster upgrade plan adds:
- sequencing (canary → staging → production)
- cluster grouping by environment/provider/team
- parallelism constraints
- success gates between waves
- standardized runbook application

Recommended pattern:
1. canary cluster
2. staging cluster(s)
3. limited production wave
4. broad production rollout

### Scheduling + Maintenance Windows

Kubegrade should support scheduling upgrades and workflows within controlled windows.

Scheduling capabilities (recommended):
- one-time scheduled execution
- recurring maintenance windows
- timezone-aware windows
- environment-specific restrictions
- freeze windows / blackout periods
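To make the scheduling capabilities concrete, a maintenance-window policy could be expressed roughly as follows. This is a hypothetical schema for illustration only; none of the field names are confirmed Kubegrade configuration.

```yaml
# Hypothetical maintenance-window policy; all field names are illustrative.
policy: upgrade-window-prod
scope:
  environment: prod           # environment-specific restriction
windows:
  - days: [Sat, Sun]          # recurring maintenance window
    start: "02:00"
    end: "06:00"
    timezone: Europe/Berlin   # timezone-aware window
freeze:
  - from: "2025-12-15"        # example blackout period (e.g., holiday freeze)
    to: "2026-01-05"
```

The key design point is that windows and freezes are scoped per environment, so production can carry stricter timing restrictions than dev or staging.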
freeze windows / blackout periods operational controls preflight refresh before execution re approval if conditions changed auto cancel on critical health regressions (if configured) what gets written where (pr contents, provenance) when kubegrade proposes upgrade related changes, document exactly what gets generated typical pr contents manifest or values updates version pin changes compatibility fixes config changes required by target version explanatory summary of why changes are required references to preflight findings provenance metadata (recommended) workflow run id cluster / environment policy checks passed/failed agent/version used timestamp approver(s) (linked in audit trail) this aligns with your pr suggestion + gitops execution model and auditability positioning ( kubegrade https //kubegrade com/ ) runbooks common upgrade failure modes 1\) node upgrade stalls possible causes pod disruption budgets too strict capacity shortage daemonset rollout blocking drain misconfigured eviction settings response review draining events temporarily scale capacity / relax constraints (approved) retry wave 2\) workloads fail after upgrade possible causes api deprecations missed admission/policy incompatibility ingress/controller changes crd/version mismatch response review failing workloads and dependency graph impact generate remediation pr roll back workload config or cluster wave if needed 3\) add on incompatibility possible causes unsupported add on version for target k8s version csi/cni version skew metrics/monitoring stack incompatibility response pin compatible add on version upgrade add ons in required order re run preflight 4\) control plane/node version skew issues response validate provider sequencing rules reconcile managed nodegroups apply supported path only troubleshooting overview kubegrade troubleshooting turns kubernetes troubleshooting from a reactive, manual process into a guided, context rich workflow with automation and human control your current 
product page explicitly positions troubleshooting as intelligent/automated, with multi layer diagnosis, dependency graphs, and human in the loop controls ( kubegrade https //kubegrade com/troubleshooting/ ) guided investigations (what kubegrade inspects) kubegrade investigates issues using correlated signals across kubernetes layers common inspection domains pod and container status events and recent changes resource pressure (cpu/memory) scheduling constraints service/endpoints/ingress routing dns/network indicators configmaps/secrets references and misconfigurations dependency relationships across workloads/services your troubleshooting page specifically mentions issue propagation via dependency graphs and common problem categories including dns, scheduling, configmaps/secrets, and performance/resource bottlenecks ( kubegrade https //kubegrade com/troubleshooting/ ) common incident playbooks (crashloopbackoff, oom, 5xx, dns, etc ) crashloopbackoff kubegrade checks container logs exit codes liveness/readiness probe failures missing env/config/secret references recent config changes typical outputs root cause candidates impacted dependent services proposed config or probe adjustments via pr oomkilled / resource exhaustion kubegrade checks memory limits/requests actual usage patterns recent release changes node pressure and eviction signals typical outputs right size recommendations hpa/resource tuning suggestions pr ready config updates (if enabled) 5xx / ingress/service routing issues kubegrade checks ingress/controller status service selectors/endpoints pod readiness upstream dependency health dns/network path issues dns / network connectivity issues kubegrade checks coredns status service dns resolution networkpolicy constraints endpoint availability namespace/service name mismatches scheduling failures kubegrade checks resource requests vs node capacity taints/tolerations node selectors / affinity pdb interactions evidence collection + timelines kubegrade 
should present investigations as a structured evidence chain evidence types events status transitions metrics snapshots dependency graph changes config diffs / git changes alert triggers (if integrated) timelines a useful incident timeline should show first observed symptom related alerts and state transitions changes preceding the incident actions taken (manual or kubegrade assisted) recovery confirmation your draft docs mention historical audit timelines and execution trails, which can be reused here ( archbee https //app archbee com/public/preview ns vufbg6zzbf6kxonspk/preview s1v5qyzjfbxmue ifejd6 ) suggested remediations → pr workflow for supported cases, kubegrade can convert troubleshooting findings into reviewable changes flow investigation identifies root cause candidates kubegrade proposes remediation options user selects/edits preferred option kubegrade generates pr against git/iac source user reviews and merges gitops applies kubegrade verifies recovery signals this directly matches your pr + gitops model ( kubegrade https //kubegrade com/ ) post incident review / export after resolution, kubegrade can support review and reporting post incident review contents incident summary affected clusters/namespaces/services timeline of events and changes root cause (confirmed/probable) actions taken pr links / approvals prevention follow ups export options (recommended) json/csv event export pdf/markdown incident report webhook push to ticketing systems linkable audit record drift detection overview kubegrade drift detection identifies mismatches between your live cluster state and intended state (git/iac/policy baselines) so teams can correct drift before it becomes an outage or audit problem your homepage and draft docs explicitly reference drift detection and cluster+iac context ( kubegrade https //kubegrade com/ ) drift sources (cluster vs iac vs git) cluster vs iac drift live cluster resources differ from terraform/helm/kustomize defined state cluster vs git 
drift git tracked manifests/configs no longer match what is running in the cluster policy baseline drift cluster/workload configuration diverges from internal standards or approved baselines cross environment drift staging/prod environments diverge unintentionally over time drift policies (what to flag / severity) what to flag (examples) manual changes outside gitops resource limits/requests changed in cluster only networkpolicy differences ingress/service selector mismatches deprecated/unsupported api usage missing labels/annotations required by policy image tag/version deviations rbac drift severity model (recommended) critical security/compliance risk, production impact likely high likely operational risk or audit issue medium inconsistency or future risk low informational / cosmetic drift drift remediation via pr kubegrade should support turning drift findings into prs that restore desired state remediation modes reconcile cluster to git/iac (preferred in gitops environments) update git/iac to reflect approved runtime changes (controlled exceptions) open review only recommendation without pr pr contents exact diff to restore alignment drift classification and severity evidence of where mismatch was detected impact notes (if applicable) exclusions and suppression rules not all drift should trigger alerts common exclusions auto generated labels/annotations runtime status fields ephemeral resources known provider managed mutations approved temporary overrides (time bound) suppression best practices scope suppressions narrowly (resource/namespace/path) add expiration date require reason/comment track suppressions in audit logs intelligent dependency graphs overview kubegrade provides deep cluster visualization and dependency mapping to help teams understand relationships, blast radius, and warning states quickly your site and docs drafts both reference interactive dependency graphs and dynamic visual dashboards ( kubegrade https //kubegrade com/ ) object level 
graph model (workloads/services/ingress/network) kubegrade’s graph should model object level relationships such as namespace → workload (deployment/statefulset/daemonset) workload → pods service → endpoints/pods ingress → services configmap/secret references networkpolicy relationships storage dependencies (pvc/pv) optional external dependencies (via integrations/annotations) purpose faster root cause analysis change impact assessment upgrade dependency planning filters, scopes, saved views filters cluster / environment namespace application/service resource kind health state / warning severity owner/team labels scopes users should be able to graph full cluster namespace only service centric blast radius incident specific subgraph saved views save and reuse views for critical services upgrade canary scope audit/compliance reviews team handoffs warning states + “what changed” views warning states overlay graph nodes/edges with states such as degraded pending / unschedulable failed probes policy violation drift detected upgrade blocker “what changed” views compare graph snapshots over time to identify new/removed dependencies routing changes config linked impact service endpoint shifts this is especially useful for post incident reviews and upgrade regressions sharing and exporting views sharing link based internal sharing (rbac respected) saved workspace/project views embed in incident tickets/docs (if supported) exporting image export (png/svg) data export (json graph metadata) snapshot attachment to incident reports ai agents overview (agents are “goal oriented workflows”) kubegrade ai agents are goal oriented operational workflows designed to help teams complete specific kubernetes tasks (e g , troubleshooting, upgrades, pr generation) with strong context and human control your draft docs and public pages already describe agentic workflows, ai driven assistance, and human in the loop operation ( archbee https //app archbee com/public/preview ns 
### Built-in agents (upgrade, troubleshoot, PR generation)

**Upgrade agent**

Purpose:

- Analyze target-version readiness
- Run preflight checks
- Propose upgrade sequencing and remediation steps
- Generate PRs for required config changes

**Troubleshooting agent**

Purpose:

- Investigate symptoms across multiple signal layers
- Identify probable causes
- Suggest and optionally prepare remediations

**PR generation agent**

Purpose:

- Convert approved remediation plans into repo-specific pull requests
- Preserve format/structure conventions
- Attach rationale and provenance metadata

### Custom agents (your MCP / tool-connectors model)

Custom agents let teams define workflows that combine Kubegrade context with external tool context.

**Typical custom agent inputs**

- Cluster metadata and object state
- Dependency graph context
- IaC source context (Terraform/Helm/Kustomize)
- Git/GitOps state
- External tools via connectors/MCPs (e.g., Terraform, Argo CD)

**Example custom use cases**

- Policy enforcement workflows
- Scheduled drift checks + PR creation
- Environment readiness checks before releases
- Cost/risk posture reviews across the fleet

### Prompting patterns + guardrails

Even for AI-assisted workflows, outputs should be constrained.

**Recommended prompting patterns**

- Goal + scope + constraints (cluster/env/namespace)
- Desired action mode (analyze only / propose / PR-ready)
- Risk tolerance and change restrictions
- Output format requirements (summary, checklist, PR body)

**Guardrails**

- Read-only by default for new workflows
- Policy checks before proposing or executing
- Explicit approval for PR generation/execution
- Scope limits (namespace/workspace/project)
- Logging/audit of agent actions and outputs

### Approval + execution controls

Kubegrade keeps AI-assisted workflows under operator control.

**Controls to document**

- Approval requirements by severity/environment
- Who can run agents vs. who can approve results
- Suggest-only mode vs. PR generation mode
- Execution via GitOps only (recommended default for production)
- Time-window restrictions
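One way to express "approval requirements by severity/environment" from the controls above is a small decision table. A minimal sketch — the matrix values are assumptions, so tune them to your own policy:

```python
# Hypothetical approval matrix: (environment, severity) pairs that may be
# auto-approved; everything else needs explicit human sign-off.
AUTO_APPROVE = {
    ("dev", "low"), ("dev", "medium"),
    ("staging", "low"),
}

def approvals_required(environment: str, severity: str) -> int:
    """Return how many human approvals a proposed change needs."""
    if (environment, severity) in AUTO_APPROVE:
        return 0
    if environment == "prod":
        return 2  # e.g. service owner + platform approver
    return 1
```

Keeping the matrix as data (rather than scattered `if` statements) makes the policy auditable and easy to review alongside the rest of the configuration.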
### Feedback loop (how to improve suggestions)

Kubegrade can improve agent usefulness over time through operator feedback.

**Feedback signals (recommended)**

- Accepted vs. rejected suggestions
- Edits made before PR generation
- Rollback frequency
- Time to resolution after applying a recommendation
- "Incorrect root cause" or "missing context" tags

**Why it matters**

- Better remediation quality
- Less noisy suggestions
- More consistent outputs across teams

The draft docs already mention learning from feedback to refine future operations (Archbee docs draft).

## GitOps remediation

### Overview

GitOps remediation is Kubegrade's workflow for converting operational findings (drift, incidents, upgrade blockers, optimization recommendations) into reviewable pull requests and applying changes through your existing GitOps system. The public site explicitly describes PR suggestions and GitOps execution ([kubegrade.com](https://kubegrade.com/)).

### Supported repo layouts

Document supported patterns clearly.

**Common layouts**

- Monorepo (all environments/services)
- Environment-per-folder
- Service-per-folder
- Repo-per-service
- Platform repo + app repos split
- Helm values repos
- Terraform module + environment composition repos

**Best practice**

Require users to define:

- Repo path mappings per cluster/environment
- File ownership / CODEOWNERS expectations
- Branch strategy

### PR generation rules (naming, commit strategy, reviewers)

**Branch naming (recommended)**

`kubegrade/<module>/<cluster>/<issue-or-workflow-id>`

Examples:

- `kubegrade/drift/prod-eu1/restore-nginx-limits`
- `kubegrade/upgrade/staging/1-29-preflight-fixes`

**Commit strategy**

- One commit per logical remediation (preferred)
- Squash option for noisy generated changes
- Signed commits if required by org policy

**Reviewers**

- Auto-assign based on CODEOWNERS
- Add platform approvers for prod scopes
- Tag relevant service owners for workload-level changes

### Policy checks before PR

Before opening a PR, Kubegrade should:
- Validate scope and permissions
- Check policy compliance
- Check environment restrictions
- Check change-type restrictions
- Confirm required approvals
- Optionally run dry-run validation/tests (if integrated)

**If checks fail**

- Open a recommendation without a PR
- Show the blocking reasons and remediation steps

### Merge-to-apply mechanics

Kubegrade should document how changes get applied after PR merge.

**Typical flow**

1. PR is merged in the Git provider
2. The GitOps tool (Argo CD / Flux) detects the commit
3. GitOps sync applies changes to the cluster
4. Kubegrade watches rollout status / post-change signals
5. The audit trail links PR → sync → outcome

**Recommended controls**

- Sync windows
- Manual sync for production
- Health checks before marking complete

### Rollback PRs

Rollback support should be explicit and safe.

**Rollback approaches**

- Revert the generated PR commit(s)
- Generate a rollback PR from a known-good state
- Partial rollback for specific resources (advanced / policy-controlled)

**Rollback triggers (examples)**

- Failed post-change health checks
- Elevated error rates
- Dependency graph degradation
- SLO breach detected via integrations

### Auditability / change provenance

Every GitOps remediation should be traceable.

**Provenance fields (recommended)**

- Workflow source (drift/troubleshoot/upgrade/etc.)
- Cluster/environment/namespace
- Agent/workflow version
- Trigger type (manual/scheduled/alert-driven)
- Approvers and timestamps
- PR link and commit SHA
- GitOps sync result
- Verification result

## Alert sorting (coming soon)

### Overview

Alert sorting will help teams reduce alert noise by grouping related events, surfacing high-priority issues, and connecting alerts to Kubegrade workflows for faster action. This module is coming soon, and details may evolve.

### Ingest sources (Prometheus/Alertmanager, etc.)

**Planned/common sources**

- Prometheus Alertmanager
- Grafana alerts
- Datadog monitors
- PagerDuty incidents/events
- Webhook-based alert feeds
- Additional observability tools (as integrations expand)

### Dedup, grouping, suppression

**Planned capabilities**

- Deduplicate repeated alerts
- Group by service/namespace/cluster/root-cause candidate
- Suppress known noisy or low-signal alerts
- Correlate alerts with active incidents and graph state

### Routing rules (team/service/cluster)

**Planned routing based on**

- Team ownership
- Cluster/environment
- Namespace/service labels
- Severity
- Time windows / on-call schedules (via integrations)

### Suggested actions + links into workflows

**Planned actions**

- Open a troubleshooting investigation with alert context
- Jump to the impacted dependency graph
- Trigger a drift scan on the affected namespace
- Draft a remediation PR (where applicable)
- Create/update a Jira/PagerDuty ticket

### Noise-tuning playbook

**Recommended future playbook structure**

1. Identify the top noisy alert classes
2. Define ownership labels and routing metadata
3. Add dedup/grouping rules
4. Add suppressions with expiry
5. Track MTTA/MTTR and false-positive reduction
6. Review monthly

## Fleet management (coming soon)

### Overview

Fleet management will help platform teams standardize and operate many clusters consistently across providers, teams, and environments. This module is coming soon, and behavior/capabilities may evolve.

### Standardization (policies/baselines across clusters)

**Planned capabilities**

- Shared policy baselines
- Cluster posture templates
- Environment-specific controls
- Compliance baseline comparisons

### Bulk actions (safe modes only)

**Planned safe actions**

- Bulk scans
- Bulk policy checks
- Bulk upgrade planning
- Bulk drift posture refresh
- Bulk PR generation (with per-cluster review gates)

**Recommended constraint**

No blind bulk execution in production without approval and wave controls.

### Drift/upgrade posture dashboards

**Planned dashboards**

- Upgrade readiness across the fleet
- Drift severity distribution
- Policy compliance by cluster/team
- Risk hotspots by provider/environment

### Segmentation (by env/team/provider)

Fleet views should segment by:

- Environment (dev/stage/prod)
- Team/workspace/project
- Cloud provider
- Region
- Compliance zone
- Cluster criticality tier

## Integrations

### Kubernetes providers

**EKS**

Connect Amazon EKS clusters to Kubegrade for:

- Inventory and health visibility
- Upgrade planning and preflight checks
- Drift and troubleshooting workflows
- GitOps remediation suggestions

Document:

- Auth method(s)
- Required RBAC
- EKS-specific upgrade sequence notes
- Managed add-on compatibility considerations

**GKE**

Connect Google Kubernetes Engine clusters for:

- Cluster/workload visibility
- Troubleshooting and drift analysis
- Upgrade readiness and planning workflows

Document:

- GKE auth options
- Autopilot vs. Standard support differences (if any)
- GKE version/support caveats

**AKS**

Connect Azure Kubernetes Service clusters for:

- Upgrade readiness and execution planning
- Drift detection
- Troubleshooting and dependency mapping

Document:

- Auth and identity setup
- Node pool considerations
- AKS-specific maintenance windows / provider constraints

**OpenShift**

Connect Red Hat OpenShift clusters for:

- Cluster operations visibility
- Troubleshooting and compliance workflows
- Drift and GitOps remediation support

Document:

- OpenShift API/auth specifics
- SCC/RBAC considerations
- OpenShift Route/Operator compatibility notes

**Rancher / RKE (if applicable)**

If supported, document whether Kubegrade connects:

- Directly to downstream clusters
- Via Rancher-managed contexts
- To RKE/RKE2 clusters with standard Kubernetes APIs

Include any version limitations.

**Self-managed clusters**

Kubegrade can support upstream/self-managed Kubernetes clusters where:

- Agent installation is allowed
- Required RBAC is granted
- Network connectivity to Kubegrade endpoints exists (SaaS/hybrid)

Document supported distributions and known constraints.

### IaC & config

**Terraform**

Use the Terraform integration to:

- Compare declared infrastructure/Kubernetes config to runtime state
- Generate PRs for remediation or upgrade-prep changes
- Preserve auditable workflows via Git

The docs draft explicitly references Terraform support and imports/integrations (Archbee docs draft).

**Helm**

Use the Helm integration to:

- Inspect chart values and manifests
- Detect value drift or incompatible settings
- Generate PRs against values files / chart config
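Value-drift detection of the kind described for Helm boils down to a recursive diff between declared and live values. A minimal sketch — the sample values used in the tests are hypothetical:

```python
def value_drift(declared: dict, live: dict, path: str = "") -> list:
    """Recursively list paths where live values diverge from declared values.

    Keys missing on either side show up with value None.
    """
    diffs = []
    for key in sorted(set(declared) | set(live)):
        here = f"{path}.{key}" if path else key
        d, l = declared.get(key), live.get(key)
        if isinstance(d, dict) and isinstance(l, dict):
            diffs += value_drift(d, l, here)  # descend into nested values
        elif d != l:
            diffs.append(f"{here}: declared={d!r} live={l!r}")
    return diffs
```

The dotted paths (`resources.cpu`, etc.) map directly onto values-file keys, which is what makes a finding like this convertible into a PR against the values file.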
**Kustomize**

Use the Kustomize integration to:

- Parse overlays and environment-specific manifests
- Compare desired state vs. the live cluster
- Generate patch updates via PRs

### GitOps

**Argo CD**

Kubegrade integrates with Argo CD to:

- Map Git changes to cluster sync behavior
- Track application sync/health after PR merge
- Route remediation through existing GitOps controls

The docs draft references Argo CD support (Archbee docs draft).

**Flux**

Kubegrade integrates with Flux for:

- Git-driven remediation execution
- Rollout tracking after merge
- Drift reconciliation workflows in GitOps-first environments

### CI/CD

**GitHub Actions**

Use GitHub Actions with Kubegrade to:

- Trigger scans/checks on PRs
- Validate generated remediations
- Enforce policy checks before merge

**GitLab CI**

Use GitLab CI to:

- Run Kubegrade checks in pipeline stages
- Validate environment readiness
- Gate merges with policy/drift/upgrade checks

**Jenkins**

Use Jenkins to:

- Trigger Kubegrade workflows from existing enterprise pipelines
- Ingest outputs into release processes
- Link remediation and upgrade checks into change-management flows

The docs draft mentions the Jenkins, GitHub Actions, and GitLab CI integration ecosystem (Archbee docs draft).

### Observability & logging

**Prometheus**

Prometheus can provide metrics context for:

- Troubleshooting investigations
- Upgrade risk/health validation
- Alert correlation
- Post-change verification

The docs draft references Prometheus exports/integration (Archbee docs draft).

**Grafana**

Grafana integration can support:

- Linking dashboards to Kubegrade investigations
- Visual context during incident analysis
- Shared operational dashboards

**Datadog**

Datadog integration can support:

- Alert and monitor context
- Incident signal enrichment
- Post-remediation verification checks

The docs draft mentions Datadog support/export (Archbee docs draft).
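The "post-remediation verification checks" these observability integrations feed can be modeled as a simple threshold decision. A minimal sketch — in practice the samples would come from Prometheus or Datadog queries, here they are plain numbers, and the SLO value is an assumption:

```python
def verify_change(error_rates: list, slo: float = 0.01) -> str:
    """Classify post-change health from sampled error rates."""
    if not error_rates:
        return "inconclusive"  # no signal yet -- keep waiting
    if max(error_rates) >= slo:
        return "rollback"      # SLO breach -> candidate for a rollback PR
    return "healthy"
```

The three outcomes map onto the GitOps flow described earlier: "healthy" closes the loop, "rollback" feeds the rollback-PR triggers, and "inconclusive" keeps the change in a watching state.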
### (Others you support)

Create a generic pattern page for additional integrations documenting:

- Authentication
- Scope/permissions
- Data ingested
- Supported workflows
- Limitations
- Troubleshooting

### ChatOps & ticketing

**Slack**

Slack integration can be used to:

- Send workflow notifications
- Route approval requests
- Share incident summaries
- Link directly to Kubegrade investigations/PRs

**Microsoft Teams**

Teams integration can provide:

- Notifications for alerts/workflows
- Approval routing (if supported)
- Incident timeline sharing

**Jira**

Jira integration can support:

- Ticket creation from incidents/drift findings
- Status updates with PR/execution links
- Traceability between platform ops and change records

**PagerDuty**

PagerDuty integration can support:

- Incident/event ingestion
- Ownership and escalation routing
- Linking PagerDuty incidents to Kubegrade investigations

**Webhooks**

Webhook integration supports custom automation and system interoperability. Document:

- Event types
- Payload schema
- Signing/authentication
- Retry behavior
- IP allowlisting requirements

## Identity & access

### SSO overview

Kubegrade supports centralized identity integration for enterprise access control (where enabled). Use SSO to:

- Centralize authentication
- Enforce MFA and org policies
- Manage user lifecycle and access consistency

The draft docs reference external identity providers and access controls (Archbee docs draft).

### SAML / OIDC

Document supported modes and setup steps.

**SAML** — include:

- Metadata exchange
- ACS URL / entity ID
- Group/role mapping
- Certificate rotation

**OIDC** — include:

- Issuer URL
- Client ID/secret
- Redirect URIs
- Claims mapping
- Group/role mapping

### Role mapping best practices

- Map IdP groups to Kubegrade roles (owner/admin/developer/viewer)
- Scope groups by workspace/project where possible
- Keep production approval roles restricted
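Group-to-role mapping like this is usually "most-privileged group wins, safe default otherwise." A minimal sketch — the group names are invented for illustration; only the role names match the list above:

```python
# Hypothetical IdP-group -> Kubegrade-role mapping; most-privileged wins,
# with a safe read-only default for unmapped users.
ROLE_RANK = {"viewer": 0, "developer": 1, "admin": 2, "owner": 3}
GROUP_ROLES = {
    "platform-admins": "owner",
    "sre": "admin",
    "backend-devs": "developer",
}

def resolve_role(idp_groups: list) -> str:
    roles = [GROUP_ROLES[g] for g in idp_groups if g in GROUP_ROLES]
    return max(roles, key=ROLE_RANK.__getitem__) if roles else "viewer"
```

Falling back to `viewer` rather than denying access entirely is a design choice; in stricter environments, unmapped users should get no access at all.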
## Administration

### User management

Admins/owners can:

- Invite users
- Remove users
- Assign roles
- Manage access scope by workspace/project/cluster

**Recommended documentation**

- Invite flow
- Pending invites
- Deactivation behavior
- Access review process

### Team/role management

Document how to:

- Create teams
- Assign default roles
- Map teams to clusters/projects
- Enforce separation of duties (view vs. approve vs. execute)

### Workspace/project settings

Document settings such as:

- Naming and metadata
- Cluster associations
- Default policies
- Notification settings
- Integration mappings
- Environment labels and conventions

### Policy management

Admins should be able to:

- Create and update policies
- Assign policies by scope
- Test policy impact (dry run)
- Review policy violations and exceptions
- Track policy changes in audit logs

### Audit logs

Provide admin guidance for:

- Searching/filtering logs
- Exporting logs
- Retention and storage settings
- Linking logs to incidents/workflows/PRs

The draft docs explicitly mention compliance audit logs (Archbee docs draft).

### Billing (if relevant in docs)

If exposed, document:

- Seats/users
- Cluster-based billing units
- Usage dimensions (if any)
- Invoice access
- Billing contacts
- Plan changes and renewals

If not public yet, keep this minimal and note "contact support/sales."

## Reference

### Configuration reference

This page should be a canonical reference for:

- Global settings
- Cluster connection settings
- Agent configuration
- Network/proxy settings
- Policy defaults
- Integration configuration keys

**Format recommendation**

| Option name | Type | Default | Required? | Example | Notes |
| --- | --- | --- | --- | --- | --- |
### Agent CRDs / manifests reference

If Kubegrade installs CRDs or custom resources, document:

- CRD names and versions
- Spec fields
- Status fields
- Examples
- RBAC implications

If no CRDs are exposed yet, document install manifests/Helm values instead.

### API reference

Document:

- Authentication
- Base URLs (SaaS/on-prem)
- Rate limits
- Endpoints by resource (clusters, workflows, policies, audit logs, integrations)
- Webhook subscription endpoints
- Error codes and pagination

### Webhooks reference

Document webhook events and payloads.

**Recommended fields**

- Event type
- Event ID
- Timestamp
- Org/workspace/project
- Cluster/environment
- Severity (if applicable)
- Resource references
- Links (UI, PR, incident)

Also document:

- Signature verification
- Retries
- Idempotency guidance

### Limits & quotas

Publish practical limits to reduce support friction. Examples:

- Max clusters per workspace (plan-dependent)
- Max concurrent workflow runs
- Max integrations per workspace
- Retention periods
- Payload size limits
- API rate limits

### Release notes / changelog

Use a consistent format:

- Date
- Version
- New
- Improved
- Fixed
- Breaking changes
- Migration notes (if needed)

Include module tags (upgrade, troubleshooting, drift, AI agents, GitOps, integrations).

## FAQ

### Security FAQ

**Does Kubegrade execute changes directly in my cluster?**

Kubegrade supports controlled execution models and is designed to keep humans in the loop. In GitOps workflows, changes are proposed as PRs and applied through your GitOps system after review/merge ([kubegrade.com](https://kubegrade.com/)).

**Can Kubegrade run in private or regulated environments?**

Yes. Kubegrade publicly positions support for cloud, on-prem, and hybrid environments; the exact deployment architecture depends on your setup and requirements ([kubegrade.com](https://kubegrade.com/)).

**What permissions does the agent need?**

Permissions depend on enabled modules and whether you use read-only or execution-enabled workflows. Start with least privilege and expand only as needed.
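A least-privilege starting point for a read-only agent might look like the following. This is an illustrative sketch, not Kubegrade's documented requirement — the resource list is an assumption, and JSON output is shown because valid JSON is also valid YAML:

```python
import json

# Hypothetical least-privilege rules for a read-only agent; the exact
# resource list is an assumption, not Kubegrade's documented requirement.
READ_ONLY_RULES = [
    {"apiGroups": [""],
     "resources": ["pods", "services", "endpoints", "events", "namespaces"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"],
     "resources": ["deployments", "statefulsets", "daemonsets", "replicasets"],
     "verbs": ["get", "list", "watch"]},
]

def cluster_role(name: str) -> str:
    """Render the rules as a ClusterRole manifest (read-only verbs only)."""
    return json.dumps({
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": READ_ONLY_RULES,
    }, indent=2)
```

Execution-enabled workflows would add write verbs to a separate, narrowly scoped Role per namespace rather than widening this ClusterRole.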
**Does Kubegrade collect secrets?**

Kubegrade should not collect secret values unless explicitly required for a configured integration. Secret-handling behavior must be documented per deployment mode and integration.

### Pricing/billing FAQ (optional)

**How is Kubegrade priced?**

Document your actual pricing model only (e.g., by cluster, node, seat, or enterprise plan). If pricing is still evolving, say so and direct readers to sales.

**Is on-prem pricing different from SaaS?**

Typically yes, because support, deployment complexity, and infrastructure requirements differ. Publish only confirmed commercial policy.

**Are there limits on clusters or users?**

List plan-based limits here and link to the Limits & quotas page.

### Troubleshooting FAQ

**Why is my cluster showing as disconnected?**

Common causes:

- Agent not running
- Network egress blocked
- TLS/certificate trust issue
- Invalid credentials/token
- Proxy misconfiguration

**Why can't Kubegrade generate a PR?**

Common causes:

- Git repo not connected
- Insufficient repo permissions
- Path mapping not configured
- Policy blocked PR generation
- Unsupported repo layout edge case

**Why is drift reported for fields we don't manage?**

You may need exclusion/suppression rules for provider-managed or runtime-generated fields.

**Why is execution blocked after approval?**

A second gate may still apply:

- Maintenance window closed
- Policy check failed on re-validation
- GitOps sync restrictions
- Environment freeze window