K8s resize without restart, $5K/cluster hidden fees

EKS Cluster Governance: 7 IAM Condition Keys for Proactive Policy Enforcement

  • Amazon EKS added 7 new IAM condition keys scoped to CreateCluster, UpdateClusterConfig, UpdateClusterVersion, and AssociateEncryptionConfig — enabling SCP and IAM policy enforcement of cluster security posture at creation time rather than post-deployment audit.
  • Key enforceable constraints: eks:endpointPublicAccess (block public API endpoints), eks:encryptionConfigProviderKeyArns (require specific CMKs for secrets encryption), eks:kubernetesVersion (allowlist approved versions), plus eks:deletionProtection, eks:controlPlaneScalingTier, and eks:zonalShiftEnabled per the announcement.
  • The keys compose with AWS Organizations SCPs — a single deny SCP on eks:encryptionConfigProviderKeyArns blocks unencrypted cluster creation across all member accounts without per-account IAM configuration (announcement).
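
As a rough illustration of how these keys might compose into a guardrail, the sketch below builds a deny-only SCP document and prints it as JSON. Only the condition key names and action names come from the announcement. The operator types (Bool, StringNotEquals), the pairing of keys with actions, the Sid values, and the version allowlist are assumptions for illustration, so check them against the EKS service authorization reference before relying on them.

```python
import json

# Hypothetical allowlist; substitute the versions your organization approves.
APPROVED_VERSIONS = ["1.33", "1.34"]


def build_eks_guardrail_scp(approved_versions):
    """Return an SCP document (as a dict) with two deny statements.

    Assumption: eks:endpointPublicAccess behaves as a boolean key and
    eks:kubernetesVersion as a string key. Neither operator type is
    confirmed by the source text.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny creating or reconfiguring a cluster with a public endpoint.
                "Sid": "DenyPublicEksEndpoint",
                "Effect": "Deny",
                "Action": ["eks:CreateCluster", "eks:UpdateClusterConfig"],
                "Resource": "*",
                "Condition": {"Bool": {"eks:endpointPublicAccess": "true"}},
            },
            {
                # Deny any requested version that is not on the allowlist.
                "Sid": "DenyUnapprovedKubernetesVersion",
                "Effect": "Deny",
                "Action": ["eks:CreateCluster", "eks:UpdateClusterVersion"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"eks:kubernetesVersion": approved_versions}
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_eks_guardrail_scp(APPROVED_VERSIONS), indent=2))
```

Keeping the policy deny-only matches how SCPs are typically used: they cap what member accounts can do, while per-account IAM policies still grant the actual permissions.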

KServe's LLMInferenceService CRD Abstracts Prefill/Decode Topology

  • KServe introduced an LLMInferenceService CRD (~6 months in development) that generates the full disaggregated inference stack from a single spec (endpoint picker, inference scheduler, prefill pods, and decode pods), with NIXL (NVIDIA) managing NVLink/RoCEv2 inter-pod communication; vLLM 0.6 is the primary tracking target.
  • A WVA (Workload Variant Autoscaler) CRD is in development — a single object that centrally manages both KEDA and HPA with pluggable actuators and scaling signals beyond KV cache utilization, replacing per-workload autoscaler configuration (KubeCon EU roadmap).
  • LeaderWorkerSet (LWS) support replaces StatefulSet for grouped multi-node, multi-GPU worker deployments; KServe is also transitioning autoscaling from Knative to KEDA, as Knative's eventing model doesn't map cleanly to LLM inference load patterns (KubeCon EU roadmap).

K8s v1.36 Post-Stable Guides: Pod-Level Vertical Scaling Beta and Velero etcd Tooling

  • The Kubernetes blog published a pod-level vertical scaling operations guide for InPlacePodLevelResourcesVerticalScaling (beta, on by default in v1.36): CPU and memory now resize at the pod level without restart, applying changes across all containers simultaneously. This is distinct from container-level in-place resize (KEP-1287), which first shipped as alpha in v1.27.
  • A second v1.36 guide covers mutable pod resources for suspended Jobs (beta) — node selectors, tolerations, and resource requests can now be modified in-place without deleting and recreating the Job object (briefly noted in Issue #7's preview; this is the full operational breakdown).
  • Alongside the Velero CNCF sandbox donation (Issue #1), Broadcom also published etcd diagnosis and recovery tooling at github.com/vmware/etcd-diagnosis and github.com/vmware/etcd-recovery — providing structured control-plane visibility and recovery automation independent of backup tooling per InfoQ's Velero deep dive.
  • Velero's post-donation community roadmap directions include a multi-cluster backup policy control plane, CSI Data Management spec integration for pre-snapshot application quiescing, and Sigstore-signed backup artifacts; the Broadcom/Red Hat/Microsoft maintainer group adopted 5-day lazy-consensus voting for governance (InfoQ).
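
For the pod-level resize described in the first bullet above, a minimal sketch of what a resize request could look like follows. It assumes the pod-level resource block lives at spec.resources and that the change is submitted through the pod's resize subresource with kubectl patch; the pod name, namespace, and resource values are hypothetical. The script only builds the patch body and the command line, and does not contact a cluster.

```python
import json


def build_pod_resize_patch(cpu, memory):
    """Build a patch body that sets pod-level requests and limits.

    Assumption: pod-level resources are expressed under spec.resources.
    Requests and limits are set to the same values here for simplicity.
    """
    resources = {"cpu": cpu, "memory": memory}
    return {
        "spec": {
            "resources": {
                "requests": dict(resources),
                "limits": dict(resources),
            }
        }
    }


def kubectl_resize_args(pod, namespace, patch):
    """Return the argv for a kubectl patch against the resize subresource."""
    return [
        "kubectl", "patch", "pod", pod,
        "--namespace", namespace,
        "--subresource", "resize",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Hypothetical pod name, namespace, and sizes.
    patch = build_pod_resize_patch(cpu="2", memory="4Gi")
    print(" ".join(kubectl_resize_args("demo-app", "default", patch)))
```

The argv list can be passed to subprocess.run as-is, which avoids shell quoting problems with the JSON patch.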

Rancher Prime 2.13.5: Revert Chart Name to rancher, Two CVEs Patched

  • Rancher Prime 2.13.5 reverts the chart name change from v2.13.1 — all Helm install and upgrade commands must use helm install rancher rancher-prime/rancher; automation using the v2.13.1 chart name will break on upgrade per the release notes.
  • CVE-2026-25705 (path traversal enabling arbitrary file access in Rancher Extensions) and CVE-2026-41050 (Fleet Helm deployer bypassing ServiceAccount impersonation, exposing unauthorized access to Kubernetes secrets) are both patched in this release (release notes).
  • S3 snapshot retention silently resetting to 5 on RKE2/K3s cluster version upgrades is fixed. Bundled Kubernetes versions: v1.34.7 (default), v1.33.11, v1.32.13 (release notes).

Q1 2026 Cloud Infrastructure: Google Cloud +63%; Kubernetes Support Costs Shift Private Cloud Math

  • Google Cloud reached $20B in Q1 2026 revenue, up 63% YoY from $12.2B, with $6.6B operating income (203% YoY profit growth); AWS posted $37.6B at 28% YoY with $14.2B operating income; Microsoft's Intelligent Cloud segment hit $34.7B at 28% YoY, with Azure-specific services growing 39%, per the CRN earnings face-off.
  • The global cloud infrastructure market hit $129B in Q1 2026, up $35B YoY — all three providers cite AI workload demand as the primary growth driver (CRN).
  • A ReveCom analysis documents a "lag gap" of 2–7 months between CNCF upstream releases and platform GA — VCF releases within ~2 months; RHOS lags up to 6 months due to OS vertical integration. Support windows: VCF 24 months (no extra cost), RHOS 18 months, EKS/GKE 14 months, AKS 12 months.
  • Hyperscaler extended support fees exceed $5,000/cluster/year — a key cost driver alongside Gartner's forecast of 20% workload shift from public hyperscalers to local/private providers by 2026 and $80.4B in sovereign cloud spending for 2026 (Cloud Native Now).
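
To put the figures above in context, the short sketch below derives the implied prior-year market size and growth rate from the $129B and $35B numbers, and estimates a lower bound on extended support spend for a fleet. The 40-cluster fleet is a hypothetical example, and the $5,000 fee is treated as a floor because the source says fees exceed that amount. All inputs are rounded, so the results are approximate.

```python
def yoy_growth_pct(current, previous):
    """Year-over-year growth as a percentage of the previous value."""
    return (current - previous) / previous * 100


def extended_support_floor(clusters, fee_per_cluster=5000):
    """Lower bound on annual extended support spend, in dollars."""
    return clusters * fee_per_cluster


# Market figures from the bullets above, in billions of dollars.
MARKET_Q1_2026 = 129.0
MARKET_YOY_INCREASE = 35.0

# Implied Q1 2025 market size and growth rate.
market_prev = MARKET_Q1_2026 - MARKET_YOY_INCREASE
market_growth = yoy_growth_pct(MARKET_Q1_2026, market_prev)

if __name__ == "__main__":
    print(f"Implied prior-year market: ${market_prev:.0f}B")
    print(f"Implied market growth: {market_growth:.1f}%")
    # Hypothetical fleet of 40 clusters on extended support.
    print(f"Support floor for 40 clusters: ${extended_support_floor(40):,}/year")
```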
