GPU util hit 97% with one scheduler change

ACRE, LLM-d, and Kueue: GPU Infrastructure Developments from KubeCon

  • NVIDIA ACRE (AI Cluster Runtime) open-sourced — a CLI + self-hosted REST API producing version-locked, validated GPU cluster recipes that encode OS, accelerator type, workload intent, and the dependency graph of 15+ components with 250+ configuration values (switching training → inference swaps 5 components and 40+ settings). Bundles are SLSA-attested at generation and verifiable months later; Argo CD and Helm deployers, CNCF AI Conformance PR generation, and OCI image output ship out of the box — seeded with H100 and GB200 recipes for EKS and GKE, with Azure and Oracle contributing the first week (ACRE on GitHub).
  • LLM-d prefill/decode disaggregation was presented as production-ready in the project's first public architecture walkthrough (the CNCF Sandbox donation was first covered in Issue #1): Mistral AI's DisaggregatedSetOperator coordinates rolling updates between prefill and decode worker sets so that only compatible versions communicate. KV-cache-aware prefix routing in the LLM-d gateway sends prompts to GPUs that already hold matching context, eliminating round-robin cache thrashing and cutting TTFT (KubeCon keynote).
  • Wave's GPU utilization jumped from 85% to 97% across 1,000+ GPU node clusters on Azure (100K concurrent workloads at peak) after adopting Kueue for AI inference scheduling; small-team wait times fell 70%, with full production deployment in under one month and zero workload code changes. Architectural mechanism: kube-scheduler only sees pods Kueue has admitted — not the full pending set — preventing scheduler degradation during high-churn bursts (Kueue at scale keynote).
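
The admission mechanism in the last bullet can be illustrated with a toy model. This is a minimal sketch, not Kueue's API: the `Workload` and `AdmissionQueue` names, the single GPU quota, and the strict FIFO policy are all assumptions made for illustration. It shows only the idea that a downstream scheduler is handed the admitted set while the rest of the backlog stays parked in the queue.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload record: a name and a GPU request."""
    name: str
    gpus: int


class AdmissionQueue:
    """Toy quota-based admission layer (illustrative only, not Kueue's API)."""

    def __init__(self, gpu_quota: int) -> None:
        self.gpu_quota = gpu_quota
        self.gpus_in_use = 0
        self.pending: deque[Workload] = deque()
        self.admitted: list[Workload] = []

    def submit(self, workload: Workload) -> None:
        # Submitted workloads wait here; the scheduler never sees them.
        self.pending.append(workload)

    def admit(self) -> list[Workload]:
        """Admit workloads in FIFO order while the quota has room."""
        newly_admitted = []
        while self.pending and (
            self.gpus_in_use + self.pending[0].gpus <= self.gpu_quota
        ):
            workload = self.pending.popleft()
            self.gpus_in_use += workload.gpus
            self.admitted.append(workload)
            newly_admitted.append(workload)
        return newly_admitted

    def finish(self, workload: Workload) -> None:
        # Completing a workload frees its quota for the next admit() call.
        self.admitted.remove(workload)
        self.gpus_in_use -= workload.gpus

    def visible_to_scheduler(self) -> list[Workload]:
        # Only admitted workloads are handed to the downstream scheduler.
        return list(self.admitted)
```

With a quota of 8 GPUs and 100 single-GPU submissions, `admit()` releases 8 workloads and leaves 92 pending, so the scheduler's working set stays bounded by quota rather than by the size of the burst. Strict FIFO is the simplest policy to show here; it blocks on the head of the queue, which a real admission controller would not necessarily do.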

Crow, Cedar, and MCP Lifecycle Operator: New Kubernetes Platform Primitives

  • Crow, a SIG Cloud Provider sub-project co-built by AWS, Google Cloud, and Azure, auto-creates CRDs and manages multi-resource composition without custom operators or webhooks: declare a schema and resource list, and Kubernetes handles reconciliation. Approaching 1.0 with a new SAP maintainer. The same keynote previewed Cedar conditional authorization in Kubernetes: a new authorizer collapsing the current RBAC + admission policy two-step into a single Cedar policy using per-object attribute conditions (KubeCon keynote).
  • kubernetes-sigs/mcp-lifecycle-operator v0.1.0 released under SIG Apps — the first kubernetes-sigs operator for MCP Server lifecycle. The MCPServer CRD (mcp.x-k8s.io/v1alpha1) creates a Deployment + Service and exposes the cluster-internal endpoint in status.address.url. Multi-arch (amd64, arm64, s390x, ppc64le); requires Kubernetes v1.28+ (kubernetes-sigs/mcp-lifecycle-operator).
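
For clients that consume the MCPServer status, a minimal sketch of reading the endpoint from an already-fetched object follows. Only the kind, the apiVersion, and the `status.address.url` path come from the item above; the helper name `mcp_server_url`, the sample object, and its URL are hypothetical, and the object is assumed to arrive as a plain dict (for example, parsed JSON).

```python
from typing import Any, Optional


def mcp_server_url(resource: dict[str, Any]) -> Optional[str]:
    """Return status.address.url from an MCPServer object, or None if unset.

    The status block may be missing until the operator has reconciled the
    resource, so every level of the lookup tolerates absent keys.
    """
    if resource.get("kind") != "MCPServer":
        raise ValueError(f"expected kind MCPServer, got {resource.get('kind')!r}")
    status = resource.get("status") or {}
    address = status.get("address") or {}
    return address.get("url")


# Hypothetical object for illustration; name and URL are made up.
example = {
    "apiVersion": "mcp.x-k8s.io/v1alpha1",
    "kind": "MCPServer",
    "metadata": {"name": "demo"},
    "status": {"address": {"url": "http://demo.default.svc.cluster.local:8080"}},
}
```

Returning `None` instead of raising on a missing status lets a caller poll until the endpoint appears.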

AKS April 2026: H100 MIG and KSCR Default, Ubuntu 22.04 Retirement Dates Set

  • 10 AKS features reached GA in April, including MIG profiles for H100 partitioning, VPA with Recreate update mode, Istio gateway proxy pods, Premium SSD v2 built-in StorageClass, AKS Managed API Server Guard, and Azure Monitor for Arc-enabled Kubernetes. KSCR is now enabled by default (previously opt-in); Azure CNI Powered by Cilium adds a new cilium-fluent-bit sidecar, a behavioral change that ships without a feature flag (AKS April newsletter).
  • Ubuntu 22.04 node images now have official retirement dates: new nodes are blocked after June 30, 2027, and scaling and remediation fail after April 30, 2028. Migrate to Ubuntu 24.04. Azure Linux 2.0 already reached EOL on March 31, 2026; Azure Linux 3.0+ is the current supported path (AKS retirement issue).
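
The two Ubuntu 22.04 cutoffs can be encoded for a migration checklist or CI guard. This is a small sketch: the dates are the ones quoted above, while the function name and phase labels are made up for illustration, and "after" is read as strictly later than the stated date.

```python
from datetime import date

# Cutoff dates as quoted in the retirement notice above.
NEW_NODES_BLOCKED_AFTER = date(2027, 6, 30)
SCALING_FAILS_AFTER = date(2028, 4, 30)


def ubuntu_2204_phase(today: date) -> str:
    """Classify a date against the Ubuntu 22.04 node image retirement timeline."""
    if today <= NEW_NODES_BLOCKED_AFTER:
        return "supported"
    if today <= SCALING_FAILS_AFTER:
        return "new-nodes-blocked"
    return "scaling-and-remediation-fail"
```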

Azure Linux Rebasing on Fedora; Terraform 1.15 Stable Imminent

  • Microsoft is reportedly rebasing Azure Linux on Fedora, according to Fedora ELN SIG meeting logs: "Azure wants to rebase Azure Linux more or less on Fedora and they need x86_64-v3 for performance." The proposal targets x86_64-v3 packages for Fedora 45 (mandating AVX2, FMA, and BMI2, with projected 10–40% performance gains for cloud workloads); Microsoft has offered compute donations for Fedora build infrastructure. No AKS timeline has been confirmed (Phoronix, April 25).
  • Terraform v1.15.0-rc3 adds Windows ARM64 production builds — three clean release candidates with no blocking issues; stable is imminent. RC1 features (dynamic prevent_destroy, deprecated on variables/outputs, convert() function, S3 backend AWS Identity Center auth) were covered in Issue #7 (Terraform releases).
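
For the x86_64-v3 item above, a quick way to see whether a Linux node advertises the three instruction-set extensions named there is to inspect the CPU flags line in /proc/cpuinfo. The sketch below checks only AVX2, FMA, and BMI2; the full x86_64-v3 level requires additional features, so a pass here is necessary but not sufficient. The function names are illustrative.

```python
# Flags named in the item above, as they appear in /proc/cpuinfo on Linux.
REQUIRED_FLAGS = {"avx2", "fma", "bmi2"}


def missing_v3_flags(cpu_flags: str) -> set[str]:
    """Return the required flags absent from a space-separated flags string."""
    present = set(cpu_flags.lower().split())
    return REQUIRED_FLAGS - present


def read_linux_cpu_flags(path: str = "/proc/cpuinfo") -> str:
    """Return the flags of the first CPU listed in /proc/cpuinfo (Linux x86 only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""
```

On a node, `missing_v3_flags(read_linux_cpu_flags())` returns an empty set when all three extensions are present.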
