Kubernetes 1.36: Fix Your 95% GPU Waste Problem

Your Kubernetes Cluster Is Wasting 95% of Its GPU Budget

Kubernetes 1.36 and new multi-cloud moves from AWS and Google are rewriting AI infrastructure economics.

Enterprises are spending billions on AI infrastructure, and new data shows 95 cents of every dollar spent on GPU compute is going to waste. Here is what has changed, what the data shows, and what you need to do about it before your next budget cycle.

The GPU Utilization Catastrophe Nobody Is Talking About

Cast AI's 2026 State of Kubernetes Optimization Report landed with a number that should alarm every executive signing off on AI infrastructure budgets: average GPU utilization on Kubernetes clusters sits at just 5%.

Five percent.

While cloud providers are raising GPU prices, with AWS increasing H200 Capacity Block pricing by 15% in January 2026, enterprises are leaving 95% of that increasingly expensive compute sitting idle. Global AI infrastructure spending hit $82 billion in a single quarter in 2025. The gap between what organizations are spending and what they are using has never been wider.

This is not a technology problem. It is an orchestration problem. And Kubernetes, with Dynamic Resource Allocation now generally available as of version 1.34, is finally providing the tools to fix it at scale.

What Kubernetes 1.34 Changes for AI Workloads

The Kubernetes project graduated Dynamic Resource Allocation to General Availability in version 1.34, and the most consequential changes are squarely aimed at AI and GPU workloads.

For years, Kubernetes treated GPUs as blunt instruments. You either had the whole GPU or you did not. Dynamic Resource Allocation allows the scheduler to understand the specific requirements of a GPU or AI accelerator at a granular level, enabling GPU sharing across multiple workloads, fractional GPU allocation, and multi-node AI job coordination that previously required custom operators and significant engineering overhead.
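As a rough illustration of what a DRA request can look like, the sketch below builds a ResourceClaimTemplate and a Pod that consumes it as Python dictionaries, which can be dumped to JSON and applied with kubectl. The device class name gpu.example.com, the object names, and the container image are placeholders rather than values from this article; the real device class depends on which DRA driver is installed in the cluster.

```python
import json

# Placeholder device class; the real name comes from the installed DRA driver.
DEVICE_CLASS = "gpu.example.com"


def gpu_claim_template(name: str, device_class: str = DEVICE_CLASS) -> dict:
    """Build a ResourceClaimTemplate that requests one device of the given class."""
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {"name": "gpu", "exactly": {"deviceClassName": device_class}}
                    ]
                }
            }
        },
    }


def pod_using_claim(pod_name: str, template_name: str, image: str) -> dict:
    """Build a Pod whose single container consumes a claim made from the template."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            # Pod-level claim, generated per pod from the template.
            "resourceClaims": [
                {"name": "gpu", "resourceClaimTemplateName": template_name}
            ],
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # The container references the pod-level claim by name.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
        },
    }


if __name__ == "__main__":
    template = gpu_claim_template("single-gpu")
    pod = pod_using_claim("train-job", "single-gpu", "registry.example.com/trainer:latest")
    print(json.dumps(template, indent=2))
    print(json.dumps(pod, indent=2))
```

The point of the shape is that the workload asks for a class of device rather than a fixed count of whole GPUs, which is what lets a driver offer shared or partitioned devices behind the same request.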

NVIDIA formalized this direction at KubeCon EU 2026 by donating its Dynamic Resource Allocation driver for GPUs to the Kubernetes community. This is a significant signal: the largest GPU vendor in the world is now treating native Kubernetes GPU scheduling as the standard path, not a workaround. Google donated the DRA driver for TPUs at the same event.

Security defaults are also tightening in Kubernetes 1.34. User Namespaces and Mutating Admission Policies reach General Availability. For enterprises running AI workloads, this matters because AI jobs often require elevated privileges that create security exposure. With User Namespaces, root inside the container maps to an unprivileged user on the host, which removes a major attack surface without requiring application changes.
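For a sense of how small the opt-in is, here is a minimal sketch of a Pod manifest that requests its own user namespace through the hostUsers field in the pod spec. The pod name and image are placeholders, and whether the setting takes effect still depends on node and container runtime support, which the sketch does not check.

```python
import json


def pod_with_user_namespace(name: str, image: str) -> dict:
    """Build a Pod manifest that runs in its own user namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # False asks Kubernetes not to share the host's user namespace,
            # so UID 0 in the container maps to an unprivileged host UID.
            "hostUsers": False,
            "containers": [{"name": "main", "image": image}],
        },
    }


if __name__ == "__main__":
    manifest = pod_with_user_namespace("inference-worker", "registry.example.com/inference:latest")
    print(json.dumps(manifest, indent=2))
```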

What Production Case Studies Show

Advanced GPU scheduling on Kubernetes can push utilization from 13% to 37%, a near tripling, and some implementations are exceeding 80%.

At enterprise GPU spending levels, the difference between 5% and 40% utilization is measured in millions of dollars annually. A 20 GPU cluster moving from 5% to 37% utilization through DRA implementation, MIG partitioning, and autoscaling recovers approximately $420,000 per year in GPU infrastructure costs without purchasing additional capacity or reducing training throughput.
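The article does not show how the $420,000 figure is derived, so the sketch below is one plausible reading rather than the original calculation: it values the GPU-hours that move from idle to busy at an assumed flat hourly rate. The $7.50 per GPU-hour rate is an assumption picked because it lands close to the quoted number; substitute your own blended rate.

```python
HOURS_PER_YEAR = 8760


def recovered_value(gpus: int, hourly_rate: float,
                    util_before: float, util_after: float) -> float:
    """Annual value of GPU-hours that go from idle to busy.

    Utilization values are fractions between 0 and 1.
    """
    annual_spend = gpus * HOURS_PER_YEAR * hourly_rate
    return annual_spend * (util_after - util_before)


if __name__ == "__main__":
    # Assumed rate of $7.50 per GPU-hour; not a figure from the article.
    value = recovered_value(gpus=20, hourly_rate=7.50,
                            util_before=0.05, util_after=0.37)
    print(f"${value:,.0f} per year")  # roughly $420,000
```

Under these assumptions the cluster costs about $1.3 million a year to run, and the 32-point utilization gain accounts for roughly a third of that spend.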

The organizations achieving 40% and above utilization are running the standard optimization stack: DRA for intelligent allocation, MIG for hardware-level partitioning on H100 and H200 GPUs, time slicing for development workloads, and Karpenter or Cluster Autoscaler for dynamic node lifecycle management. The stack is well documented and available on every major managed Kubernetes platform.

The Multi-Cloud Layer Is Shifting Too

GPU optimization is not the only structural shift affecting AI infrastructure this year. AWS previewed a new cross-cloud connectivity service with Google Cloud as its first launch partner, signaling that the era of hyperscaler walled gardens is cracking open.

Google Cloud Next 2026 unveiled cross-cloud caching and a cross-cloud data lakehouse built on Apache Iceberg, letting AI agents access data regardless of which cloud it lives on, reducing egress costs and simplifying multi-cloud data architectures.

Inference now accounts for roughly two-thirds of all AI compute in production environments. Industry data shows inference can represent 80 to 90% of the lifetime cost of a production AI system. The organizations winning on AI economics are the ones optimizing inference infrastructure, and that means getting Kubernetes GPU scheduling right first.

Why GPU Hoarding Happens and How to Stop It

The 5% average utilization is primarily a resource planning and incentive problem. Teams that have experienced GPU scarcity during a critical training run remember the pain of waiting for capacity. The rational response is to reserve more than you currently need to avoid being caught short in the future.

The organizational fix requires addressing both the supply side and the demand side of the hoarding dynamic. On the supply side, fast capacity provisioning through DRA and autoscaling removes the rational basis for hoarding. If teams can get additional GPU capacity within minutes when they need it, the incentive to hold idle capacity disappears. On the demand side, visibility into your own utilization numbers, published regularly to your team, creates accountability that drives behavioral change without requiring policy enforcement.

What Executives Should Do Now

Audit your GPU utilization immediately. If you do not have a number, assume it is near 5%. Run Cast AI, KubeGPU, or a comparable observability tool against your clusters this week.

Evaluate upgrading to Kubernetes 1.34. The DRA enhancements alone justify the upgrade cycle for any organization running GPU workloads. Schedule the migration before Q3 AI budget reviews.

Revisit your GPU procurement strategy. At 5% utilization, buying more GPUs is almost certainly the wrong answer. Scheduling optimization should come before any new capacity purchase.

Build a multi-cloud data strategy. The AWS and Google connectivity preview and Google's cross-cloud lakehouse are early signals of where the market is heading. Organizations without a multi-cloud data architecture are building technical debt today.

Separate training from inference infrastructure. Training and inference have fundamentally different resource profiles. Kubernetes can manage both natively with DRA, but only if your cluster topology and scheduling policies are designed for the distinction.
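The first recommendation above, auditing utilization, comes down to averaging per-GPU utilization samples over a window and seeing which devices sit idle. The sketch below assumes you have already exported such samples as percentages, for example from a GPU metrics exporter scraped by your monitoring system; the input layout, the GPU identifiers, and the 10% idle threshold are illustrative choices, not part of any tool named in this article.

```python
from statistics import mean


def fleet_utilization(samples: dict[str, list[float]],
                      idle_threshold: float = 10.0) -> dict:
    """Summarize utilization samples (0-100) keyed by GPU identifier."""
    # Average each GPU over the window, skipping GPUs with no samples.
    per_gpu = {gpu: mean(vals) for gpu, vals in samples.items() if vals}
    fleet_avg = mean(per_gpu.values()) if per_gpu else 0.0
    mostly_idle = sorted(g for g, u in per_gpu.items() if u < idle_threshold)
    return {"fleet_avg": fleet_avg, "per_gpu": per_gpu, "mostly_idle": mostly_idle}


if __name__ == "__main__":
    window = {
        "node-a/gpu0": [0, 5, 10],
        "node-a/gpu1": [80, 90, 70],
        "node-b/gpu0": [0, 0, 0],
    }
    summary = fleet_utilization(window)
    print(f"fleet average: {summary['fleet_avg']:.1f}%")
    print("mostly idle:", ", ".join(summary["mostly_idle"]))
```

Averaging per GPU first and then across the fleet keeps a device with many samples from outweighing one with few.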

The Organizational Change Required to Sustain GPU Efficiency

Technical optimization of GPU scheduling is necessary but not sufficient for sustained efficiency improvements. The organizational behaviors that created 5% average utilization in the first place will recreate it on top of any technical improvements if they are not addressed directly.

Platform teams that have successfully sustained GPU efficiency improvements beyond the initial optimization project share three characteristics. They publish weekly utilization reports by team and by project. They have a fast allocation process that removes the rational basis for hoarding. And they have executive sponsorship that frames GPU costs as a shared resource requiring shared stewardship rather than a per-team budget line that teams optimize individually.

The publication cadence matters more than the precision of the metrics. Teams that receive weekly visibility into their own GPU utilization relative to their allocation consistently demonstrate behavioral change within two to three reporting cycles. The visibility creates accountability that policy enforcement struggles to achieve, because engineers respond to data about their own behavior when the data is presented without blame and the path to improvement is clear.
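A weekly report of this kind does not need to be elaborate. As a minimal sketch, assuming you can obtain allocated and used GPU-hours per team from your own accounting (the TeamUsage record and its field names are hypothetical), the following ranks teams by how much of their allocation they actually used:

```python
from dataclasses import dataclass


@dataclass
class TeamUsage:
    team: str
    allocated_gpu_hours: float  # GPU-hours reserved for the team this week
    used_gpu_hours: float       # GPU-hours the team's workloads kept busy

    @property
    def utilization(self) -> float:
        if self.allocated_gpu_hours <= 0:
            return 0.0
        return self.used_gpu_hours / self.allocated_gpu_hours


def weekly_report(rows: list[TeamUsage]) -> list[str]:
    """Return one line per team, lowest utilization first."""
    ordered = sorted(rows, key=lambda r: r.utilization)
    return [
        f"{r.team}: {r.utilization:.0%} of "
        f"{r.allocated_gpu_hours:.0f} allocated GPU-hours used"
        for r in ordered
    ]


if __name__ == "__main__":
    usage = [TeamUsage("vision", 500, 200), TeamUsage("search", 1000, 50)]
    for line in weekly_report(usage):
        print(line)
```

Reporting utilization against each team's own allocation, rather than against the whole fleet, keeps the number something the team can act on.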

The fast allocation process addresses the root cause of hoarding: the memory of scarcity. Teams that have experienced GPU unavailability during a critical training run will provision defensively for months afterward, regardless of current availability. Organizations that implement SLO-backed GPU allocation — guaranteeing that additional capacity is available within a defined time window — see hoarding behavior disappear faster than those that rely on trust alone.
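An allocation SLO only builds trust if it is measured. One simple way to track it, sketched here under the assumption that you log how long each GPU request waited before capacity was granted, is the fraction of requests served within the target window. The 10-minute target and the sample wait times are made up for illustration.

```python
def slo_attainment(wait_minutes: list[float], target_minutes: float) -> float:
    """Fraction of GPU requests fulfilled within the target wait time."""
    if not wait_minutes:
        return 1.0  # no requests means nothing missed the target
    met = sum(1 for wait in wait_minutes if wait <= target_minutes)
    return met / len(wait_minutes)


if __name__ == "__main__":
    waits = [2, 4, 12, 3, 7]  # minutes each request waited (illustrative)
    print(f"{slo_attainment(waits, target_minutes=10):.0%} within target")
```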

The Multi-Cloud GPU Economics Opportunity

The spot and preemptible GPU instance discounts available across AWS, Azure, and GCP represent a significant cost reduction opportunity for organizations whose training and batch inference workloads can tolerate interruption. Spot GPU instances are available at 60 to 90% discounts versus on-demand pricing. For workloads designed with checkpointing and restart capability, the effective GPU cost can be reduced to a fraction of list price. Combined with the utilization improvements from DRA scheduling, the total economics of GPU infrastructure for AI workloads can improve by 80 to 90% relative to the current fleet average without any reduction in training throughput or inference capacity.
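Interruptions are not free, so the headline discount overstates the saving somewhat. A minimal sketch of the adjustment, assuming a fixed share of spot compute is lost to work redone after the last checkpoint (the rework fraction, the $10 on-demand rate, and the 70% discount are all illustrative assumptions):

```python
def effective_spot_rate(on_demand_rate: float, discount: float,
                        rework_fraction: float) -> float:
    """Cost per hour of useful work on spot capacity.

    discount: spot discount versus on-demand, as a fraction (0.7 means 70% off).
    rework_fraction: share of spot hours spent redoing work lost to interruptions.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    if not 0 <= rework_fraction < 1:
        raise ValueError("rework_fraction must be in [0, 1)")
    spot_rate = on_demand_rate * (1 - discount)
    # Only (1 - rework_fraction) of each paid hour produces new progress.
    return spot_rate / (1 - rework_fraction)


if __name__ == "__main__":
    rate = effective_spot_rate(on_demand_rate=10.0, discount=0.7, rework_fraction=0.1)
    print(f"${rate:.2f} per useful GPU-hour versus $10.00 on demand")
```

With these inputs the effective rate is about a third of the on-demand price, so frequent checkpointing matters: the lower the rework fraction, the more of the discount survives.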

Why Kubernetes Cost Reduction Starts with Visibility

Most organizations trying to reduce Kubernetes spend run into the same wall: they lack the per-workload cost attribution needed to justify changes. Without knowing which team or service owns a given set of pods, optimization conversations stall at the team boundary. ITSulu's approach combines open-source tools like Kubecost or OpenCost with custom dashboards that surface per-namespace, per-deployment, and per-team cost data. Once teams can see what their workloads actually cost, rightsizing decisions happen organically. Budget conversations shift from guesswork to evidence-based negotiation. Visibility is not just a reporting function — it is the prerequisite for every other optimization strategy. Organizations that establish cost observability first consistently achieve higher and more durable savings than those who jump straight to autoscaling or spot instances.
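At its core, per-team attribution is a group-by over labeled workloads. The sketch below is not the Kubecost or OpenCost API; it assumes hypothetical pod records that already carry labels, GPU-hours, and an hourly rate, and shows how spend rolls up by a label such as team, with unlabeled pods surfaced explicitly.

```python
from collections import defaultdict


def cost_by_label(pods: list[dict], label: str) -> dict[str, float]:
    """Sum GPU cost per value of the given label.

    Each pod record is expected to have "labels" (dict), "gpu_hours",
    and "hourly_rate"; pods missing the label go to "unattributed".
    """
    totals: dict[str, float] = defaultdict(float)
    for pod in pods:
        owner = pod.get("labels", {}).get(label, "unattributed")
        totals[owner] += pod["gpu_hours"] * pod["hourly_rate"]
    return dict(totals)


if __name__ == "__main__":
    records = [
        {"labels": {"team": "search"}, "gpu_hours": 100, "hourly_rate": 4.0},
        {"labels": {"team": "vision"}, "gpu_hours": 50, "hourly_rate": 4.0},
        {"labels": {}, "gpu_hours": 25, "hourly_rate": 4.0},
    ]
    for owner, cost in sorted(cost_by_label(records, "team").items()):
        print(f"{owner}: ${cost:,.2f}")
```

The size of the unattributed bucket is itself a useful number: it shows how much spend cannot yet be assigned to an owner.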

How ITSulu Can Help

The gap between having a Kubernetes cluster and having a Kubernetes cluster optimized for AI workloads is where most enterprise value is being lost right now. ITSulu's Automated Kubernetes Operations practice works with infrastructure and platform engineering teams to close that gap: auditing current GPU utilization, implementing DRA-based scheduling, hardening security posture to Kubernetes 1.34 defaults, and designing multi-cloud connectivity strategies.

If your organization is heading into a budget cycle with AI infrastructure spend that is not performing at the level your investment warrants, that conversation is worth having before the numbers get locked in.

Contact ITSulu today to schedule a consultation.

The GPU economics of AI infrastructure are moving against organizations that delay. Hardware prices are rising, AI model requirements are growing, and the organizations that build efficient GPU utilization practices now will have a meaningful cost structure advantage over those that wait.
