The GPU multitenancy mess

We’re seeing an interesting infrastructure tug of war today where GPU clouds are being pulled in two directions. For the economics of AI to work, the enterprise market needs to carve expensive hardware into smaller, shareable units and hand it to customers on demand, similar to how CPUs are doled out in public cloud infrastructure. But the more the providers push GPUs to behave like elastic cloud infrastructure, the more they run into the reality that this GPU hardware was never built for safe multitenant use, fast fault recovery, or clean isolation between workloads. That tension is becoming one of the defining operational problems of the AI infrastructure market.

When a gamer launches Steam or the Epic Games Store on their laptop, they don’t have to worry about which GPU is being scheduled, how memory is going to be divided, or really any of the security boundaries or hardware assignment issues on their PC. For consumer PCs, these issues are not just hidden from view, they are irrelevant.

But for today’s IT teams managing GPU-driven AI workloads across distributed systems, those types of allocations and partitions need to be managed manually and carefully. This includes deciding which GPU to assign to each workload, how to divide memory, how to isolate tasks, and how to maximize utilization of this very expensive hardware. That’s why you heard so much about “Day 2” AI infrastructure themes around Nvidia’s GTC event this year.

The legacy hardware bottleneck

GPUs were originally developed to speed up graphics rendering, with shaders performing local compute in service of that rendering. Their design assumes a trusted computing environment in which a single application controls the device. When a user runs an application on a GPU, it accelerates that particular workload.

But while GPUs are optimized for throughput, and therefore have thousands of simple cores designed to execute the same instruction over large datasets, this design paradigm creates several major technical limitations regarding context switching and memory isolation. GPUs were designed to produce pixels, not to run sensitive AI applications from multiple tenants using the same hardware.

As a result, GPU infrastructure today behaves less like elastic cloud infrastructure and more like carefully managed hardware appliances.

The partitioning paradox

Today’s AI infrastructure requires GPUs to behave like shared, elastic cloud resources. As inference workloads begin to outstrip large-scale training runs, the ability to slice and share expensive hardware among multiple tenants in real time while maintaining acceptable fault tolerance is no longer optional.

Hardware vendors have introduced new approaches to dividing GPUs into multiple isolated compute slices. Other frameworks approach partitioning through schedulers and container runtimes. But resource partitioning is just one slice of the overall GPU operations pie.

There is currently no widely adopted, cross-vendor operating model for achieving this safely at scale. Most providers are faced with either dedicating a physical machine to a single customer (thus wasting available capacity) or accepting multitenancy security risks that currently have no known solution. The engineering challenge has moved beneath the model layer and into the infrastructure layer. Success now relies on the ability to quickly launch new workloads and to rapidly contain hardware faults so that a single GPU failure does not bring down all workloads running on the server.

Untrusted code and tenants

The vast majority of current GPU programming models rely on the idea that the driver has complete control over memory protection and that no user will act maliciously. Unfortunately, that assumption completely falls apart in a cloud environment where one VM or container can leave behind data remnants in memory that another VM or container may access. The problem is compounded by the fact that how GPUs execute code is often completely opaque.
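The remnant problem described above can be illustrated with a toy model. This is only a sketch: `ToyDeviceMemory` is a hypothetical host-side stand-in built on a `bytearray`, not a real driver or GPU API, and it assumes the same memory region is handed from one tenant to the next.

```python
class ToyDeviceMemory:
    """Host-side stand-in for a region of GPU memory (hypothetical, not a real API)."""

    def __init__(self, size: int):
        self._buf = bytearray(size)

    def write(self, data: bytes) -> None:
        self._buf[:len(data)] = data

    def read(self, n: int) -> bytes:
        return bytes(self._buf[:n])

    def release(self, scrub: bool) -> None:
        # Zero the whole region before it can be handed to another tenant.
        if scrub:
            self._buf[:] = bytes(len(self._buf))


def next_tenant_sees(scrub: bool) -> bytes:
    """Return what a second tenant reads after the first tenant releases the region."""
    mem = ToyDeviceMemory(16)
    mem.write(b"tenant-A-prompt")  # sensitive data left by the first tenant
    mem.release(scrub=scrub)
    return mem.read(15)
```

Without scrubbing, the second tenant reads the first tenant's bytes unchanged; with scrubbing it reads only zeros. Real isolation also depends on driver and hardware behavior that this toy does not model.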

A single faulty workload or driver failure in a shared GPU environment can also take down every job running on the same server, increasing the damage caused by an operational failure.

Options for runtime inspection and behavioral auditing are currently immature, limiting both visibility and control for security teams. GPU drivers present a large attack surface, and the hardware generally exposes limited telemetry. In these shared environments, embeddings, weights, prompts, and tokens are now all sensitive data points, creating significant blind spots for those attempting to protect intellectual property.

The high cost of cold starts

The real constraint in many GPU clouds is not model performance but operational efficiency. Right now, GPU operations look like 30-minute tenant spin-up times, 70% idle rates, and engineers continuously debugging infrastructure stacks. Today’s GPU clouds are stalled not due to inferior models, but because the infrastructure layer underpinning these clouds was never designed to operate at this scale.

A 30-minute cold start is a fundamental limitation on the modern AI business model. Those GPU clouds that can spin up workloads in seconds will ultimately win against those that cannot. Multitenancy is the only viable means of producing the unit economics needed to make this very expensive hardware sustainable for the long term.
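A rough back-of-the-envelope sketch shows why idle rates and cold starts dominate the unit economics. The 70% idle rate and 30-minute spin-up come from the figures above; the hourly price, the 10-minute job length, and the function names are assumptions chosen purely for illustration.

```python
import math


def cost_per_useful_gpu_hour(hourly_price: float, idle_rate: float) -> float:
    """Effective price of one busy GPU-hour when a fraction of hours sit idle."""
    if not 0.0 <= idle_rate < 1.0:
        raise ValueError("idle_rate must be in [0, 1)")
    return hourly_price / (1.0 - idle_rate)


def cold_start_overhead(cold_start_min: float, job_min: float) -> float:
    """Fraction of a tenant's occupied time spent waiting for spin-up."""
    return cold_start_min / (cold_start_min + job_min)


HOURLY_PRICE = 3.00  # hypothetical price per GPU-hour, not a quoted figure
JOB_MINUTES = 10     # hypothetical short inference job

print(round(cost_per_useful_gpu_hour(HOURLY_PRICE, 0.70), 2))
print(round(cold_start_overhead(30, JOB_MINUTES), 2))       # 30-minute spin-up
print(round(cold_start_overhead(10 / 60, JOB_MINUTES), 3))  # 10-second spin-up
```

Under these assumptions, a 70% idle rate makes each useful GPU-hour cost roughly 3.3 times the list price, and a 30-minute cold start in front of a 10-minute job means three quarters of the occupied time is spent waiting.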

Bridging the orchestration gap

Platform teams are beginning to recognize that GPUs require a specialized operating layer between the hardware and workloads. Operators need a unified operating model that supports multiple hardware vendors and GPU models. Cloud providers need a method to safely slice and share servers among tenants, and to prevent cascading failures, while rapidly launching new workloads.

Enterprises are increasingly seeking to run sensitive AI applications with stronger isolation guarantees, and thus are turning to newer categories of software designed specifically to manage the “dirty work” of hardware orchestration.

The race to make GPUs more operational

Prior to Kubernetes standardizing container orchestration, the industry was constantly debating the efficiency of container scheduling and bin packing across clusters. Those operational concerns were eventually automated and incorporated into infrastructure layers that made the complexities invisible to the end user. 
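To make the bin-packing debate concrete, here is a minimal sketch of one classic heuristic, first-fit decreasing, applied to GPU memory. It is an illustration only, not how any particular scheduler works: the `Gpu` class, the job names, and the memory sizes are hypothetical, and it packs on memory alone, ignoring compute, isolation, and topology.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    capacity_gb: int
    jobs: list = field(default_factory=list)  # (job name, memory in GB) pairs

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - sum(mem for _, mem in self.jobs)


def first_fit_decreasing(jobs: dict, gpus: list) -> list:
    """Place the largest jobs first, each on the first GPU with enough free memory.

    Returns the names of jobs that did not fit anywhere.
    """
    unplaced = []
    for job, mem in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        target = next((g for g in gpus if g.free_gb >= mem), None)
        if target is None:
            unplaced.append(job)
        else:
            target.jobs.append((job, mem))
    return unplaced


gpus = [Gpu("gpu0", 80), Gpu("gpu1", 80)]
jobs = {"a": 40, "b": 30, "c": 50, "d": 20, "e": 10}  # hypothetical memory needs in GB
leftover = first_fit_decreasing(jobs, gpus)
for gpu in gpus:
    print(gpu.name, gpu.jobs, "free:", gpu.free_gb)
```

In this example all five jobs fit on two 80 GB devices, whereas dedicating one device per job would need five. Heuristics like this were once tuned by hand for containers before schedulers absorbed them.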

A similar evolution is occurring around GPUs today. While platform teams continue to argue over placement strategies and memory tuning, these decisions will likely be automated within five to 10 years. As AI infrastructure evolves, the most valuable layer may not be the GPUs themselves but the operating layers that make them secure, elastic, and efficiently sharable. So the winners in the AI race won’t necessarily be just those with the most silicon, but those who have the best operating models for making that silicon secure and elastic.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
