Fixed Capacity Spatial Partition (FCSP): GPU Resource Isolation Framework for Multi-Tenant ML Workloads

Dec 3, 2025 | By Bud Ecosystem

GPU sharing in multi-tenant cloud environments requires efficient resource isolation without sacrificing performance. We present FCSP (Fixed Capacity Spatial Partition), a user-space GPU virtualization framework that achieves sub-microsecond memory enforcement and deterministic compute throttling through lock-free data structures and hierarchical token bucket rate limiting. Unlike existing solutions that rely on semaphore-based synchronization, FCSP employs C11 atomics with cache-line-aligned structures to eliminate contention bottlenecks. Our comprehensive evaluation using the GPU-Virt-Bench benchmark suite demonstrates that FCSP achieves 1000X faster context creation (78μs vs. 84ms), 3600X faster memory limit enforcement (0.3μs vs. 1.1ms), and 3X better multi-tenant isolation compared to HAMi-core, the current state-of-the-art open-source GPU sharing solution. For large language model (LLM) inference workloads, FCSP enables 2X higher tenant density while maintaining <5% performance degradation targets, translating to potential infrastructure cost savings of $14M annually for a 1000-GPU deployment.


The proliferation of GPU-accelerated machine learning workloads has created unprecedented demand for efficient GPU resource sharing in cloud environments. Modern data centers deploy thousands of high-end GPUs (NVIDIA A100, H100) to serve diverse workloads including large language model (LLM) inference, training, and batch processing. However, the high cost of these accelerators ($10,000-$40,000 per unit) necessitates maximizing utilization through multi-tenancy.

Hardware-based virtualization: NVIDIA's Multi-Instance GPU (MIG) splits a GPU into as many as seven isolated instances with dedicated memory and compute resources, providing strong hardware-level isolation but requiring a full GPU reset to reconfigure. NVIDIA vGPU offers hypervisor-based GPU virtualization for virtual machines but requires enterprise licensing and adds significant overhead for containerized workloads. SR-IOV enables PCI passthrough for VMs but is limited to virtual machine environments and is incompatible with containers.

Software-based virtualization: Time-slicing and NVIDIA's Multi-Process Service (MPS) do not provide memory isolation, and their fairness depends on application behavior. HAMi/HAMi-core, which combines LD_PRELOAD-based interception with semaphore-coordinated shared memory, is an industry-standard open-source virtualization approach. KubeShare is a Kubernetes-native GPU sharing system built on a similar interception approach.

Limitations of Existing Software-Based Virtualization Methods

Our analysis of HAMi-core reveals fundamental architectural limitations:

Contention Bottleneck

HAMi-core uses a single POSIX semaphore to protect the shared memory region. Under multi-tenant load, this creates severe contention:
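In simplified form, the pattern looks like the sketch below (an illustration of the semaphore-guarded approach, not HAMi-core's actual source): every allocation path serializes on one global semaphore before touching the shared accounting state.

```c
/* Illustrative sketch of a semaphore-guarded accounting path.
 * All tenant processes on the node queue on the same named semaphore,
 * so the critical section below is a global serialization point. */
#include <semaphore.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t used_bytes[1024];   /* per-process usage slots */
} shared_region_t;

void account_alloc(sem_t *global_sem, shared_region_t *region,
                   int my_slot, size_t bytes)
{
    sem_wait(global_sem);                    /* every tenant blocks here under load */
    region->used_bytes[my_slot] += bytes;    /* update shared accounting state */
    sem_post(global_sem);
}
```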



In our measurements, contention on this semaphore pushes P99 enforcement latency to 94ms, which is catastrophic for real-time inference workloads with <100ms SLA requirements.

O(N) Process Scanning

Every memory allocation triggers a linear scan of all process slots to calculate aggregate usage:
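The aggregation looks roughly like the following sketch (illustrative, not HAMi-core's actual source):

```c
/* Illustrative sketch of O(N) aggregation over process slots.
 * With MAX_SLOTS = 1024, every allocation pays for a full walk of the array. */
#include <stdint.h>
#include <sys/types.h>

#define MAX_SLOTS 1024

typedef struct {
    pid_t    pid;          /* 0 = slot unused */
    uint64_t used_bytes;
} proc_slot_t;

uint64_t aggregate_usage(const proc_slot_t slots[MAX_SLOTS])
{
    uint64_t total = 0;
    for (int i = 0; i < MAX_SLOTS; i++) {   /* linear scan on every allocation */
        if (slots[i].pid != 0)
            total += slots[i].used_bytes;
    }
    return total;
}
```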



With N=1024 process slots, this scan dominates allocation latency.

Feedback-Driven Rate Limiting

HAMi-core’s compute throttling relies on NVML polling to adjust token refill rates:
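Conceptually the control loop looks like the sketch below. The 100ms polling period and the adjustment step are illustrative assumptions; nvmlDeviceGetUtilizationRates is the standard NVML utilization query.

```c
/* Sketch of a feedback-driven limiter: poll NVML utilization and nudge
 * the token refill rate up or down. Illustrative only; the lag between a
 * limit violation and the next adjustment is where the enforcement delay
 * comes from. */
#include <nvml.h>
#include <stdatomic.h>
#include <unistd.h>

void feedback_loop(nvmlDevice_t dev, unsigned int target_util_pct,
                   _Atomic long *refill_rate)
{
    for (;;) {
        nvmlUtilization_t util;
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS) {
            long rate = atomic_load(refill_rate);
            if (util.gpu > target_util_pct)
                atomic_store(refill_rate, rate - rate / 10);      /* throttle harder */
            else if (util.gpu < target_util_pct)
                atomic_store(refill_rate, rate + rate / 10 + 1);  /* relax */
        }
        usleep(100 * 1000);   /* polling period: violations persist until the
                                 next adjustment lands */
    }
}
```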



This feedback loop introduces latency (up to 120ms) between limit violations and enforcement, allowing temporary oversubscription.

Context Creation Overhead

HAMi-core initialization involves shared region setup, NVML enumeration, and process registration:
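A simplified view of the steps involved is sketched below; the names and exact sequence are assumptions for illustration, not HAMi-core's actual code.

```c
/* Sketch of a semaphore-coordinated initialization path. Shared-region
 * setup, NVML enumeration, and slot registration all complete before the
 * first CUDA call proceeds, which is where the cold-start overhead accrues. */
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <nvml.h>

#define REGION_SIZE (1 << 20)   /* illustrative region size */

int init_shared_state(void)
{
    /* 1. Named semaphore + shared region setup (cross-process handshake). */
    sem_t *sem = sem_open("/gpu_share_lock", O_CREAT, 0666, 1);
    if (sem == SEM_FAILED)
        return -1;

    int fd = shm_open("/gpu_share_region", O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) != 0)
        return -1;
    void *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED)
        return -1;

    /* 2. NVML attach and device enumeration. */
    if (nvmlInit_v2() != NVML_SUCCESS)
        return -1;
    unsigned int device_count = 0;
    nvmlDeviceGetCount_v2(&device_count);

    /* 3. Process registration under the global semaphore. */
    sem_wait(sem);
    /* ... find a free slot, record our pid and configured limits ... */
    sem_post(sem);

    return 0;
}
```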



The 84ms overhead is negligible for long-running processes but prohibitive for serverless and auto-scaling deployments where cold starts must complete in <500ms.

Fixed Capacity Spatial Partition

Fixed Capacity Spatial Partition (FCSP) is a software-based GPU resource isolation framework designed for the unique requirements of modern ML workloads. Its key features include:

  1. Lock-free shared memory architecture: A novel inter-process coordination mechanism using C11 atomics that eliminates the P99 latency spikes (94ms) observed in semaphore-based approaches.
  2. Hierarchical per-stream rate limiting: A two-tier token bucket algorithm that provides per-stream compute isolation while maintaining device-level fairness guarantees.
  3. Crash-resilient process management: A heartbeat-based reaper pattern that automatically recovers resources from crashed processes without requiring administrator intervention.
  4. Stream-aware throttling with NCCL bypass: Intelligent workload classification that preserves collective communication performance while enforcing compute limits on regular kernels.

FCSP is architected around the following design goals:

  1. Sub-microsecond memory enforcement: Memory limit checks must complete in <1μs to avoid impacting allocation-heavy workloads like KV cache management.
  2. Lock-free hot paths: Common operations (allocation tracking, kernel throttling) must not acquire locks, enabling O(1) latency regardless of tenant count.
  3. Deterministic compute limiting: Rate limiting should use a predictable mathematical model rather than feedback control, eliminating limit violation transients.
  4. Crash resilience: Process failures must not leak resources or corrupt shared state.
  5. LLM workload optimization: Special handling for attention patterns, KV cache allocation, and NCCL collective operations.

FCSP Architecture

FCSP operates as a shared library loaded via LD_PRELOAD that interposes on CUDA Driver API and NVML calls. Figure 1 illustrates the architecture:

FCSP is implemented as a user-space interposition layer that sits between ML applications (e.g., PyTorch, vLLM) and the NVIDIA CUDA driver. The architecture is intentionally layered so that no kernel modules, driver modifications, or hardware partitioning are required. Instead, FCSP enforces multi-tenant GPU isolation by intercepting GPU API calls, applying policy decisions, and then forwarding allowed operations to the native CUDA stack.

At a high level, GPU usage flows through the system as:

Application → CUDA Runtime (libcudart.so) → FCSP (libvgpu_v2.so) → CUDA Driver (libcuda.so) → NVIDIA GPU

FCSP is typically injected using LD_PRELOAD, enabling it to transparently hook relevant CUDA/NVML entry points used by frameworks and inference servers without requiring application code changes.

Application Layer

This layer contains the tenant workloads—training/inference processes that allocate device memory, create streams, and launch kernels. In a multi-tenant environment, these processes are not mutually aware and may compete aggressively for GPU memory and compute, resulting in unpredictable interference without an enforcement layer.

CUDA Runtime (libcudart.so)

The CUDA runtime library is the conventional interface used by most frameworks. It provides high-level API behavior (e.g., cudaMalloc, stream creation, synchronization) and eventually routes work to the CUDA driver. FCSP does not replace the runtime; instead it interposes beneath it, ensuring all runtime-driven GPU actions are still subject to policy enforcement.

FCSP Interposition Layer (libvgpu_v2.so)

FCSP is the control plane for GPU sharing. It intercepts critical GPU operations and applies resource isolation logic before allowing them to proceed. The FCSP layer is composed of four cooperating modules:

1) Memory Tracker

The Memory Tracker provides fast, deterministic memory accounting across all tenant processes on the same node. It intercepts allocation/free operations and maintains:

  • per-process memory usage,
  • per-GPU global usage,
  • allocation metadata for correct deallocation attribution.

Enforcement is designed for high concurrency: allocations are admitted or rejected using atomic updates rather than global locks, preventing contention-driven latency spikes.

2) Kernel Rate Limiter

The Kernel Rate Limiter enforces compute-level isolation by controlling kernel launch admission. Instead of relying on slow utilization feedback loops, FCSP uses a rate-based model (e.g., token buckets / hierarchical token buckets) to regulate the rate at which kernels are issued onto the GPU. This converts “best-effort sharing” into a predictable mechanism that limits noisy neighbors and stabilizes tail latency under dense tenancy.

3) Stream Classifier

GPU streams represent independent submission queues, and different streams often correspond to fundamentally different kinds of work (compute, copies, collectives). The Stream Classifier identifies and labels important stream categories—especially communication-focused streams such as NCCL—so FCSP can apply appropriate policies. For example, NCCL streams may be excluded or lightly treated by compute throttling to avoid inducing distributed synchronization collapse (where throttling one rank’s collectives degrades the entire job).

4) Process Manager

The Process Manager controls tenant slot lifecycle and robustness:

  • registers processes into shared accounting slots,
  • maintains liveness via heartbeats,
  • reclaims resources when a process crashes or is killed.

This prevents “ghost allocations” and stale state from permanently degrading GPU capacity after unexpected failures.

Lock-Free Shared Memory Region (Coordination Backbone)

All FCSP modules coordinate via a lock-free shared memory region that is mapped into every participating process:

  • mmap’d: shared across processes for a consistent node-wide view of resource consumption
  • mlock’d: pinned in RAM to avoid paging delays in the enforcement hot path
  • 4KB-aligned: page-aligned for predictable mapping and memory behavior
  • cache-line padded: reduces false sharing and coherence traffic under high update rates
  • lock-free (atomics-based): avoids global semaphores/mutex hot spots, improving tail latency and scaling with tenant count

This region stores the minimal but sufficient global state needed for enforcement:

  • per-GPU memory totals,
  • per-process usage counters,
  • rate limiter state,
  • heartbeat timestamps and slot ownership signals.

Crucially, FCSP is designed so that the common enforcement path remains O(1) per operation (no scanning across all tenants), which is essential when many processes allocate memory and launch kernels concurrently.

CUDA Driver (libcuda.so) and NVIDIA GPU

After FCSP admits an operation, it forwards the call to the native CUDA driver (libcuda.so), which performs the actual device interaction and submits work to the GPU. Because FCSP operates strictly above the driver and hardware, it remains compatible with standard NVIDIA deployments while still providing strong, configurable isolation behavior.

Why This Architecture Works for Multi-Tenant Inference

This design is optimized for practical serving environments where many independent inference workers share a small number of GPUs:

  • Isolation without hardware partitioning: policies are enforced in software, enabling fine-grained tenancy and dynamic reconfiguration.
  • Low overhead under concurrency: lock-free shared memory avoids semaphore contention that commonly causes P99 latency spikes.
  • Predictable compute control: rate limiting at kernel launch time yields stable behavior even when utilization telemetry is noisy or delayed.
  • Crash-safe accounting: liveness tracking and cleanup prevent stale reservations from permanently reducing capacity.

Together, these layers make FCSP a portable, high-performance foundation for GPU resource governance in dense multi-tenant ML systems.

Now, let’s see how these main components are defined and implemented.

Lock-Free Shared Memory Region

The shared memory region is the coordination substrate for all FCSP operations. Unlike HAMi-core’s semaphore-protected region, FCSP uses lock-free data structures throughout.

Memory Layout
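A sketch of the layout is shown below. Field names and the MAX_DEVICES / MAX_PROCESSES sizes are illustrative rather than the exact FCSP ABI, but the structural choices match the design decisions listed afterwards.

```c
/* Sketch of the shared-region layout: 64-byte-aligned atomics to avoid
 * false sharing, per-device running totals to avoid O(N) scans, and
 * heartbeats kept in a separate array. Names and sizes are illustrative. */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>

#define CACHE_LINE     64
#define MAX_DEVICES    16
#define MAX_PROCESSES  1024

typedef struct {
    _Atomic uint64_t total_memory_used;   /* running total: O(1) limit checks */
    uint64_t         memory_limit;
    char             _pad[CACHE_LINE - 2 * sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE))) device_state_t;

typedef struct {
    _Atomic pid_t    pid;                 /* 0 = free, -2 = being reclaimed */
    char             _pad0[CACHE_LINE - sizeof(pid_t)];
    _Atomic uint64_t memory_used[MAX_DEVICES];   /* this process's per-device usage */
} __attribute__((aligned(CACHE_LINE))) process_slot_t;

typedef struct {
    uint32_t         magic;
    uint32_t         version;
    device_state_t   devices[MAX_DEVICES];
    process_slot_t   slots[MAX_PROCESSES];
    /* Heartbeats live in their own array: written at 1Hz by a dedicated
     * thread, so slot cache lines are not disturbed by liveness updates. */
    _Atomic uint64_t heartbeat_ns[MAX_PROCESSES] __attribute__((aligned(CACHE_LINE)));
} __attribute__((aligned(4096))) shared_region_t;
```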



Key Design Decisions:

  1. 64-byte cache-line alignment: Every frequently-accessed atomic field is padded to 64 bytes to prevent false sharing on multi-socket systems.
  2. Separated heartbeat array: Heartbeat timestamps are stored separately from process slots because they’re updated by a dedicated thread at 1Hz, while process slots are updated on every allocation.
  3. Running totals: Per-device total_memory_used counters are maintained atomically, eliminating the O(N) scan.

Memory Tracking Protocol
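A minimal sketch of the admission protocol follows; the numbered steps correspond to the memory-ordering analysis below, and the names are illustrative.

```c
/* Sketch of lock-free memory admission against a per-device running total. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

bool try_account_alloc(_Atomic uint64_t *total_used,
                       uint64_t limit, uint64_t size)
{
    /* Step 1: fast rejection for allocations that can never fit. */
    if (size > limit)
        return false;

    /* Step 2: relaxed load as a hint; a stale value is fine because
     * step 4 rechecks after the increment. */
    uint64_t hint = atomic_load_explicit(total_used, memory_order_relaxed);
    if (hint + size > limit)
        return false;

    /* Step 3: optimistic increment. acq_rel makes prior allocations
     * visible to us and publishes our increment to other processes. */
    uint64_t prev = atomic_fetch_add_explicit(total_used, size,
                                              memory_order_acq_rel);

    /* Step 4: recheck; this closes the TOCTOU window between steps 2 and 3. */
    if (prev + size > limit) {
        atomic_fetch_sub_explicit(total_used, size, memory_order_acq_rel);
        return false;   /* roll back and reject */
    }
    return true;        /* admitted allocations keep the accounted total within the limit */
}
```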



Memory Ordering Analysis:

  • Step 2 (relaxed load): Only a hint; stale values are acceptable because step 4 rechecks.
  • Step 3 (acq_rel): Acquire ensures we see all prior allocations; release publishes our increment.
  • Step 4 (recheck): Handles the TOCTOU race between steps 2 and 3.

This protocol guarantees that the sum of all allocations never exceeds the device limit, even under concurrent allocation storms.

Thread-Local Slot Caching
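A sketch of the idea, assuming a hypothetical lookup_slot_slow() helper that performs the one-time slot resolution:

```c
/* Sketch of thread-local slot caching: after the first lookup, the hot
 * path is a TLS load plus one predicted-taken branch. */
#include <stddef.h>

typedef struct process_slot process_slot_t;   /* defined in the shared region */

static __thread process_slot_t *tls_slot = NULL;

process_slot_t *lookup_slot_slow(void);       /* registers/locates our slot */

static inline process_slot_t *get_my_slot(void)
{
    /* Expected case: slot already cached -> TLS dereference + comparison. */
    if (__builtin_expect(tls_slot != NULL, 1))
        return tls_slot;

    /* Cold path: resolve the slot once and cache it for every
     * subsequent allocation on this thread. */
    tls_slot = lookup_slot_slow();
    return tls_slot;
}
```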



The __builtin_expect hints enable branch prediction optimization, reducing the hot-path to 2 CPU cycles (TLS dereference + comparison).

Hierarchical Per-Stream Rate Limiting

FCSP implements a two-tier rate limiting architecture that balances stream isolation with device-level fairness.

Token Bucket Algorithm
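The per-bucket mechanics can be sketched as follows; the same admit/refill logic is applied at both the per-stream and device tiers, and the field names and constants are illustrative.

```c
/* Sketch of deterministic token-bucket admission with a configurable
 * dampening factor applied to the estimated kernel cost. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef struct token_bucket {
    _Atomic int64_t  tokens;          /* current budget, in "cost" units */
    int64_t          capacity;        /* burst ceiling */
    int64_t          refill_per_sec;  /* deterministic rate from the tenant's quota */
    _Atomic uint64_t last_refill_ns;
    double           dampening;       /* e.g. 0.9: charge 90% of estimated cost */
} token_bucket_t;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Refill from elapsed wall time, then try to pay for the kernel. */
bool bucket_admit(token_bucket_t *b, int64_t estimated_cost)
{
    uint64_t now  = now_ns();
    uint64_t last = atomic_exchange_explicit(&b->last_refill_ns, now,
                                             memory_order_acq_rel);
    int64_t refill = (int64_t)((now - last) * (uint64_t)b->refill_per_sec
                               / 1000000000ull);
    if (refill > 0) {
        int64_t t = atomic_fetch_add_explicit(&b->tokens, refill,
                                              memory_order_acq_rel) + refill;
        if (t > b->capacity)                      /* clamp to burst capacity */
            atomic_store_explicit(&b->tokens, b->capacity, memory_order_release);
    }

    int64_t cost = (int64_t)(estimated_cost * b->dampening);
    if (atomic_fetch_sub_explicit(&b->tokens, cost, memory_order_acq_rel) >= cost)
        return true;                              /* admitted */

    atomic_fetch_add_explicit(&b->tokens, cost,   /* undo and defer the launch */
                              memory_order_acq_rel);
    return false;
}
```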



Dampening Factor Rationale: Our benchmarks revealed that slight under-throttling (90% of true cost) improves multi-stream efficiency by creating natural synchronization barriers. This counter-intuitive result motivated the configurable dampening factor.

Per-Stream Bucket Structure
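A trimmed-down view of a per-stream entry, reusing the token_bucket_t from the previous sketch (field names are illustrative):

```c
/* Sketch of a per-stream limiter entry: each CUstream gets its own bucket,
 * and every admission is also charged against a shared device-level bucket
 * (the second tier of the hierarchy). */
#include <cuda.h>
#include <stdint.h>

typedef enum {
    STREAM_CLASS_COMPUTE = 0,   /* regular kernels: fully rate-limited */
    STREAM_CLASS_MEMCPY,        /* copy engines: lightly limited */
    STREAM_CLASS_NCCL           /* collectives: bypass compute throttling */
} stream_class_t;

typedef struct {
    CUstream        stream;          /* the hooked CUDA stream handle */
    stream_class_t  cls;             /* set by the stream classifier */
    token_bucket_t  bucket;          /* per-stream tier (see sketch above) */
    token_bucket_t *device_bucket;   /* shared device-level tier for fairness */
    uint64_t        kernels_launched;
} stream_limiter_t;
```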



Exponential Backoff Wait

Unlike HAMi-core’s sched_yield() approach that incurs 1-2μs context switch overhead, FCSP uses busy-wait with exponential backoff:
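A sketch of the wait loop, assuming the bucket_admit() helper from the earlier token-bucket sketch:

```c
/* Sketch of the backoff wait used when a bucket has no tokens: spin with
 * PAUSE, doubling the spin count up to a cap, instead of yielding to the
 * scheduler. Constants are illustrative. */
#include <immintrin.h>   /* _mm_pause (x86); other ISAs use an equivalent hint */
#include <stdbool.h>
#include <stdint.h>

typedef struct token_bucket token_bucket_t;            /* earlier sketch */
bool bucket_admit(token_bucket_t *b, int64_t cost);    /* earlier sketch */

void wait_for_tokens(token_bucket_t *b, int64_t cost)
{
    uint32_t spins = 1;
    while (!bucket_admit(b, cost)) {
        for (uint32_t i = 0; i < spins; i++)
            _mm_pause();               /* tells the core we are spinning */
        if (spins < 1024)
            spins <<= 1;               /* exponential backoff, capped */
    }
}
```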



The PAUSE instruction signals to the CPU that we’re spinning, enabling power savings and improved SMT scheduling.

Stream Classification and NCCL Bypass

FCSP classifies streams to apply workload-appropriate throttling:
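One way to express the classification is sketched below; the heuristic shown, tagging streams created while NCCL initialization is in flight, is illustrative.

```c
/* Sketch of stream classification: streams created while an interposed
 * ncclCommInitRank call is on the stack are tagged as communication
 * streams; others default to compute. Tags match the per-stream sketch. */
#include <cuda.h>
#include <stdatomic.h>

typedef enum {
    STREAM_CLASS_COMPUTE = 0,
    STREAM_CLASS_MEMCPY,
    STREAM_CLASS_NCCL
} stream_class_t;

/* Set by the NCCL hook shown in the next subsection. */
_Atomic int g_in_nccl_init;

stream_class_t classify_stream(CUstream stream, unsigned int create_flags)
{
    (void)create_flags;
    if (stream == NULL)
        return STREAM_CLASS_COMPUTE;          /* default stream */
    if (atomic_load_explicit(&g_in_nccl_init, memory_order_acquire))
        return STREAM_CLASS_NCCL;             /* created during NCCL setup */
    return STREAM_CLASS_COMPUTE;
}
```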



NCCL Detection

NCCL streams are detected by hooking ncclCommInitRank:
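A sketch of the interposer, assuming the standard ncclCommInitRank signature from nccl.h; the g_in_nccl_init flag ties back to the classifier sketch above.

```c
/* Sketch of an ncclCommInitRank interposer: flag that NCCL initialization
 * is in progress so streams created underneath it can be classified as
 * communication streams, then forward to the real symbol. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <nccl.h>
#include <stdatomic.h>

extern _Atomic int g_in_nccl_init;   /* read by the stream classifier */

typedef ncclResult_t (*nccl_init_fn)(ncclComm_t *, int, ncclUniqueId, int);

ncclResult_t ncclCommInitRank(ncclComm_t *comm, int nranks,
                              ncclUniqueId commId, int rank)
{
    static nccl_init_fn real_fn;
    if (!real_fn)
        real_fn = (nccl_init_fn)dlsym(RTLD_NEXT, "ncclCommInitRank");

    atomic_store_explicit(&g_in_nccl_init, 1, memory_order_release);
    ncclResult_t rc = real_fn(comm, nranks, commId, rank);
    atomic_store_explicit(&g_in_nccl_init, 0, memory_order_release);
    return rc;
}
```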



Classification-Aware Throttling
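Putting the pieces together, the launch path can branch on the stream class roughly as follows (reusing the earlier token-bucket and classifier sketches; the memcpy discount is illustrative):

```c
/* Sketch of classification-aware throttling at kernel-launch time. */
#include <stdint.h>

typedef struct token_bucket token_bucket_t;            /* earlier sketch */
typedef enum { STREAM_CLASS_COMPUTE, STREAM_CLASS_MEMCPY,
               STREAM_CLASS_NCCL } stream_class_t;     /* as in the classifier sketch */

void wait_for_tokens(token_bucket_t *b, int64_t cost); /* earlier sketch */

void throttle_before_launch(stream_class_t cls,
                            token_bucket_t *stream_bucket,
                            token_bucket_t *device_bucket,
                            int64_t estimated_cost)
{
    switch (cls) {
    case STREAM_CLASS_NCCL:
        return;   /* never stall collectives: avoids cross-rank sync collapse */
    case STREAM_CLASS_MEMCPY:
        wait_for_tokens(stream_bucket, estimated_cost / 4);  /* charged lightly */
        return;
    case STREAM_CLASS_COMPUTE:
    default:
        wait_for_tokens(device_bucket, estimated_cost);      /* device-level fairness */
        wait_for_tokens(stream_bucket, estimated_cost);      /* per-stream isolation */
        return;
    }
}
```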



Crash Recovery via Heartbeat Reaper

FCSP implements automatic resource recovery for crashed processes:

Heartbeat Thread

Each FCSP-enabled process runs a lightweight heartbeat thread:
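A sketch of the heartbeat thread, assuming a g_my_heartbeat pointer into this process's entry in the shared region:

```c
/* Sketch of the 1Hz heartbeat thread: a detached pthread that stamps a
 * monotonic timestamp into this process's heartbeat slot once per second. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

extern _Atomic uint64_t *g_my_heartbeat;   /* points into the shared region */

static void *heartbeat_main(void *arg)
{
    (void)arg;
    for (;;) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        uint64_t now = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
        atomic_store_explicit(g_my_heartbeat, now, memory_order_release);
        sleep(1);                          /* 1Hz: cheap, and off the hot path */
    }
    return NULL;
}

void start_heartbeat(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, heartbeat_main, NULL);
    pthread_detach(tid);
}
```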



Reaper Thread (Single Instance per Node)

A dedicated reaper process (or daemon) monitors heartbeats:
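A sketch of the reaper loop follows; the 5-second timeout and the flat arrays standing in for the shared region are illustrative. Stale slots are handed to the cleanup routine sketched in the next subsection.

```c
/* Sketch of the node-level reaper: scan heartbeat timestamps, and any slot
 * whose owner has gone silent past the timeout (and whose pid is gone) is
 * reclaimed. */
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_PROCESSES      1024
#define HEARTBEAT_TIMEOUT  (5ull * 1000000000ull)   /* 5s of missed beats */

extern _Atomic uint64_t g_heartbeat_ns[MAX_PROCESSES];
extern _Atomic pid_t    g_slot_pid[MAX_PROCESSES];

uint64_t monotonic_now_ns(void);
void     reclaim_slot(int slot);           /* see the cleanup sketch below */

void reaper_loop(void)
{
    for (;;) {
        uint64_t now = monotonic_now_ns();
        for (int i = 0; i < MAX_PROCESSES; i++) {
            pid_t pid = atomic_load_explicit(&g_slot_pid[i], memory_order_acquire);
            if (pid <= 0)
                continue;                  /* free, or already being reclaimed */
            uint64_t hb = atomic_load_explicit(&g_heartbeat_ns[i],
                                               memory_order_acquire);
            /* Stale heartbeat AND the pid no longer exists -> crashed tenant. */
            if (now - hb > HEARTBEAT_TIMEOUT && kill(pid, 0) != 0)
                reclaim_slot(i);
        }
        sleep(1);
    }
}
```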



Atomic Slot Cleanup
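A sketch of the cleanup path, using the -2 sentinel described below (array names are illustrative and match the reaper sketch):

```c
/* Sketch of crash cleanup: CAS the dead pid to the -2 sentinel so no
 * newcomer can claim the slot mid-cleanup, return the slot's accounted
 * memory to the per-device totals, then release the slot. */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>

#define MAX_DEVICES   16
#define SLOT_REAPING  ((pid_t)-2)   /* sentinel: slot is being cleaned up */

extern _Atomic pid_t    g_slot_pid[];
extern _Atomic uint64_t g_slot_mem_used[][MAX_DEVICES];
extern _Atomic uint64_t g_device_total_used[MAX_DEVICES];

void reclaim_slot(int slot)
{
    pid_t dead = atomic_load_explicit(&g_slot_pid[slot], memory_order_acquire);
    if (dead <= 0)
        return;

    /* Only proceed if we win the race to mark the slot as "reaping". */
    if (!atomic_compare_exchange_strong_explicit(
            &g_slot_pid[slot], &dead, SLOT_REAPING,
            memory_order_acq_rel, memory_order_acquire))
        return;

    for (int d = 0; d < MAX_DEVICES; d++) {
        uint64_t leaked = atomic_exchange_explicit(&g_slot_mem_used[slot][d], 0,
                                                   memory_order_acq_rel);
        if (leaked)   /* give the ghost allocation back to the device budget */
            atomic_fetch_sub_explicit(&g_device_total_used[d], leaked,
                                      memory_order_acq_rel);
    }

    /* Finally mark the slot free; a new process may now claim it. */
    atomic_store_explicit(&g_slot_pid[slot], 0, memory_order_release);
}
```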



The sentinel value (-2) prevents race conditions where another process could claim the slot during cleanup.

Implementation

Hook Registration: FCSP hooks CUDA functions by intercepting dlsym:
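A sketch of the interposer is shown below. Resolving the real dlsym from inside a dlsym hook is glibc-specific; the dlvsym trick and its version string are one common approach and should be read as assumptions rather than FCSP's exact mechanism.

```c
/* Sketch of dlsym interposition: requests for hooked CUDA symbols return
 * FCSP wrappers, everything else is forwarded to the real dlsym.
 * Newer runtimes may resolve driver entry points via cuGetProcAddress,
 * which an interposer must also cover. */
#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <string.h>

/* FCSP wrappers, implemented elsewhere in the library (illustrative names). */
CUresult fcsp_cuMemAlloc_v2(CUdeviceptr *dptr, size_t bytesize);
CUresult fcsp_cuLaunchKernel(CUfunction f, unsigned gx, unsigned gy, unsigned gz,
                             unsigned bx, unsigned by, unsigned bz,
                             unsigned sharedMemBytes, CUstream hStream,
                             void **kernelParams, void **extra);

typedef void *(*dlsym_fn)(void *, const char *);

static dlsym_fn real_dlsym(void)
{
    static dlsym_fn fn;
    /* dlvsym with a versioned lookup avoids recursing into our own hook;
     * the version string below is for x86-64 glibc and is an assumption. */
    if (!fn)
        fn = (dlsym_fn)dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
    return fn;
}

void *dlsym(void *handle, const char *symbol)
{
    if (strcmp(symbol, "cuMemAlloc_v2") == 0)
        return (void *)&fcsp_cuMemAlloc_v2;          /* hand back the wrapper */
    if (strcmp(symbol, "cuLaunchKernel") == 0)
        return (void *)&fcsp_cuLaunchKernel;
    /* ... table lookup over the remaining Driver API and NVML hooks ... */
    return real_dlsym()(handle, symbol);             /* untouched symbols pass through */
}
```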



Hooked Functions: FCSP hooks 47 CUDA Driver API functions and 12 NVML functions:

Memory Management (18 functions):
  • cuMemAlloc_v2, cuMemAllocManaged, cuMemAllocPitch_v2
  • cuMemAllocAsync, cuMemAllocFromPoolAsync
  • cuMemFree_v2, cuMemFreeAsync
  • cuMemGetInfo_v2, cuMemcpy* family

Kernel Launch (6 functions):
  • cuLaunchKernel, cuLaunchKernelEx
  • cuLaunchCooperativeKernel, cuLaunchCooperativeKernelMultiDevice
  • cuGraphLaunch, cuGraphLaunchPipelined

Device Management (12 functions):
  • cuDeviceGet*, cuCtxCreate*, cuCtxDestroy*

Stream Management (8 functions):
  • cuStreamCreate*, cuStreamDestroy, cuStreamSynchronize

NVML (12 functions):
  • nvmlDeviceGetMemoryInfo, nvmlDeviceGetUtilizationRates
  • nvmlDeviceGetCount*, nvmlDeviceGetHandleBy*

Allocation Hash Map: FCSP uses a lock-free hash map for tracking individual allocations:
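A sketch of the map, matching the hash/lookup/insert description below; the ABA-protective tagged pointers are omitted here for brevity, and the bucket count and names are illustrative.

```c
/* Sketch of the allocation map: 2^12 buckets indexed by an FNV-1a hash of
 * the device pointer, lock-free insert via CAS on the bucket head, and
 * lock-free lookup with acquire loads. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define ALLOC_BUCKETS 4096   /* 12-bit index */

typedef struct alloc_node {
    uintptr_t                  dptr;   /* device pointer returned to the app */
    size_t                     size;
    struct alloc_node *_Atomic next;
} alloc_node_t;

static alloc_node_t *_Atomic g_buckets[ALLOC_BUCKETS];

static uint32_t fnv1a_12(uintptr_t p)
{
    uint64_t h = 1469598103934665603ull;          /* FNV offset basis */
    for (int i = 0; i < 8; i++) {
        h ^= (p >> (i * 8)) & 0xff;
        h *= 1099511628211ull;                    /* FNV prime */
    }
    return (uint32_t)(h ^ (h >> 12)) & (ALLOC_BUCKETS - 1);
}

void alloc_map_insert(uintptr_t dptr, size_t size)
{
    alloc_node_t *node = malloc(sizeof(*node));
    if (!node)
        return;
    node->dptr = dptr;
    node->size = size;
    alloc_node_t *_Atomic *head = &g_buckets[fnv1a_12(dptr)];
    alloc_node_t *old = atomic_load_explicit(head, memory_order_acquire);
    do {
        atomic_store_explicit(&node->next, old, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 head, &old, node, memory_order_release, memory_order_acquire));
}

size_t alloc_map_lookup(uintptr_t dptr)
{
    for (alloc_node_t *n = atomic_load_explicit(&g_buckets[fnv1a_12(dptr)],
                                                memory_order_acquire);
         n != NULL;
         n = atomic_load_explicit(&n->next, memory_order_acquire)) {
        if (n->dptr == dptr)
            return n->size;
    }
    return 0;   /* unknown pointer */
}
```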



Hash Function: FNV-1a on pointer value, reduced to 12 bits.

Lookup: Lock-free traversal with acquire semantics on next pointer loads.

Insert: CAS on bucket head with tagged pointers for ABA prevention.

Batch Accounting Optimization: To reduce atomic operation frequency, FCSP accumulates allocation deltas locally:
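A sketch of the idea, with an illustrative flush threshold:

```c
/* Sketch of batched accounting: allocation deltas accumulate in a
 * thread-local counter and are flushed to the shared atomic only when
 * they exceed a threshold (or on free/teardown). */
#include <stdatomic.h>
#include <stdint.h>

#define FLUSH_THRESHOLD_BYTES (8ll << 20)    /* flush every ~8 MiB of deltas */

extern _Atomic uint64_t *g_device_total;     /* per-device running total */

static __thread int64_t tls_pending_delta;   /* not yet published globally */

static inline void flush_pending(void)
{
    if (tls_pending_delta != 0) {
        /* Unsigned wraparound makes a negative delta behave as subtraction. */
        atomic_fetch_add_explicit(g_device_total, (uint64_t)tls_pending_delta,
                                  memory_order_acq_rel);
        tls_pending_delta = 0;
    }
}

void account_delta(int64_t bytes)            /* +size on alloc, -size on free */
{
    tls_pending_delta += bytes;
    if (tls_pending_delta >= FLUSH_THRESHOLD_BYTES ||
        tls_pending_delta <= -FLUSH_THRESHOLD_BYTES)
        flush_pending();                     /* one global atomic per batch */
}
```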



This reduces global atomic operations by 10-100× for allocation-heavy workloads.

Conclusion

FCSP demonstrates that high-density GPU sharing can deliver strong multi-tenant isolation without sacrificing latency-sensitive performance. By replacing semaphore-coordinated shared memory with C11-atomic, cache-line-aligned lock-free structures, FCSP removes contention hot spots and enables sub-microsecond memory enforcement at scale. Its deterministic, hierarchical token-bucket throttling further stabilizes compute fairness without feedback-loop lag, while stream-aware policies (including NCCL bypass) preserve critical communication paths for distributed workloads. Together, these design choices unlock higher tenant density with predictable degradation targets and improved operational resilience. In the next article, we’ll explore a detailed performance comparison of FCSP against HAMi across benchmarks and real LLM inference workloads.



Most enterprises don’t have a GPU performance problem—they have a GPU wastage problem. Clusters packed with A100s and H100s routinely run GenAI workloads at a fraction of their capacity, burning budget on idle VRAM, unused compute, and over-provisioned “just in case” headroom. The result is quiet but massive leakage in AI infrastructure spend, especially in […]