If you’ve ever tried to share a GPU between multiple users or workloads in a Kubernetes cluster, you’ve probably heard of NVIDIA’s Multi-Instance GPU (MIG) technology. It’s the official, hardware-backed solution for GPU partitioning. But what if I told you there’s a compelling software alternative that might actually be better for your use case?
Enter FCSP (Fixed Capacity Spatial Partition) – a sophisticated software-based GPU virtualization library that provides MIG-like multi-tenant isolation without the hardware constraints. In this article, we’ll explore when and why you might choose FCSP over native MIG, backed by real benchmarks and production considerations.
What is NVIDIA MIG?
Multi-Instance GPU (MIG) is a hardware feature introduced with the NVIDIA Ampere architecture (A100) and continued in Hopper (H100) GPUs. It allows a single physical GPU to be partitioned into up to 7 isolated instances, each with dedicated:
- Memory: Physically isolated memory regions
- SMs (Streaming Multiprocessors): Dedicated compute units
- Memory Controllers: Guaranteed memory bandwidth
MIG partitions are “hard” – once created, they’re fixed until explicitly reconfigured. Each MIG instance appears as a separate GPU to CUDA applications.
What is FCSP?
FCSP is a software-based GPU virtualization layer that achieves similar multi-tenant isolation through:
- LD_PRELOAD interception: Transparent CUDA API interception
- Token-bucket rate limiting: Fine-grained compute time allocation
- Memory quota enforcement: Hard/soft memory limits per process
- Work-conserving scheduling: Unused resources flow to active tenants
- Intelligent prefetching: UVM-based memory oversubscription
The key difference: FCSP operates entirely in software, works on any CUDA-capable GPU, and provides dynamic, flexible resource allocation.
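To make the LD_PRELOAD mechanism concrete, here is a minimal interposer sketch in C. It is illustrative only, not FCSP source: the quota value, the usage counter, and the simplified cudaError_t stand-in are all assumptions; the dlsym(RTLD_NEXT, ...) pattern is the standard way such shims forward to the real CUDA runtime.

```c
// Minimal LD_PRELOAD interposer sketch (illustrative, not FCSP code).
// Build: gcc -shared -fPIC shim.c -o libshim.so -ldl
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

typedef int cudaError_t;                    // simplified stand-in for the real type
static size_t used_bytes  = 0;              // per-process usage tracker (assumed)
static size_t quota_bytes = 8UL << 30;      // example 8GB quota (assumed)

cudaError_t cudaMalloc(void **ptr, size_t size) {
    // Resolve the real cudaMalloc from the next library in the link chain
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (!real_cudaMalloc)
        real_cudaMalloc = (cudaError_t (*)(void **, size_t))
                          dlsym(RTLD_NEXT, "cudaMalloc");

    if (used_bytes + size > quota_bytes)
        return 2;                           // cudaErrorMemoryAllocation

    cudaError_t err = real_cudaMalloc(ptr, size);
    if (err == 0)                           // cudaSuccess
        used_bytes += size;
    return err;
}
```

Launched as `LD_PRELOAD=./libshim.so ./app`, the shim sees every cudaMalloc before the driver does, which is the same interception point FCSP uses for throttling and quota enforcement.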
The Problem with MIG
MIG is excellent technology, but it comes with significant limitations such as strict hardware requirements, static partitioning, limited partition profiles, and the potential for wasted resources.
1. Hardware Requirements
MIG (Multi-Instance GPU) support is limited to a very small set of NVIDIA’s data center–class GPUs. Specifically, MIG is available only on the A100 (both 40GB and 80GB variants), A30, H100, and H200. These GPUs are designed for large-scale enterprise and cloud environments, where hardware-level isolation and fine-grained GPU partitioning are critical for multi-tenant workloads.
Outside of this narrow group, MIG is not supported. Consumer GPUs do not offer MIG capabilities, and neither do older or lower-tier data center accelerators such as the V100 or T4. Likewise, no RTX-series GPUs—regardless of their performance—can use MIG. In practical terms, this means that MIG is exclusively a feature of NVIDIA’s most expensive, modern data center GPUs, and it is not an option for systems built on consumer or legacy hardware.
2. Static Partitioning
MIG partitions are fixed at creation time. Reconfiguring them requires:
- No running CUDA processes
- MIG mode enabled (requires GPU reset)
- Specific, predefined partition profiles
```bash
$ nvidia-smi mig -cgi 19,19,19 -C   # Create 3 MIG instances (1g.10gb each)
# GPU must be idle, process disruption required
```
Changing MIG partition sizes is not a simple or flexible operation. To make any adjustments, all running workloads must first be terminated. Existing MIG instances then need to be destroyed and recreated with the new configuration, after which workloads must be restarted. This process is inherently disruptive and introduces operational overhead, making it both time-consuming and costly in production environments.
3. Limited Partition Profiles
MIG doesn’t allow arbitrary resource splits. You’re limited to predefined profiles:
| A100-80GB Profile | Memory | SMs | Instances |
|---|---|---|---|
| 1g.10gb | 10GB | 14 | Up to 7 |
| 2g.20gb | 20GB | 28 | Up to 3 |
| 3g.40gb | 40GB | 42 | Up to 2 |
| 7g.80gb | 80GB | 98 | 1 |
MIG offers little flexibility in how resources are divided. You cannot create arbitrary splits such as 30%/70%, evenly divide the GPU into five equal instances, or dynamically rebalance resources as workload demands change. You are constrained to a small set of predefined profiles, which often fail to align with real-world usage patterns.
4. Wasted Resources
With MIG’s static allocation, idle instances can’t share resources. Consider a scenario with 3 MIG instances where only one tenant is active: Tenant A is capped at 33% of the GPU even though the other 67% of the hardware sits idle.

How FCSP Solves These Problems
1. Universal GPU Support
FCSP delivers universal support across virtually all CUDA-enabled GPUs. For example, even an RTX 3080 can offer MIG-like isolation, while a V100 cluster gains true multi-tenancy, without requiring specialized or next-generation hardware. FCSP-supported GPUs include the following (a minimal capability check is sketched after the list):
- All Data Center: A100, H100, V100, T4, A10, L4, etc.
- All Professional: RTX A6000, RTX 4000/5000/6000, etc.
- All Consumer: RTX 3080/3090/4080/4090, etc.
- Legacy: GTX 1080 Ti, Titan V, etc
- Minimum: CUDA Compute Capability 3.0+
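If you want to verify the compatibility floor on your own hardware, a few lines against the CUDA runtime API will do; this sketch assumes only the standard cudaGetDeviceCount/cudaGetDeviceProperties calls:

```c
// Print each visible GPU and whether it meets the CC 3.0+ floor above.
// Build: nvcc check_cc.c -o check_cc
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s (CC %d.%d) -> %s\n",
               i, prop.name, prop.major, prop.minor,
               prop.major >= 3 ? "supported" : "unsupported");
    }
    return 0;
}
```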
2. Dynamic Resource Allocation
FCSP partitions can be adjusted dynamically at runtime without disrupting running workloads—no restarts, no service interruption, and instant rebalancing as demands change.
```bash
# Change tenant limits on the fly
$ export BUD_SM_LIMIT=50        # 50% compute
$ export BUD_MEMORY_LIMIT=8G    # 8GB memory

# Or per-device
$ export BUD_SM_LIMIT_DEV0=30
$ export BUD_SM_LIMIT_DEV1=70

# Changes take effect immediately for new allocations
```
3. Arbitrary Resource Splits
You can define GPU partition sizes at any percentage you choose, with precise control over how resources are allocated. For example:

```bash
# Tenant A: 25% compute, 4GB memory
# Tenant B: 35% compute, 6GB memory
# Tenant C: 40% compute, 10GB memory

# Or dynamic fair-share
$ export BUD_ISOLATION_MODE=adaptive
# Resources automatically balance based on demand
```
4. Work-Conserving Scheduling
This is FCSP’s standout feature: idle GPU resources automatically flow to active tenants, maximizing utilization without any manual effort. For example, in a scenario with three tenants, if only Tenant A is active, it can use the full GPU. As other tenants become active, resources are dynamically reclaimed and redistributed. Our benchmarks demonstrate this can achieve up to 143.9% efficiency compared to static allocation—exceeding 100% because compute-heavy and memory-heavy workloads complement each other perfectly.
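The redistribution itself can be pictured as a simple fair-share pass. The sketch below is a toy model with assumed names (tenant_t, nominal_share), not FCSP's scheduler:

```c
// Illustrative work-conserving share calculation (assumed types/names,
// not FCSP internals): idle tenants' shares flow to active ones.
typedef struct {
    double nominal_share;   // configured share, e.g. 0.25
    int    active;          // 1 if the tenant issued work recently
    double effective_share; // what the scheduler actually grants
} tenant_t;

void rebalance(tenant_t *t, int n) {
    double active_nominal = 0.0, idle_pool = 0.0;
    for (int i = 0; i < n; i++) {
        if (t[i].active) active_nominal += t[i].nominal_share;
        else             idle_pool      += t[i].nominal_share;
    }
    for (int i = 0; i < n; i++) {
        if (!t[i].active) { t[i].effective_share = 0.0; continue; }
        // Each active tenant keeps its floor plus a proportional
        // slice of whatever the idle tenants are not using.
        t[i].effective_share = t[i].nominal_share +
            idle_pool * (t[i].nominal_share / active_nominal);
    }
}
```

With four tenants at a 25% nominal share each and only one active, the active tenant's effective share works out to 100%, matching the workday example later in this article.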

FCSP Architecture Deep Dive
The core components of FCSP form a layered architecture designed to deliver dynamic, fine-grained GPU partitioning and multi-tenant resource management without modifying existing applications. At the top level, CUDA applications run unmodified, with FCSP intercepting GPU calls via LD_PRELOAD to integrate seamlessly. The heart of the system is libvgpu.so, which orchestrates resource allocation through modules such as the Memory Manager, Compute Throttler, and Stream Classifier, while also handling NCCL hooks, UVM prefetching, and graph optimization to maximize GPU efficiency.
Resource usage is tracked in a shared memory region, maintaining per-process metrics, global utilization data, token bucket states, and burst pool management, enabling real-time visibility into GPU activity. The SM Observer Thread continuously monitors GPU status via NVML, calculates fair-share allocations, detects idle tenants, and tracks contention, ensuring resources are dynamically rebalanced. Together, these components allow FCSP to deliver high efficiency, flexibility, and multi-tenancy on any CUDA-enabled GPU.
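A rough picture of those two pieces in C: the struct fields and the polling behavior come from this article's description, everything else (names, the elided rebalance step) is assumed, and the NVML call is the genuine API.

```c
// Sketch of the shared-memory metrics and SM observer loop described
// above; assumes nvmlInit() and device-handle lookup already happened.
#include <nvml.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

typedef struct {
    _Atomic uint64_t kernels_launched;  // per-process activity counter
    _Atomic uint64_t bytes_allocated;   // per-process memory footprint
    _Atomic uint64_t last_active_us;    // input for idle-tenant detection
} proc_metrics_t;

void observer_loop(nvmlDevice_t dev, proc_metrics_t *procs, int nprocs) {
    (void)procs; (void)nprocs;          // consumed by the elided rebalance step
    for (;;) {
        nvmlUtilization_t util;
        // Real NVML call: GPU and memory utilization since the last query
        nvmlDeviceGetUtilizationRates(dev, &util);
        // ...compare util.gpu against granted shares, mark tenants with a
        // stale last_active_us as idle, publish new effective shares...
        usleep(5000);                   // the 5ms interval cited later in this article
    }
}
```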
Core Components

Isolation Modes
FCSP provides four isolation modes to match your requirements:
1. None (BUD_ISOLATION_MODE=none)
- No isolation enforcement
- Useful for single-tenant scenarios
- Minimal overhead (~40ns per API call)
2. Balanced (BUD_ISOLATION_MODE=balanced)
- Default mode
- Fair sharing with burst capability
- 20% floor + 40% shared pool + 40% burst pool
- Best for mixed workloads

3. Strict (BUD_ISOLATION_MODE=strict)
- Hard quotas, no bursting
- MIG-like behavior
- Maximum isolation, minimum efficiency
4. Adaptive (BUD_ISOLATION_MODE=adaptive)
- Automatically switches based on contention
- Low contention → relaxed limits (more throughput)
- High contention → strict limits (better isolation)
- Best of both worlds (a sketch of the switching logic follows this list)
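A toy version of that switching logic, with an invented threshold (FCSP's real contention heuristics are not documented here):

```c
// Hypothetical contention-based mode switch; the 70% threshold is
// invented for illustration, not taken from FCSP.
typedef enum { MODE_RELAXED, MODE_STRICT } iso_mode_t;

iso_mode_t pick_mode(int active_tenants, double gpu_util) {
    // One tenant, or plenty of headroom: favor throughput and bursting.
    if (active_tenants <= 1 || gpu_util < 0.70)
        return MODE_RELAXED;
    // Several tenants on a busy GPU: favor isolation.
    return MODE_STRICT;
}
```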
Rate Limiting Implementation
FCSP uses a sophisticated token bucket algorithm:
```c
// Per-stream token buckets (eliminates CAS contention)
typedef struct {
    _Atomic int64_t tokens;   // Current tokens
    int64_t max_tokens;       // Bucket capacity
    int64_t refill_rate;      // Tokens per second
    uint64_t last_update;     // Last refill timestamp
} token_bucket_t;

// Kernel launch cost = grids × blocks × TOKEN_FACTOR(32)
// Example: 256 grids × 256 blocks × 32 = 2,097,152 tokens
```
The system also includes:
- Batch token consumption: Thread-local caching reduces atomic operations by 8-16x
- PID controller: Smooth rate limiting with proportional-integral-derivative control
- Exponential backoff: Graceful throttling (50ns → 10µs)
Memory Management
FCSP provides flexible memory limits:
```bash
# Absolute limit
export BUD_MEMORY_LIMIT=8G

# Percentage of GPU memory
export BUD_MEMORY_LIMIT=50%

# Enforcement modes
export BUD_MEMORY_LIMIT_MODE=hard   # Reject allocations
export BUD_MEMORY_LIMIT_MODE=soft   # Warning + callback

# Per-device limits
export BUD_MEMORY_LIMIT_DEV0=4G
export BUD_MEMORY_LIMIT_DEV1=8G
```
Plus advanced UVM (Unified Virtual Memory) features, with a minimal CUDA sketch after the list:
- Intelligent prefetching: Predict and prefetch based on access patterns
- Memory pressure monitoring: Early warning at 70%, 85%, 95% thresholds
- Automatic eviction: LRU, access-aware, or FIFO policies
- Oversubscription: Use more GPU memory than physically available
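The CUDA primitives underneath these features are cudaMallocManaged and cudaMemPrefetchAsync; the sketch below shows the raw mechanism, independent of FCSP's policy layer (the 16GB figure is an arbitrary oversubscription example):

```c
// Minimal UVM oversubscription sketch using real CUDA runtime calls.
// Build: nvcc uvm_demo.c -o uvm_demo
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int dev = 0;
    cudaSetDevice(dev);

    // Managed memory may exceed physical GPU memory; pages migrate on demand.
    size_t bytes = (size_t)16 << 30;  // e.g. 16GB on a 10GB card
    float *buf = NULL;
    if (cudaMallocManaged((void **)&buf, bytes, cudaMemAttachGlobal) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // Hint the driver to stage the hot region on the GPU ahead of use --
    // the same mechanism an access-pattern-based prefetcher would drive.
    cudaMemPrefetchAsync(buf, (size_t)64 << 20, dev, 0);  // 64MB window

    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```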
Benchmark Comparison
Let’s look at real numbers comparing FCSP to native execution and MIG.
Test Environment
- GPU: NVIDIA RTX 3080 (10GB, 68 SMs)
- FCSP: Adaptive isolation mode
- Iterations: 100 per test
Overhead Metrics
| Metric | Native | FCSP | MIG (A100) |
|---|---|---|---|
| Kernel Launch Latency | ~3 µs | 4.9 µs | ~3.5 µs |
| Memory Alloc Latency | ~100 µs | 704 µs | ~100 µs |
| API Interception | 0 ns | 40 ns | 0 ns |
| Rate Limiter | N/A | 1.5 µs | N/A |
Analysis: FCSP adds ~2µs overhead per kernel launch – negligible for typical GPU workloads where kernels run for milliseconds. Memory allocation is slower due to tracking, but this is amortized over the allocation lifetime.
Isolation Metrics
| Metric | FCSP (Balanced) | FCSP (Adaptive) | MIG |
|---|---|---|---|
| Fairness Index | 0.996 | 0.996 | 1.0 |
| QoS Consistency (CV) | 0.07 | 0.07 | <0.05 |
| Noisy Neighbor Impact | 11.5% | 4.66% | ~3% |
| Cross-Tenant Isolation | 81.4% | 81.4% | ~95% |
Analysis: FCSP achieves near-perfect fairness (0.996 out of 1.0). With adaptive isolation, noisy neighbor impact drops to 4.66% – approaching MIG’s ~3%. The tradeoff is flexibility: FCSP allows work-conservation, MIG doesn’t.
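Those fairness figures behave like Jain's fairness index, J = (Σxᵢ)²/(n·Σxᵢ²), which equals 1.0 when every tenant gets identical throughput. Assuming that is the metric behind the table, it is computed as:

```c
// Jain's fairness index over per-tenant throughputs; 1.0 = perfectly fair.
// (Assumption: this is the metric behind the 0.996 figure above.)
double jain_fairness(const double *x, int n) {
    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < n; i++) {
        sum    += x[i];
        sum_sq += x[i] * x[i];
    }
    return (sum * sum) / (n * sum_sq);
}
```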
Efficiency Metrics
| Metric | FCSP | MIG |
|---|---|---|
| Affinity Complementary Efficiency | 143.9% | N/A |
| Work Conservation Benefit | Up to 67% | 0% |
| Resource Utilization (1 of 3 tenants active) | ~100% | 33% |
Analysis: FCSP’s work-conservation is a game-changer. When tenants are idle, active tenants can use the full GPU. MIG wastes 67% of resources in this scenario.
Feature Comparison Matrix
| Feature | FCSP | MIG |
|---|---|---|
| **Hardware Support** | | |
| Consumer GPUs (RTX) | ✅ | ❌ |
| V100, T4, A10 | ✅ | ❌ |
| A100, H100 | ✅ | ✅ |
| **Partitioning** | | |
| Arbitrary splits | ✅ | ❌ |
| Dynamic resizing | ✅ | ❌ |
| No-disruption changes | ✅ | ❌ |
| Work-conservation | ✅ | ❌ |
| **Isolation** | | |
| Memory isolation | Software | Hardware |
| Compute isolation | Software | Hardware |
| Fault isolation | Partial | Complete |
| Error containment | Per-process | Per-instance |
| **Performance** | | |
| Zero overhead option | ✅ | ✅ |
| Burst capability | ✅ | ❌ |
| QoS guarantees | Soft | Hard |
| **Operations** | | |
| Kubernetes integration | Via device plugin | Native |
| Monitoring | NVML + custom | NVML |
| Configuration | Env vars | nvidia-smi |
Use Cases: When to Choose FCSP vs MIG
This section maps common scenarios to FCSP or MIG, so you can choose a GPU partitioning solution based on workload needs, resource flexibility, and operational requirements.
When to choose FCSP?
Choose FCSP when you need flexible, dynamic GPU partitioning with multi-tenant support across any CUDA-enabled GPU. It can be the right choice in scenarios like the following:
1. You Don’t Have MIG-Capable GPUs
The most straightforward scenario for choosing FCSP is when your environment includes consumer GPUs like the RTX series, older data center GPUs such as V100, T4, or P100, or mixed GPU clusters. In these cases, FCSP is the only solution that provides true multi-tenant isolation, enabling dynamic partitioning and efficient resource sharing across hardware that MIG does not support.
2. You Need Dynamic Resource Allocation
Choose FCSP when you require dynamic resource allocation that adapts to changing workload demands in real time.
```yaml
# Kubernetes scenario: training job needs more resources during backprop
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    env:
    - name: BUD_SM_LIMIT
      value: "50"   # Start at 50%

# FCSP allows runtime adjustment:
# kubectl exec training-job -- export BUD_SM_LIMIT=80
```
3. Variable Workload Patterns
Choose FCSP when your workloads have variable or unpredictable patterns that require flexible GPU resource management. It’s ideal for development environments that burst during testing and remain idle while coding, batch processing jobs that start and finish at different times, and time-sharing scenarios where different users are active at different hours.
An example Workday Pattern (Work-Conservation):
- 9:00 AM: All 4 tenants active → 25% each
- 12:00 PM: Tenant B at lunch → A,C,D get 33% each
- 2:00 PM: Tenant D in meeting → A,C get 50% each
- 6:00 PM: Only Tenant A working → A gets 100%
In this scenario, MIG would waste 75% of the GPU at 6 PM, while FCSP puts all of it to work.
4. Complementary Workload Scheduling
FCSP’s workload affinity feature can co-schedule compute-heavy and memory-heavy workloads for better-than-isolated performance (a toy pairing heuristic is sketched after the comparison):
Without Affinity:
- Compute-heavy job: 50% SM utilization, 10% memory BW
- Memory-heavy job: 10% SM utilization, 50% memory BW
- Total GPU utilization: ~55% (many resources idle)
With FCSP Affinity:
- Total GPU utilization: 60% SM + 60% memory BW
- Both jobs run together
- Efficiency: 143.9% vs isolated execution
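One way to picture the pairing decision is a complementarity score over each job's SM and memory-bandwidth demand. This heuristic is invented for illustration and is not FCSP's published algorithm:

```c
// Toy complementarity heuristic (illustrative): pair jobs whose combined
// SM and memory-bandwidth demands both stay under capacity.
typedef struct {
    double sm_util;   // fraction of SM capacity the job needs (0..1)
    double mem_bw;    // fraction of memory bandwidth it needs (0..1)
} job_profile_t;

// Higher score = better co-schedule candidate; negative means they collide.
double pair_score(job_profile_t a, job_profile_t b) {
    double sm_total = a.sm_util + b.sm_util;
    double bw_total = a.mem_bw + b.mem_bw;
    if (sm_total > 1.0 || bw_total > 1.0)
        return -1.0;  // the pair would oversubscribe one dimension
    return sm_total + bw_total;  // reward pairs that fill both dimensions
}
```

Plugging in the example above: the compute-heavy job (0.5, 0.1) and the memory-heavy job (0.1, 0.5) sum to 0.6 in each dimension, so they co-schedule cleanly instead of idling half the GPU.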
5. LLM Inference at Scale
LLM inference presents unique challenges, including bursty memory allocation for KV caches, mixed compute patterns such as attention versus feed-forward networks, and variable batch sizes. FCSP addresses these demands with LLM-optimized profiles, ensuring efficient resource allocation and high performance for large language model workloads.
```bash
# LLM-optimized UVM configuration
export BUD_UVM_PROFILE=llm_inference

# Enables: larger prefetch (64MB), aggressive pattern detection,
# handles bursty KV cache, optimized for attention patterns
```
When to choose MIG?
Choose MIG when you need fixed, hardware-level GPU partitioning on supported NVIDIA data center GPUs.
1. Maximum Isolation is Critical
If tenant A crashing must never impact tenant B—such as in financial trading systems, medical imaging, or other safety-critical applications—MIG is the preferred choice for its hardware-level fault isolation, whereas FCSP offers process-level isolation.
2. Regulatory Compliance Requires Hardware Isolation
Certain compliance frameworks explicitly mandate hardware-level isolation, including HIPAA for healthcare, PCI-DSS for payment processing, and SOC 2 Type II. In these cases, MIG’s hardware partitioning can meet auditor requirements where software-based isolation like FCSP may not.
3. Guaranteed, Predictable Performance
MIG provides hard QoS guarantees:
MIG Instance: 1g.10gb
- Guaranteed: 14 SMs (always)
- Guaranteed: 10GB memory (always)
- No variability, no neighbors
FCSP (even with strict mode):
- Target: 14 SMs
- Actual: 13-15 SMs (software scheduling variance)
- Approx 0.07 coefficient of variation
If your SLA requires “exactly 14 SMs, never less”, use MIG.
4. You Have A100/H100 and Static Workloads
If you have:
- MIG-capable hardware (A100, H100)
- Predictable, always-on workloads
- No need for resource flexibility
MIG is simpler to operate and has zero runtime overhead.
Configuration Guide
This guide walks through setting up FCSP and configuring it for your GPU environment.
Basic FCSP Setup
```bash
# 1. Build FCSP
cd bud_fcsp && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j8

# 2. Run with FCSP
LD_PRELOAD=/path/to/libvgpu.so python train.py
```
Multi-Tenant Kubernetes Deployment
```yaml
# ConfigMap for FCSP settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: fcsp-config
data:
  isolation-mode: "adaptive"
  tenant-floor: "20"
  shared-pool: "40"
  burst-pool: "40"
---
# Pod with FCSP isolation
apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
  labels:
    fcsp.io/tenant: "team-a"
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    env:
    - name: LD_PRELOAD
      value: "/opt/fcsp/libvgpu.so"
    - name: BUD_ISOLATION_MODE
      valueFrom:
        configMapKeyRef:
          name: fcsp-config
          key: isolation-mode
    - name: BUD_SM_LIMIT
      value: "50"
    - name: BUD_MEMORY_LIMIT
      value: "8G"
    volumeMounts:
    - name: fcsp-lib
      mountPath: /opt/fcsp
  volumes:
  - name: fcsp-lib
    hostPath:
      path: /usr/local/lib/fcsp
```
Production Configuration Profiles
Profile 1: Development Cluster
```bash
# Maximize flexibility, allow bursting
export BUD_ISOLATION_MODE=adaptive
export BUD_TENANT_FLOOR_PCT=10            # Low guaranteed floor
export BUD_SHARED_POOL_PCT=30             # Moderate shared pool
export BUD_BURST_MAX_PCT=60               # High burst capability
export BUD_ADAPTIVE_SM_MAX_BURST_PCT=100  # Allow full burst
```
Profile 2: Production ML Serving
```bash
# Balance isolation and efficiency
export BUD_ISOLATION_MODE=balanced
export BUD_TENANT_FLOOR_PCT=40   # High guaranteed floor
export BUD_SHARED_POOL_PCT=30    # Moderate shared pool
export BUD_BURST_MAX_PCT=30      # Limited burst
export BUD_COOPERATIVE_MODE=1    # Smooth throttling
```
Profile 3: High-Isolation Multi-Tenant
```bash
# Maximum isolation, MIG-like behavior
export BUD_ISOLATION_MODE=strict
export BUD_TENANT_FLOOR_PCT=100  # All resources guaranteed
export BUD_SHARED_POOL_PCT=0     # No sharing
export BUD_BURST_MAX_PCT=0       # No bursting
```
Performance Tuning
These settings help maximize GPU efficiency and workload throughput under FCSP.
Reducing Overhead
```bash
# 1. Enable fast paths (default in production)
export BUD_FAST_PATH_ENABLED=1

# 2. Increase batch token size for high-throughput
export BATCH_TOKENS_DEFAULT_SIZE=64   # Default: 16

# 3. Disable unnecessary features
export BUD_ENABLE_METRICS=0           # If not monitoring
export BUD_LOG_LEVEL=error            # Reduce logging

# 4. Use SIMD optimizations (auto-detected)
# FCSP automatically uses AVX2/SSE4.2 for slot finding
```
Improving Isolation
```bash
# 1. Use adaptive mode for dynamic workloads
export BUD_ISOLATION_MODE=adaptive

# 2. Increase floor for better guarantees
export BUD_TENANT_FLOOR_PCT=40

# 3. Enable PID controller for smooth throttling
export BUD_USE_PID_CONTROLLER=1
export PID_DEFAULT_KP=0.8
export PID_DEFAULT_KI=0.2

# 4. Reduce observer interval for faster response
export BUD_OBSERVER_INTERVAL_US=2000   # 2ms instead of 5ms
```
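To see how the exported gains feed a throttle, here is a minimal proportional-integral sketch; only Kp and Ki map to the variables above, and every other detail (names, the omitted derivative term) is an assumption:

```c
// Minimal PI controller sketch for smooth throttling; gains correspond
// to PID_DEFAULT_KP / PID_DEFAULT_KI above, other details are assumed.
typedef struct {
    double kp, ki;    // proportional / integral gains
    double integral;  // accumulated error
} pi_ctrl_t;

// target/actual are utilization shares in [0,1]; returns a throttle
// adjustment: positive = allow more launches, negative = slow down.
double pi_step(pi_ctrl_t *c, double target, double actual, double dt) {
    double error = target - actual;
    c->integral += error * dt;
    return c->kp * error + c->ki * c->integral;
}
```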
Memory Oversubscription (UVM)
```bash
# Enable UVM oversubscription for large models
export BUD_UVM_ENABLED=1
export BUD_UVM_PROFILE=llm_inference

# Or manual configuration
export BUD_UVM_PRESSURE_WARNING_PCT=70
export BUD_UVM_PRESSURE_HIGH_PCT=85
export BUD_UVM_PRESSURE_CRITICAL_PCT=95
export BUD_PREFETCH_ENABLED=1
export BUD_PREFETCH_MAX_SIZE_MB=64
```
Limitations and Considerations
Both FCSP and MIG have trade-offs that should be carefully considered when choosing the right GPU partitioning solution.
FCSP Limitations
- Software Isolation Only: A malicious or buggy tenant could potentially bypass isolation (MIG has hardware enforcement)
- Overhead: ~2µs per kernel launch, ~600µs for memory allocation (negligible for most workloads, but measurable)
- No Hardware Error Isolation: GPU errors (ECC, timeouts) affect all tenants
- Requires LD_PRELOAD: Application must use dynamically-linked CUDA
- Approximated Metrics: SM utilization is polled (5ms interval), not instantaneous
MIG Limitations
- Limited Hardware: Only A100, A30, H100, H200
- Fixed Profiles: Cannot create arbitrary partition sizes
- Static Allocation: Requires workload termination to resize
- No Work Conservation: Idle resources are wasted
- Reduced Total Performance: Sum of MIG instances < full GPU performance
- Complexity: More infrastructure to manage (MIG instances, CUDA MIG handles)
Conclusion
The choice between FCSP and MIG isn’t about which is “better” – it’s about which fits your requirements:
| Requirement | Recommendation |
|---|---|
| No MIG hardware | FCSP (only option) |
| Dynamic workloads | FCSP (work-conservation) |
| Maximum isolation | MIG (hardware guarantee) |
| Compliance requirements | MIG (auditable hardware) |
| Development clusters | FCSP (flexibility) |
| Predictable production | MIG (simplicity) |
| Mixed GPU types | FCSP (universal support) |
| LLM inference | FCSP (optimized profiles) |
For most practical scenarios, FCSP provides 80-95% of MIG’s isolation benefits with significantly more flexibility and universal hardware support. The work-conservation feature alone can improve cluster utilization by 40-67% in realistic multi-tenant scenarios. Consider FCSP when you value flexibility, efficiency, and broad hardware support. Choose MIG when you need the absolute guarantee of hardware isolation and have compatible GPUs.
This article is based on FCSP v1.0 and NVIDIA MIG as of December 2025. Benchmarks were conducted on an NVIDIA RTX 3080 (10GB, 68 SMs) with CUDA 12.0.