
GPU Slicing

Time-slicing allows GPUs to be oversubscribed. Under the hood, CUDA time-slicing allows workloads that land on oversubscribed GPUs to interleave with one another. Each workload has access to the GPU memory and runs in the same fault domain as all the others.

Combining GPU sharing with Karpenter not only saves cost but, more importantly, also gives us a flexible way to schedule GPU resources for our applications within the Kubernetes cluster. When tens of applications need a GPU in different time slots, scheduling them cost-effectively in the cloud becomes essential.

Slicing can be used on Nvidia GPUs in two ways: in software (time-slicing) and in hardware (Multi-Instance GPU, MIG).

EKS supports both modes through the Nvidia GPU Operator and the Nvidia Device Plugin. MIG functionality is provided as part of the NVIDIA GPU driver for H100, A100, and A30 GPUs.
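
As a minimal sketch of how hardware slicing is selected (assuming the GPU Operator's MIG manager is deployed; the node name and the profile name below are illustrative and depend on the GPU model), MIG partitioning is requested by labeling a node with a MIG profile:

kubectl label node <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite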

Opportunities to use GPU Slicing

Here are some example workloads that can benefit from sharing GPU resources for better utilization:

  • Low-batch inference serving, which may only process one input sample on the GPU
  • High-performance computing (HPC) applications, such as simulating photon propagation, that balance computation between the CPU (to read and process inputs) and GPU (to perform computation). Some HPC applications may not achieve high throughput on the GPU portion due to bottlenecks in the CPU core performance.
  • Interactive development for ML model exploration using Jupyter notebooks
  • Spark-based data analytics applications, where some tasks, the smallest units of work, run concurrently and benefit from better GPU utilization
  • Visualization or offline rendering applications that may be bursty
  • Continuous integration/continuous delivery (CI/CD) pipelines that want to use any available GPUs for testing

Benefits of Multi-Instance GPU

Multi-Instance GPUs are typically used for GPU-intensive applications such as HPC workloads and hyperparameter tuning. They are also used for AI model training and inference servers where high performance and stronger isolation between processes are required.

  • MIG ensures that GPU resources are fully utilized, reducing idle times and improving overall efficiency.
  • MIG statically partitions a GPU into multiple isolated instances, each with its own dedicated share of resources, including streaming multiprocessors (SMs), which provides better and more predictable SM quality of service (QoS).
  • A dedicated portion of memory in each isolated instance ensures better memory QoS.
  • Static partitioning confines errors to the instance where they occur, providing fault containment and improving system stability.
  • Better data protection and isolation of malicious activities, providing better security for multi-tenant setups.

Cost optimization

Let's create a scenario for cost optimization. Suppose we have four g4dn.2xlarge instances with GPU utilization around 25%. With time-slicing we can run four workloads on a single GPU, so one instance exposing four shared GPU replicas is enough.

g4dn.2xlarge price per hour: $0.752 (prices may vary by region)
4 instances × $0.752 × 24 hours/day × 30 days = $2165 monthly
1 instance × $0.752 × 24 hours/day × 30 days = $541 monthly

So, we could reduce GPU instance costs by $1624 per month.
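
A quick sanity check of the arithmetic, using the illustrative on-demand price above:

# 4 dedicated g4dn.2xlarge instances vs 1 shared instance, 24 h/day for 30 days
echo "4 * 0.752 * 24 * 30" | bc         # 2165.76 USD/month
echo "1 * 0.752 * 24 * 30" | bc         # 541.44 USD/month
echo "(4 - 1) * 0.752 * 24 * 30" | bc   # 1624.32 USD/month saved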

Performance Implications of GPU Slicing

While GPU slicing offers advantages, it can also lead to some performance issues:

  • Lower performance for heavy tasks. Programs that need the full power of a GPU might run slower when the GPU is divided into slices.
  • Increased latency. Sharing the GPU over time can make tasks wait longer for their turn, adding delays.
  • Limited memory. Each GPU slice has less memory, which can be a problem for tasks requiring a lot of GPU memory.
  • Possible resource competition. Even with isolation, tasks using the same physical GPU might still compete for resources.
  • Inconsistent performance. The performance of GPU slices can change based on how much the entire GPU is being used.

Monitoring GPU utilization

You can use dedicated tools to monitor GPU utilization and find underutilized GPUs in your Kubernetes cluster:

  • DCGM Exporter
  • nvidia-smi
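
For example (a minimal sketch; the PromQL query assumes the DCGM Exporter metrics are scraped by Prometheus), you can inspect utilization directly on a node or query the exported metrics:

# On a GPU node: per-GPU utilization and memory usage
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv

# PromQL against DCGM Exporter metrics: average utilization per GPU
# avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)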

Configuration of Slicing

Install the Karpenter NodePool and EC2NodeClass for the GPU workload:

kubectl apply -f karpenter-soft.yaml

There is a taint isolating expensive hardware from the general workload:

    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
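
GPU workloads then need a matching toleration to land on these nodes. A standard Kubernetes toleration for the pod spec, shown as a sketch:

    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule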

Add the config map to the same namespace as the GPU operator:

kubectl create -n gpu-operator -f time-slicing-config-fine.yaml
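
A minimal sketch of what time-slicing-config-fine.yaml might contain (the data key tesla-t4 and the replica count of 4 are assumptions matching the g4dn example, which uses T4 GPUs; the repository's actual file may differ):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config-fine
    data:
      tesla-t4: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 4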

Configure the device plugin with the config map and set the default time-slicing configuration:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}'

The time-slicing configuration will only be applied to new nodes.
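
Nodes that are already running can be pointed at a named config via a node label (the node name is a placeholder; the config key must match a key in the config map, here the assumed tesla-t4):

kubectl label node <node-name> nvidia.com/device-plugin.config=tesla-t4 --overwrite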

Restart the device plugin daemonset, then confirm that the gpu-feature-discovery and nvidia-device-plugin-daemonset pods restart:

kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
kubectl get events -n gpu-operator --sort-by='.lastTimestamp'

Create the deployment with multiple replicas:

kubectl apply -f time-slicing-verification.yaml
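
A minimal sketch of what time-slicing-verification.yaml could look like (the image and the looping vectorAdd workload are assumptions based on NVIDIA's CUDA samples; the repository's file may differ):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: time-slicing-verification
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: time-slicing-verification
      template:
        metadata:
          labels:
            app: time-slicing-verification
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: vectoradd
              image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
              command: ["/bin/sh", "-c"]
              args: ["while true; do /cuda-samples/vectorAdd; sleep 5; done"]
              resources:
                limits:
                  nvidia.com/gpu: 1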

Verify that all five replicas are running:

kubectl get pods

References

GPU Sharing techniques

Nvidia GPU sharing

Improving GPU utilization

Efficient access to shared GPUs

GPU Sharing on Amazon EKS

GPU Telemetry

MIG User Guide

GPU Advanced Troubleshooting
