Deploy Tanzu Kubernetes Cluster with vGPU

by Mohamed Imthiyaz

From healthcare to finance, Artificial Intelligence (AI) and Machine Learning (ML) workloads have become crucial drivers of innovation. These data-intensive tasks demand immense computational power, which makes GPU acceleration essential. VMware’s Tanzu Kubernetes Grid Service (TKGS) with vGPU support lets AI/ML workloads running on Kubernetes use virtualized NVIDIA GPUs.

In this blog post, we will see how to deploy a TKGS cluster with vGPU support, ready to run AI/ML workloads.


Before we dive into the deployment process, ensure you have the following prerequisites in place:

  1. TKGS Up and Running: Tanzu Kubernetes Grid Service must be operational in your vSphere environment. (Check in vCenter -> Workload Management.)
  2. GPU Installed on Host: Your ESXi hosts should have compatible NVIDIA GPUs installed and recognized by the system.
  3. NVIDIA VIB Installed on ESXi: Verify that the NVIDIA host driver is installed on each ESXi host (esxcli software vib list | grep NVIDIA).

For detailed instructions on installing the NVIDIA VIB on ESXi hosts, refer to the official VMware documentation.
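As a quick sanity check for prerequisites 2 and 3, here is a minimal sketch that filters a VIB listing for NVIDIA entries. The helper name has_nvidia_vib is hypothetical, and matching on NVD as well as NVIDIA is an assumption, since the VIB name prefix may differ between driver releases.

```shell
#!/bin/sh
# has_nvidia_vib: hypothetical helper, not part of esxcli.
# Reads a VIB listing on stdin and succeeds if any line mentions
# NVIDIA. Matching the "NVD" prefix too is an assumption about how
# some driver releases name their VIBs.
has_nvidia_vib() {
    grep -Eiq 'nvidia|nvd'
}

# Intended use in an ESXi host shell:
#   esxcli software vib list | has_nvidia_vib && echo "NVIDIA VIB found"
#   nvidia-smi   # should list the physical GPUs once the host driver is loaded
```

If the check fails on a host where the driver is installed, inspect the raw output of esxcli software vib list and adjust the pattern to the VIB name you see there.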

My Environment

vCenter: 8.0.1
ESXi: 8.0.1


Step 1: Create a Custom VM Class with a vGPU Profile

Log in to vCenter -> Workload Management -> Services -> Click Manage under “VM Service”

Click “Create VM Class”


Note that you must select the PCI Devices option, which sets the memory reservation of the VM Class to 100%.


Select your PCI Device


Verify that the new custom VM Class is available in the list of VM Classes.


Step 2: Create a Namespace and Add the VM Class and Content Library


Add the custom VM Class we created in Step 1 and your TKGS Content Library to the namespace.


Step 3: Install the kubectl vSphere Plugin and Log In to the Supervisor Cluster

curl -LOk https://CONTROL-PLANE-NODE-ADDRESS/wcp/plugin/linux-amd64/vsphere-plugin.zip
unzip vsphere-plugin.zip
mv -v bin/* /usr/local/bin/
kubectl vsphere login --server=IP-ADDRESS --vsphere-username USERNAME --insecure-skip-tls-verify

Step 4: Prepare the YAML to Deploy the vGPU Cluster

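Switch to the context of the vSphere Namespace we created in Step 2, then list the VM Classes bound to the namespace and the available Tanzu Kubernetes releases (TKR):

kubectl config get-contexts
kubectl config use-context vGPU-NAMESPACE
kubectl get virtualmachineclassbindings
kubectl get tkr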
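Before writing the manifest, it can help to confirm that the custom VM Class is actually bound to the namespace. The sketch below is one way to do that; list_names and name_in_list are hypothetical helper names, and vgpu-vm-class is the class name used later in this post.

```shell
#!/bin/sh
# list_names KIND: print one resource name per line for the given kind
# in the current context (hypothetical helper around kubectl).
list_names() {
    kubectl get "$1" -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# name_in_list NAME: succeed only if NAME matches a whole line on stdin.
name_in_list() {
    grep -Fxq -- "$1"
}

# Intended use, after switching to the vSphere Namespace context:
#   list_names virtualmachineclassbindings | name_in_list vgpu-vm-class \
#       && echo "vgpu-vm-class is bound to this namespace"
```

Matching whole lines with grep -Fx avoids a false positive when another class name merely contains the one you are looking for.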

Create a YAML file (vgpu-cluster.yaml, the name used in Step 5) to deploy the cluster

apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  #cluster name
  name: tkgs-cluster-gpu-a10
  #target vsphere namespace
  namespace: vgpu-deploy
spec:
  topology:
    controlPlane:
      replicas: 1
      #storage class for control plane nodes
      #use `kubectl describe storageclasses`
      #to get available storage classes
      storageClass: tkgs-sp
      vmClass: guaranteed-medium
      #TKR NAME for Ubuntu ova supporting GPU
      tkr:
        reference:
          name: 1.22.9---vmware.1-tkg.1
    nodePools:
    - name: nodepool-a10-primary
      replicas: 1
      storageClass: tkgs-sp
      #custom VM class for vGPU
      vmClass: vgpu-vm-class
      #TKR NAME for Ubuntu ova supporting GPU
      tkr:
        reference:
          name: 1.22.9---vmware.1-tkg.1
    - name: nodepool-a10-secondary
      replicas: 1
      vmClass: vgpu-vm-class
      storageClass: tkgs-sp
      #TKR NAME for Ubuntu ova supporting GPU
      tkr:
        reference:
          name: 1.22.9---vmware.1-tkg.1
  settings:
    storage:
      defaultClass: tkgs-sp
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: [""]
      pods:
        cidrBlocks: [""]
      serviceDomain: managedcluster.local
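The manifest above leaves both cidrBlocks entries as empty strings, presumably as placeholders for your own service and pod ranges. Here is a small sketch that flags a file still containing such a placeholder; the helper name has_empty_cidr is hypothetical.

```shell
#!/bin/sh
# has_empty_cidr FILE: succeed if FILE still contains a line like
#   cidrBlocks: [""]
# i.e. a CIDR list whose only entry is an empty string.
has_empty_cidr() {
    grep -Eq 'cidrBlocks:[[:space:]]*\[[[:space:]]*""[[:space:]]*]' "$1"
}

# Intended use before applying the manifest:
#   has_empty_cidr vgpu-cluster.yaml \
#       && echo "Fill in the service and pod CIDR ranges first"
```

Once the placeholders are filled in, kubectl apply --dry-run=server -f vgpu-cluster.yaml asks the API server to validate the manifest without creating the cluster.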

Step 5: Deploy the cluster

kubectl apply -f vgpu-cluster.yaml
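Provisioning takes a while, so it can be handy to poll until the cluster reports ready. The following is a minimal sketch: retry and tkc_ready are hypothetical helpers, and tkc_ready assumes the TanzuKubernetesCluster resource exposes a status condition of type Ready, which you should confirm with kubectl describe on your own cluster.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD [ARGS...]: run CMD until it succeeds,
# at most ATTEMPTS times, sleeping DELAY seconds between attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=0
    while [ "$n" -lt "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# tkc_ready NAME NAMESPACE: succeed when the cluster's Ready condition
# is "True". Assumption: the resource has a status condition of type Ready.
tkc_ready() {
    status=$(kubectl get tanzukubernetescluster "$1" -n "$2" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ]
}

# Intended use (names taken from the manifest in Step 4):
#   retry 60 30 tkc_ready tkgs-cluster-gpu-a10 vgpu-deploy \
#       && echo "cluster is ready"
```

With 60 attempts and a 30 second delay, this waits up to roughly 30 minutes. You can also simply run kubectl get tanzukubernetescluster -n vgpu-deploy by hand and watch the READY column.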

The Tanzu Kubernetes Cluster (TKC) with vGPU has been deployed. Hope this helps!

