AI on-demand with Kubernetes (Part 1)
So, playing as I am with AI applications these days, I convinced myself (hey, it was my birthday…) that I needed to build a machine with a decent graphics card in it. An A100 is somewhat outside my price range, and even my preferred Nvidia L4s (just look at that blissfully low power requirement…) would break the budget, so I was forced (honestly!) to look at an RTX 4090 build. The raw GPU performance matters less than the 24GB of VRAM, which lets me play with larger models than my current machine can handle - and it’s probably the most cost-effective way to get that much VRAM on the market at the moment.
Having built the beast, it does seem like a bit of a shame not to use it for maybe playing the odd game, though…
So, I’d like it to be able to run AI tasks when I want it to, but at the same time be a decent gaming machine. It seems to me that this is the kind of problem I can solve with Kubernetes; if I add it to my Kubernetes cluster, but then configure AI applications to “scale to zero” the actual compute-heavy AI workloads (such as a custom coding assistant model) when I’m not making use of them, I can have the best of both worlds. AI-on-demand when I’m doing work, and a GPU all-to-myself when I want to run Steam.
So, this is a record of my journey getting this set up. Part one will document getting the basics ready - adding the machine to my Kubernetes cluster. In the next part, I’ll have worked out how to deploy an example AI workload with automatic scale-up and scale-down when idle.
Installing a CUDA Node in Kubernetes⌗
Basic configuration⌗
First, we configure our machine as we would normally for a Kubernetes cluster (this is out of the scope of this article, but essentially we need to do all the configuration you would need for any node, like turning off swap and adjusting kernel parameters for CNI to work, and of course ensuring containerd is installed), and then join the cluster:
kubeadm join cv-new.k8s.int.snowgoons.ro:6443 \
--token <...> \
--discovery-token-ca-cert-hash sha256:<...>
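For completeness, the node preparation I’m glossing over looks roughly like this - a minimal sketch of the usual kubeadm prerequisites, not an exhaustive recipe (your distribution and CNI may well need more):
# Disable swap, now and after reboots
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Kernel modules and sysctl settings most CNI plugins rely on
sudo modprobe overlay
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system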
I don’t want to let any old workload be deployed to this machine though - after all, sometimes I want to use it to play Tomb Raider (it’s a shame to waste a nice graphics card), and I don’t want my games slowing down because Kubernetes scheduled PostgreSQL or Kafka on that machine.
So, I add a taint to the new node, which restricts scheduling on it. Only workloads which have been specifically deployed with a corresponding toleration will be scheduled onto my games, err, research, machine:
kubectl taint nodes joi restriction=CUDAonly:NoSchedule
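A quick sanity check that the taint took - the node description should now list it:
kubectl describe node joi | grep Taints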
NVIDIA CUDA configuration on the node⌗
Nvidia provide an operator which will automate the deployment of appropriate GPU drivers and container toolkit on the worker nodes that have GPUs.
However, my node is not dedicated to Kubernetes. It will be used as a workstation (and gaming machine!) as well as a specialist Kubernetes node for CUDA deployments. So I prefer to be able to manage the deployment of graphics drivers etc. myself; that means installing the GPU drivers and container toolkit on the machine manually.
I covered installing the CUDA toolkit in an earlier article, so I’ll not go over that again. But installing the Container Toolkit is a new challenge - and in particular, I need to install it for containerd (as used on my Kubernetes cluster) and not Docker. To do this I’m going to follow the instructions on the NVIDIA website.
Firstly, we need to install the software; since I use Ubuntu 22.04 as my host operating system, we install from Apt:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
Then we need to configure containerd. In the past, when I did something similar to enable the container toolkit for Docker, there was some assembly required - but mercifully NVIDIA have made the configuration for containerd trivially easy:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
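Under the hood, that command just registers an nvidia runtime in /etc/containerd/config.toml, pointing containerd at the nvidia-container-runtime binary the toolkit installed. If you’re curious (or need to debug it later), you can peek at what was added - a rough check, assuming the default config location:
# Show the nvidia runtime section that nvidia-ctk added to the containerd config
grep -A4 'runtimes.nvidia' /etc/containerd/config.toml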
To see if it’s working, let’s try running a CUDA container directly with containerd. We should be able to execute the nvidia-smi command within an appropriate container and see our graphics card:
sudo ctr images pull docker.io/nvidia/cuda:12.5.0-runtime-ubuntu22.04
sudo ctr run --gpus 0 --rm docker.io/nvidia/cuda:12.5.0-runtime-ubuntu22.04 nvidia-smi nvidia-smi
Hopefully, you will see an output like this:
Sun Jun 30 08:32:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                          |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off | 00000000:01:00.0  On |                  Off |
|  0%   34C    P8                22W / 450W |    478MiB / 24564MiB |     12%      Default |
|                                          |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
It’s interesting to note that the container runtime properly segregates the GPU between the host and containers - if I ran that same command directly on the host, I’d see a couple of processes making use of the GPU, but they are hidden from the container environment (hence the empty process list above).
Deploying the NVIDIA GPU operator⌗
The NVIDIA GPU operator gives Kubernetes the knowledge of GPUs as a special resource that containers can require to run. We install this operator so that we can then add resource limits like this to our deployment manifests and have Kubernetes automatically schedule the pod on an appropriate node:
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
Installing the operator is as simple as adding the Helm repo and then installing like so:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n nvidia-gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
Installing this Helm chart will deploy a feature-discovery worker to every node in your cluster. This runs on each node and attempts to determine whether that node has a GPU available in the container runtime; if it does, it adds the label feature.node.kubernetes.io/pci-10de.present=true to that node.
BUT, in our environment this won’t work. Why not? Because we have a taint on our GPU node; this will prevent the discovery worker from being deployed on the one node where we really need it.
So we need to edit the values.yaml for our deployment, add tolerations to the daemonsets, and then upgrade the release (or, if you read this far without making the mistake I did, use an appropriate configuration from the beginning!):
# values.yaml for deploying on tainted cluster
driver:
  enabled: false
toolkit:
  enabled: false
daemonsets:
  tolerations:
    - key: restriction
      operator: "Equal"
      value: "CUDAonly"
      effect: "NoSchedule"
operator:
  tolerations:
    - key: restriction
      operator: "Equal"
      value: "CUDAonly"
      effect: "NoSchedule"
node-feature-discovery:
  worker:
    tolerations:
      - key: restriction
        operator: "Equal"
        value: "CUDAonly"
        effect: "NoSchedule"
So, our first test to see if it is working is to have a look-see for that label and other attributes on our CUDA-enabled node. From kubectl describe node joi¹:
Name: joi
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
[...snipped for brevity...]
feature.node.kubernetes.io/kernel-version.major=6
feature.node.kubernetes.io/kernel-version.minor=5
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-10ec.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/hostname=joi
kubernetes.io/os=linux
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.present=true
It’s there! Along with a selection of other labels added by the NVIDIA operator. If we carry on down, we can also see that the scheduler is now aware of the resource type nvidia.com/gpu:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests         Limits
  --------           --------         ------
  cpu                650m (2%)        1 (3%)
  memory             471966464 (0%)   1101219328 (1%)
  ephemeral-storage  0 (0%)           0 (0%)
  hugepages-1Gi      0 (0%)           0 (0%)
  hugepages-2Mi      0 (0%)           0 (0%)
  nvidia.com/gpu     0                0
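Another quick way to confirm the device plugin is advertising the card is to query the node’s allocatable resources directly - a one-liner along these lines, which should print 1 for this single-GPU node:
kubectl get node joi -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"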
Testing it works - deploying a simple CUDA workload⌗
So, now we have it working (hopefully), we can try deploying the example Jupyter Notebook deployment - my test-jupyter.yaml looks just like the one in NVIDIA’s documentation, but with the added toleration for my CUDA node (and I use a LoadBalancer for my service):
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 8888
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  runtimeClassName: nvidia
  securityContext:
    fsGroup: 0
  tolerations:
  - key: restriction
    operator: "Equal"
    value: "CUDAonly"
    effect: "NoSchedule"
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888
      name: notebook
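Deploying it is just a kubectl apply away, and the Jupyter login token, when you need it, can be fished out of the pod logs:
kubectl apply -f test-jupyter.yaml

# Jupyter prints its access token in the container logs once it's up
kubectl logs tf-notebook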
kubectl get services tells me the IP it allocated…
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   10.102.248.172   192.168.0.196   80:30392/TCP   10m
…and, with bated breath, pointing a web browser at it: It works!
That’s good enough for a Sunday afternoon’s work. In the next part, I’ll work out how to take an application like Stable Diffusion/ComfyUI and deploy it so that it can scale-up and scale-down (to zero) on demand when I want to use it.
¹ In case it wasn’t already obvious, my AI node is named after the AI character in Blade Runner 2049.