AI on-demand with Kubernetes (Part 1)
So, playing as I am with AI applications these days, I convinced myself (hey, it was my birthday…) that I needed to build a machine with a decent graphics card in it. An A100 is somewhat outside my price range, and even my preferred Nvidia L4s (just look at that blissfully low power requirement…) would break the budget, so I was forced (honestly!) to look at an RTX 4090 build. The raw GPU performance matters less than the 24GB of VRAM, which lets me play with larger models than my current machine can handle - and it’s probably the most cost-effective way to get that much VRAM on the market at the moment.
Having built the beast, it does seem like a bit of a shame not to use it for maybe playing the odd game, though…
So, I’d like it to be able to run AI tasks when I want it to, but at the same time be a decent gaming machine. It seems to me that this is the kind of problem I can solve with Kubernetes; if I add it to my Kubernetes cluster, but then configure AI applications to “scale to zero” the actual compute-heavy AI workloads (such as a custom coding assistant model) when I’m not making use of them, I can have the best of both worlds. AI-on-demand when I’m doing work, and a GPU all-to-myself when I want to run Steam.
So, this is a record of my journey getting this set up. Part one will document getting the basics ready - adding the machine to my Kubernetes cluster. In the next part, I’ll have worked out how to deploy an example AI workload with automatic scale-up and scale-down when idle.
Installing a CUDA Node in Kubernetes⌗
Basic configuration⌗
First, we configure our machine as we would normally for a Kubernetes cluster (this is out of the scope of this article, but essentially we need to do all the configuration you would need for any node, like turning off swap and adjusting kernel parameters for CNI to work, and of course ensuring containerd is installed), and then join the cluster:
kubeadm join cv-new.k8s.int.snowgoons.ro:6443 \
--token <...> \
--discovery-token-ca-cert-hash sha256:<...>
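For completeness, the node preparation I’m glossing over looks roughly like this - a minimal sketch of the usual kubeadm prerequisites, not an exhaustive recipe (your distribution and CNI may well need more):
# Disable swap, now and after reboots
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Kernel modules and sysctl settings most CNI plugins rely on
sudo modprobe overlay
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system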
I don’t want to let any old workload be deployed to this machine though - after all, sometimes I want to use it to play Tomb Raider (it’s a shame to waste a nice graphics card), and I don’t want my games slowing down because Kubernetes scheduled PostgreSQL or Kafka on that machine.
So, I add a taint to the new node, which restricts scheduling on it. Only workloads which have been specifically deployed with a corresponding toleration will be scheduled onto my games, err, research, machine:
kubectl taint nodes joi restriction=CUDAonly:NoSchedule
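A quick sanity check that the taint took - the node description should now list it:
kubectl describe node joi | grep Taints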
NVIDIA CUDA configuration on the node⌗
Nvidia provide an operator which will automate the deployment of appropriate GPU drivers and container toolkit on the worker nodes that have GPUs.
However, my node is not dedicated to Kubernetes. It will be used as a workstation (and gaming machine!) as well as a specialist Kubernetes node for CUDA deployments. So I prefer to be able to manage the deployment of graphics drivers etc. myself; that means installing the GPU drivers and container toolkit on the machine manually.
I covered installing the CUDA toolkit in an earlier article, so I’ll not go over that again. But installing the Container Toolkit is a new challenge - and in particular, I need to install it for containerd (as used on my Kubernetes cluster) and not Docker. To do this I’m going to follow the instructions on the NVIDIA website.
Firstly, we need to install the software; since I use Ubuntu 22.04 as my host operating system, we install from Apt:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
Then we need to configure containerd. In the past, when I did something similar to enable the container toolkit for Docker, there was some assembly required - but mercifully NVIDIA have made the configuration for containerd trivially easy:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
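Under the hood, that command just registers an nvidia runtime in /etc/containerd/config.toml, pointing containerd at the nvidia-container-runtime binary the toolkit installed. If you’re curious (or need to debug it later), you can peek at what was added - a rough check, assuming the default config location:
# Show the nvidia runtime section that nvidia-ctk added to the containerd config
grep -A4 'runtimes.nvidia' /etc/containerd/config.toml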
To see if it’s working, let’s try running a CUDA container directly with containerd. We should be able to execute the nvidia-smi command within an appropriate container and see our graphics card:
sudo ctr images pull docker.io/nvidia/cuda:12.5.0-runtime-ubuntu22.04
sudo ctr run --gpus 0 --rm docker.io/nvidia/cuda:12.5.0-runtime-ubuntu22.04 nvidia-smi nvidia-smi
Hopefully, you will see an output like this:
Sun Jun 30 08:32:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                          |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off | 00000000:01:00.0  On |                  Off |
|  0%   34C    P8                22W / 450W |    478MiB / 24564MiB |     12%      Default |
|                                          |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
It’s interesting to note that the container runtime properly segregates the GPU between the host and containers - if I ran that same command directly on the host, I’d see a couple of processes making use of the GPU, but they are hidden from the container environment (hence the empty process list above).
Deploying the NVIDIA GPU operator⌗
The NVIDIA GPU operator gives Kubernetes the knowledge of GPUs as a special resource that containers can require to run. We install this operator so that we can then add resource limits like this to our deployment manifests and have Kubernetes automatically schedule the pod on an appropriate node:
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
Installing the operator is as simple as adding the Helm repo and then installing like so:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n nvidia-gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
Installing this Helm chart will deploy a feature-discovery worker to every node in your cluster. This runs on each node and attempts to determine whether that node has a GPU available in the container runtime; if it does, it adds the label feature.node.kubernetes.io/pci-10de.present=true to that node.
BUT, in our environment this won’t work. Why not? Because we have a taint on our GPU node; this will prevent the discovery worker from being deployed on the one node where we really need it.
So we need to edit the values.yaml for our deployment, add tolerations to the daemonsets, and then upgrade the release (or, if you read this far without making the mistake I did, use an appropriate configuration from the beginning!):
# values.yaml for deploying on tainted cluster
driver:
  enabled: false
toolkit:
  enabled: false
daemonsets:
  tolerations:
    - key: restriction
      operator: "Equal"
      value: "CUDAonly"
      effect: "NoSchedule"
operator:
  tolerations:
    - key: restriction
      operator: "Equal"
      value: "CUDAonly"
      effect: "NoSchedule"
node-feature-discovery:
  worker:
    tolerations:
      - key: restriction
        operator: "Equal"
        value: "CUDAonly"
        effect: "NoSchedule"
So, our first test to see if it is working is to have a look-see for that label and other attributes on our CUDA-enabled node. From kubectl describe node joi¹:
Name: joi
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
[...snipped for brevity...]
feature.node.kubernetes.io/kernel-version.major=6
feature.node.kubernetes.io/kernel-version.minor=5
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-10ec.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/hostname=joi
kubernetes.io/os=linux
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.present=true
It’s there! Along with a selection of other labels added by the NVIDIA operator. If we carry on down, we can also see that the scheduler is now aware of the resource type nvidia.com/gpu:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests         Limits
  --------           --------         ------
  cpu                650m (2%)        1 (3%)
  memory             471966464 (0%)   1101219328 (1%)
  ephemeral-storage  0 (0%)           0 (0%)
  hugepages-1Gi      0 (0%)           0 (0%)
  hugepages-2Mi      0 (0%)           0 (0%)
  nvidia.com/gpu     0                0
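Another quick way to confirm the device plugin is advertising the card is to query the node’s allocatable resources directly - a one-liner along these lines, which should print 1 for this single-GPU node:
kubectl get node joi -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"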
Testing it works - deploying a simple CUDA workload⌗
So, now we have it working (hopefully), we can try deploying the example Jupyter Notebook deployment - my test-jupyter.yaml looks just like the one in NVIDIA’s documentation, but with the added toleration for my CUDA node (and I use a LoadBalancer for my service):
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: LoadBalancer
  ports:
  - port: 80
    name: http
    targetPort: 8888
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  runtimeClassName: nvidia
  securityContext:
    fsGroup: 0
  tolerations:
  - key: restriction
    operator: "Equal"
    value: "CUDAonly"
    effect: "NoSchedule"
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888
      name: notebook
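Deploying it is just a kubectl apply away, and the Jupyter login token, when you need it, can be fished out of the pod logs:
kubectl apply -f test-jupyter.yaml

# Jupyter prints its access token in the container logs once it's up
kubectl logs tf-notebook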
kubectl get services tells me the IP it allocated…
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   10.102.248.172   192.168.0.196   80:30392/TCP   10m
…and, with bated breath, pointing a web browser at it: It works!
That’s good enough for a Sunday afternoon’s work. In the next part, I’ll work out how to take an application like Stable Diffusion/ComfyUI and deploy it so that it can scale-up and scale-down (to zero) on demand when I want to use it.
¹ In case it wasn’t already obvious, my AI node is named after the AI character in Blade Runner 2049.