AI on-demand with Kubernetes (Part 2)
In part 1 I succeeded in setting up a CUDA-capable node on my Kubernetes cluster and, thanks to the NVIDIA GPU operator, made Kubernetes capable of scheduling suitable workloads onto it.
In this part, I’ll take a typical AI workload - in this case the generative image AI application ComfyUI - deploy it in Kubernetes, and then work out how to configure it to scale down to zero when I’m not using it (leaving my computer free to play Shadow of the Tomb Raider in peace), and to scale back up again on-demand when I want to use Comfy.
Building and Deploying an AI app⌗
If you followed the first half of this story, you’ll know that this is pretty painless - essentially I just need to make sure that my Kubernetes deployment requests the right runtime and GPU resources, and away we go. For completeness though, I’ll include all the details here.
Putting ComfyUI in a Docker container⌗
Firstly, we’ll need a Docker image capable of running Comfy. That basically means a container with a few gigabytes of Python libraries installed, and the ComfyUI application itself.
As a base image, I’ll use one of NVIDIA’s base images, which have the CUDA runtime libraries already baked in; the key thing to note here is that you’ll want an image that matches the version of the CUDA drivers installed on your host nodes. You can check that with nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 On | Off |
| 0% 36C P8 30W / 450W | 445MiB / 24564MiB | 11% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
I have CUDA version 12.2 on my host, so I’ll use a corresponding image: nvidia/cuda:12.2.0-base-ubuntu22.04.
The Dockerfile that I created to then install ComfyUI is pretty unremarkable. Note that I do also include some other custom nodes that I’ve found useful at one time or another:
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
RUN apt-get update
# Satisfy tzdata whingeing from APT
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt-get install -y tzdata
# Install a current version of Python
RUN apt-get -y install software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get update
RUN apt-get install -y python3.11
RUN apt-get install -y python3.11-dev
# And make sure it's the one we want
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 10
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
RUN update-alternatives --auto python3
# PIP
RUN apt-get install -y python3-pip
RUN pip3 install --upgrade pip
# GIT
RUN apt-get install -y git
# Now, start installing ComfyUI
WORKDIR /usr/local
RUN git clone https://github.com/comfyanonymous/ComfyUI.git
# Some custom nodes that I find useful
WORKDIR /usr/local/ComfyUI/custom_nodes
RUN git clone https://github.com/Extraltodeus/ComfyUI-AutomaticCFG
RUN git clone https://github.com/Clybius/ComfyUI-Extra-samplers
RUN git clone https://github.com/flowtyone/ComfyUI-Flowty-LDSR.git
RUN git clone https://github.com/ltdrdata/ComfyUI-Manager
RUN git clone https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes.git
RUN git clone https://github.com/city96/ComfyUI_ExtraModels
RUN git clone https://github.com/ssitu/ComfyUI_UltimateSDUpscale --recursive
# Install all the package dependencies
WORKDIR /usr/local/ComfyUI
RUN pip3 install --default-timeout=1000 --no-cache-dir torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu122
RUN find . -name requirements.txt -exec pip3 --no-input install --default-timeout=1000 --no-cache-dir -r {} \;
COPY comfyui.sh /usr/local/bin/comfyui
ENTRYPOINT [ "/usr/local/bin/comfyui" ]
For completeness, the comfyui.sh entrypoint script just looks like this:
#!/bin/sh
cd /usr/local/ComfyUI
# exec so ComfyUI runs as PID 1 and receives signals; "$@" preserves argument quoting
exec /usr/bin/python3 main.py "$@"
The result of this is not a small Docker image (in fact, it’s around 6GB…), but I’m pushing it to a private registry rather than uploading over the Internet, so I’ve not really made any effort to optimise it. It works well enough for our purposes.
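For reference, building and pushing the image is just the standard routine - the tag here simply matches the image name used in the Kubernetes manifests below, so substitute your own registry:
docker build -t registry.svc.snowgoons.ro/snowgoons/comfyui:2024-07-02 .
docker push registry.svc.snowgoons.ro/snowgoons/comfyui:2024-07-02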
It’s worth noting that this image contains no model data/tensor files at this point; I’ll deal with that in the Kubernetes deployment. At the most basic, we just need a Pod to run our ComfyUI Docker image, and a Service of type LoadBalancer to give me access to it:
---
apiVersion: v1
kind: Service
metadata:
  name: comfyui
  labels:
    app: comfyui
spec:
  type: LoadBalancer
  ports:
    - port: 80
      name: http
      targetPort: 8188
  selector:
    app: comfyui
---
apiVersion: v1
kind: Pod
metadata:
  name: comfyui
  labels:
    app: comfyui
spec:
  runtimeClassName: nvidia
  tolerations:
    - key: restriction
      operator: "Equal"
      value: "CUDAonly"
      effect: "NoSchedule"
  containers:
    - name: comfyui
      image: registry.svc.snowgoons.ro/snowgoons/comfyui:2024-07-02
      args: [ "--listen", "0.0.0.0" ]
      imagePullPolicy: Always
      resources:
        limits:
          nvidia.com/gpu: 1
      ports:
        - containerPort: 8188
          name: comfyui
      volumeMounts:
        - mountPath: /usr/local/ComfyUI/models
          name: model-folder
  volumes:
    - name: model-folder
      hostPath:
        path: /usr/local/ai/models/comfyui
As before, we include the runtimeClassName and nvidia.com/gpu attributes that tell Kubernetes we need GPU access, as well as a toleration to let it run on my CUDA-capable machine. One kubectl apply later, and we have ComfyUI running in Kubernetes.
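Concretely, that looks something like this (the manifest filename is my own choice; the LoadBalancer address you get back will depend on your cluster):
kubectl apply -f comfyui.yaml -n ai-tests
# wait for the pod to show Running, then note the Service's EXTERNAL-IP
kubectl get pods,svc -l app=comfyui -n ai-tests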
Scale to Zero⌗
So, now for the fun part. I want my application to be available whenever I want it - I want to just point my browser at the URL, and have it work immediately (or at least, as close to immediately as possible.) Equally, when I walk away and decide to do something else, I’d like those resources to be cleaned up for me so I can use them for more important things. Like Lara Croft.
Typically, scaling up and down would be handled using a HorizontalPodAutoscaler - with just one small problem: the standard HPA can’t scale down to zero.
The canonical solution to this is to use something like KNative, and it’s a solution that works extremely well for event-based workloads; KNative can monitor an event bus, scale your workload down to zero when there is nothing in the queue, and scale it up again when events start appearing. It works very well in practice as well as in theory, and in my day job we have production KNative workloads managed exactly like this.
Unfortunately though, in this case my services are not event based, they are web based HTTP applications - and HTTP is very much connection and request oriented, not event based. How to square the circle?
The obvious answer is to develop some kind of HTTP proxy that could sit in front of our applications and generate suitable events; when a request comes in, it could effectively ‘put the request on hold’ if there are no backends available to process it, and generate a suitable event to cause the service to scale up and then handle the request.
This seems a promising approach, but before I set about developing such a thing, I wanted to see if there was something else out there that could already handle it.
Enter KEDA⌗
Well, what would you know, apparently there is - KEDA. Like KNative, KEDA is fundamentally an event-driven autoscaling tool, but it seems there is a plugin - KEDA HTTP - designed to do exactly what I need. So, let’s see if it works…
Installing KEDA & KEDA HTTP⌗
I’ll install the KEDA operator and associated components in the keda-scaler namespace from the provided Helm chart, like so:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install --create-namespace -n keda-scaler keda kedacore/keda
OK, so far so good, now let’s try to install the HTTP add-on:
helm install --create-namespace -n keda-scaler http-add-on kedacore/keda-add-ons-http
A quick look at the pods running in our keda-scaler namespace suggests things are going OK so far:
> kubectl get pods
NAME READY STATUS RESTARTS AGE
keda-add-ons-http-controller-manager-7b4b8bdfc7-ddv9w 2/2 Running 0 41s
keda-add-ons-http-external-scaler-54d5c986fb-cp46g 1/1 Running 0 41s
keda-add-ons-http-external-scaler-54d5c986fb-cqmkd 1/1 Running 0 41s
keda-add-ons-http-external-scaler-54d5c986fb-plb7t 1/1 Running 0 41s
keda-add-ons-http-interceptor-6cd8f677bb-tjxpp 1/1 Running 0 24s
keda-add-ons-http-interceptor-6cd8f677bb-zrg9v 1/1 Running 0 24s
keda-add-ons-http-interceptor-6cd8f677bb-zwkqp 1/1 Running 0 41s
keda-admission-webhooks-554fc8d77f-mx9d2 1/1 Running 0 5m39s
keda-operator-dd878ddf6-27t7v 1/1 Running 1 (5m20s ago) 5m39s
keda-operator-metrics-apiserver-968bc7cd4-k4gkf 1/1 Running 0 5m39s
OK! So let’s see if we can get it working. We need to create an HTTPScaledObject in our ComfyUI deployment’s namespace.
Configuring KEDA HTTPScaledObject⌗
Note that the specification of the HTTPScaledObject appears to have changed somewhat since the announcement linked above; the current version, which I will use as the basis for my efforts, is 0.8.0, documented here.
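A handy way to check which fields your installed version of the CRD actually supports (assuming, as with recent releases, that the CRD publishes an OpenAPI schema) is kubectl explain:
kubectl explain httpscaledobject.spec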
Firstly, and unsurprisingly, the KEDA autoscaler doesn’t work directly on Pods, but rather on Deployments. So I need to convert my simple ComfyUI Pod into a Deployment; let’s do that now:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: comfyui
  labels:
    app: comfyui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: comfyui
  template:
    metadata:
      labels:
        app: comfyui
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: restriction
          operator: "Equal"
          value: "CUDAonly"
          effect: "NoSchedule"
      containers:
        - name: comfyui
          image: registry.svc.snowgoons.ro/snowgoons/comfyui:2024-07-02
          args: [ "--listen", "0.0.0.0" ]
          imagePullPolicy: Always
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8188
              name: comfyui
          volumeMounts:
            - mountPath: /usr/local/ComfyUI/models
              name: model-folder
      volumes:
        - name: model-folder
          hostPath:
            path: /usr/local/ai/models/comfyui
Now we need to craft a scaler configuration:
kind: HTTPScaledObject
apiVersion: http.keda.sh/v1alpha1
metadata:
  name: comfyui
spec:
  scaleTargetRef:
    name: comfyui
    kind: Deployment
    apiVersion: apps/v1
    service: comfyui
    port: 80
  replicas:
    min: 0
    max: 1
  scaledownPeriod: 60
  scalingMetric:
    requestRate:
      window: 1m
      targetValue: 1
      granularity: 1s
What are the important things to note here?
Firstly, the scaleTargetRef identifies two things: the Deployment we plan to scale, and also the Service which we should be intercepting. The port number specified is the port of the Service, not the container port exposed by the backend pods.
Secondly, we are specifying zero as our minimum number of replicas. And only 1 as the maximum. So essentially our ComfyUI will either be ‘on’ or ‘off’.
What determines whether or not our deployment is scaled up are the metrics that KEDA tracks - we’re going to use the HTTP request rate here. In this case, I’m saying that if there is at least 1 request per minute, keep the service alive; otherwise, it can be scaled down to zero.
OK, so let’s deploy:
> kubectl apply -f scaler.yaml
httpscaledobject.http.keda.sh/comfyui created
That seemed easy. I wonder what happened?
> kubectl get pods
No resources found in ai-tests namespace.
What happened to my comfyui pod? I’m hoping this means it scaled down to zero… Let’s have a look at the deployment with kubectl describe deployment comfyui:
Name: comfyui
Namespace: ai-tests
CreationTimestamp: Wed, 03 Jul 2024 18:11:54 +0300
Labels: app=comfyui
Annotations: deployment.kubernetes.io/revision: 1
Selector: app=comfyui
Replicas: 0 desired | 0 updated | 0 total | 0 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=comfyui
Containers:
comfyui:
Image: registry.svc.snowgoons.ro/snowgoons/comfyui:2024-07-02
Port: 8188/TCP
Host Port: 0/TCP
Args:
--listen
0.0.0.0
Limits:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/usr/local/ComfyUI/models from model-folder (rw)
Volumes:
model-folder:
Type: HostPath (bare host directory volume)
Path: /usr/local/ai/models/comfyui
HostPathType:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: comfyui-db75858f6 (0/0 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 8m51s deployment-controller Scaled up replica set comfyui-db75858f6 to 1
Normal ScalingReplicaSet 5m51s deployment-controller Scaled down replica set comfyui-db75858f6 to 0 from 1
Look at that! That last event is the giveaway: it worked!
OK, so now if I point my browser at my service as before, it should spin up a new instance right?
Wrong. It doesn’t work. Why not? And you may also have noticed some errors like there isn't any valid interceptor endpoint popping up in your keda-add-ons-http-external-scaler pods as well, if you’re the type that actually checks the logs. What’s that all about?
Well, it seems that in fact to intercept our requests, we need to go through the keda-add-ons-http-interceptor-proxy service that was deployed as part of the KEDA HTTP add-on Helm chart.
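If you want to double-check the proxy Service’s name in your own install (it can vary with the Helm release name), a quick list of the Services in the keda-scaler namespace will show it:
kubectl get svc -n keda-scaler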
The easiest way to do this is probably to set up an Ingress that will point to it. Let’s do that…
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: comfyui-ingress
  namespace: keda-scaler
spec:
  ingressClassName: nginx
  rules:
    - host: "comfyui.svc.snowgoons.ro"
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: keda-add-ons-http-interceptor-proxy
                port:
                  number: 8080
Note that the Ingress needs to be created in the keda-scaler namespace, not our target application’s namespace, so that it can route to the proxy.
Now that we’ve done that, it’s clear that the KEDA proxy also needs some way to work out which backend it will route to. So we need to specify some rules, either path or host based, which tell it. We do that in the HTTPScaledObject specification, by adding hosts: or pathPrefixes: entries; let’s update it:
kind: HTTPScaledObject
apiVersion: http.keda.sh/v1alpha1
metadata:
  name: comfyui
spec:
  hosts:
    - comfyui.svc.snowgoons.ro
  pathPrefixes:
    - /
  scaleTargetRef:
    name: comfyui
    kind: Deployment
    apiVersion: apps/v1
    service: comfyui
    port: 80
  replicas:
    min: 0
    max: 1
  scaledownPeriod: 60
  scalingMetric:
    requestRate:
      window: 1m
      targetValue: 1
      granularity: 1s
You know what to do. kubectl apply, and then let’s point our browser at our ingress address and see what happens…
Which is: it works! After a brief pause, we got our ComfyUI back!
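You can watch the cold start from the command line too; this is just a sketch - the hostname is the one from the Ingress rule above, and the length of the pause will depend on how quickly the pod and model data load:
# in one terminal, watch the Deployment scale up from zero
kubectl get deployment comfyui -n ai-tests -w
# in another, fire a request at the Ingress and time how long the cold start takes
time curl -s -o /dev/null -w "%{http_code}\n" http://comfyui.svc.snowgoons.ro/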
Let’s see the pods, to make sure we’re not imagining it:
> kgp
NAME READY STATUS RESTARTS AGE
comfyui-db75858f6-sk9g2 1/1 Running 0 6s
Oh my! And kubectl describe deployment comfyui?
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 56m deployment-controller Scaled up replica set comfyui-db75858f6 to 1
Normal ScalingReplicaSet 53m deployment-controller Scaled down replica set comfyui-db75858f6 to 0 from 1
Normal ScalingReplicaSet 2m2s deployment-controller Scaled up replica set comfyui-db75858f6 to 1 from 0
⚠️ You may be expecting a choir of angels, at this point.
But not quite. Things are not entirely perfect, because it turns out the KEDA HTTP add-on does not support WebSocket connections - and, well, that’s a problem for ComfyUI.
So, for basic, non-WebSocket HTTP apps, we’re basically there. For my intended use-case, which isn’t ComfyUI but rather LLM chatbot services, this is actually good enough.
And for the WebSockets case? Well, actually there is hope on that front as well; there is an open pull request which fixes the problem in KEDA: https://github.com/kedacore/http-add-on/pull/835…
Postscript: Working around the KEDA WebSockets problem⌗
Arrgh. It’s 8.30 in the evening, and I really should be making something to eat… But I just can’t leave it there. It’s irritating to be 95% of the way there, but not quite…
BUT; I thought of a workaround. It’s pretty unusual for a website to use lots of WebSocket connections, right? Usually there will be one, maybe two, and the rest of the content on the page will be delivered by boring old HTTP. What if we could route the plain-ol’-HTTP connections through KEDA, but divert the WebSocket ones so they go directly to the backend - maybe that could fix it?
Let’s try. Using the inspector in my browser tells me that the WS URL that ComfyUI is trying to access is /ws. We should be able to make an exception for that in our Ingress config so that it skips KEDA and goes direct to the service:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: comfyui-ingress
  namespace: keda-scaler
spec:
  ingressClassName: nginx
  rules:
    - host: "comfyui.svc.snowgoons.ro"
      http:
        paths:
          - pathType: Exact
            path: "/ws"
            backend:
              service:
                name: comfyui-bypass-interceptor-proxy
                port:
                  number: 8080
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: keda-add-ons-http-interceptor-proxy
                port:
                  number: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: comfyui-bypass-interceptor-proxy
  namespace: keda-scaler
spec:
  type: ExternalName
  externalName: comfyui.ai-tests.svc.cluster.local
  ports:
    - port: 8080
      targetPort: 80
Note one important thing: the Ingress expects all its backends to live in the same namespace as the Ingress declaration - in our case, that’s the keda-scaler namespace, not the namespace I deployed Comfy in (ai-tests). So we need an extra Service object of type ExternalName which allows the Ingress to “cross namespaces”.
Apply the changes, cross fingers, try to hit our service’s URL, and…
Note, of course, that if you use this workaround, any requests that go direct to the origin and bypass the interceptor will not be counted when KEDA makes its decision to scale up or down.
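One mitigation - my own suggestion, not something from the KEDA docs - is to lengthen scaledownPeriod, so that a long-lived WebSocket session (which generates no counted requests) is less likely to have its backend reaped out from under it. For example:
kubectl patch httpscaledobject comfyui -n ai-tests --type merge \
  -p '{"spec":{"scaledownPeriod":600}}'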