
Kubernetes + NVIDIA on K3S

Goal: Set up a Kubernetes node to expose an NVIDIA GPU so that GPU workloads (AI, crypto, etc…) can run on Kubernetes.

Platform:

  • Debian 12
  • AMD64/x86_64
  • NVIDIA RTX 3070
  • Kubernetes (K3S)

How do we do it?

K3S update/gotcha

You must run an up-to-date Kubernetes. K3S versions older than this blog post will error on an NVIDIA-enabled node:

executing \"compiled_template\" at <.SystemdCgroup>: can't evaluate field SystemdCgroup in type templates.ContainerdRuntimeConfig"

This has since been fixed upstream: https://github.com/k3s-io/k3s/issues/8754

Workaround to install the specific commit referenced in the ticket:

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=1ae053d9447229daf8bbd2cd5adf89234e203bcc sh -s - --disable traefik --disable servicelb
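To confirm which K3S version ended up installed:

k3s --version
kubectl get nodes -o wide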

Zero to hero

Bare Metal

  1. Install NVIDIA CUDA drivers
  2. NVIDIA Container Toolkit
  3. Install/upgrade K3S

Reboot as needed. If the installation was successful, sudo grep -i nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml will match lines.
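As a rough sketch of the three steps above on Debian 12 (package names and repository setup vary between systems; NVIDIA's apt repositories are assumed to already be configured):

# 1. NVIDIA driver + CUDA support (Debian non-free packaging shown; the NVIDIA CUDA repo also works)
sudo apt-get update
sudo apt-get install -y nvidia-driver firmware-misc-nonfree

# 2. NVIDIA Container Toolkit (from NVIDIA's container toolkit apt repo)
sudo apt-get install -y nvidia-container-toolkit

# 3. Install/upgrade K3S (add the commit pin from above if you hit the gotcha)
curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable servicelb

# reboot, then verify the driver is loaded
nvidia-smi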

Kubernetes Node

To expand on the K3S notes above:

1. RuntimeClass

Deploy the RuntimeClass (from the K3s docs). This is essential: it selects the GPU-enabled container runtime that K3S sets up for us:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
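Save it to a file and apply it (the filename is arbitrary):

kubectl apply -f runtimeclass-nvidia.yaml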

2. Device Plugin

Deploy the NVIDIA device plugin via Helm and use the RuntimeClass configured above:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.2 \
  --set runtimeClassName=nvidia

# tricks to expose NVIDIA device nodes under /dev. Not normally needed
#  --set deviceListStrategy=volume-mounts
#  --set compatWithCPUManager=true
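This assumes the nvdp Helm repository has already been added; if not, add it first (repo URL from the upstream device plugin project):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update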

The NVIDIA device plugin should now advertise the nvidia.com/gpu resource on the node:

kubectl describe node | grep nvidia.com/gpu
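If you prefer a single value, a jsonpath query along these lines should print the allocatable GPU count (dots inside the key need escaping):

kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'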

3. GPU Feature Discovery

Not all GPUs are created equal. This extra pod labels Kubernetes nodes with the GPU features they support. Deploy NVIDIA GPU Feature Discovery via Helm and configure the RuntimeClass again:

helm upgrade -i nvgfd nvgfd/gpu-feature-discovery \
  --version 0.8.2 \
  --namespace gpu-feature-discovery \
  --create-namespace \
  --set runtimeClassName=nvidia
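As before, this assumes the nvgfd Helm repository has been added; if not:

helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update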

Verify the labelling by looking for a set of new nvidia.com labels on the node:

kubectl describe node | grep nvidia.com
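Or show one of the new labels as a column, e.g. the product name label that GPU Feature Discovery sets:

kubectl get nodes -L nvidia.com/gpu.product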

4. Deploy the benchmark pod (sample workload from the K3s docs)

apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
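Save the manifest and apply it, then check the result once the pod completes (filename is arbitrary):

kubectl apply -f nbody-gpu-benchmark.yaml
kubectl get pod nbody-gpu-benchmark
kubectl logs nbody-gpu-benchmark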

If everything is working:

  1. The pod will run and go to state Completed
  2. The pod logs should look something like:
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance)
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation)
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3070]
47104 bodies, total time for 10 iterations: 38.352 ms
= 578.534 billion interactions per second
= 11570.683 single-precision GFLOP/s at 20 flops per interaction

Deploy GPU workloads

The previous step proved that CUDA pods are working, so now it's time to create your own CUDA containers. Some hints:

  • The container needs to embed a CUDA runtime. The easiest way to do this is to use the cuda image from NVIDIA
  • GPU not working in the container/strange errors? The first step is to run nvidia-smi inside the container. It should give the same output as on the node
  • Strange errors running GPU workloads? Check you are using the “right” CUDA version, eg the nvidia base image matches what the app was compiled against
  • Example Dockerfile (GPU mining) - see the sketch after the deployment below
  • Example k8s deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: crypto
  name: bzminer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bzminer
  template:
    metadata:
      name: bzminer
      labels:
        app: bzminer
    spec:
      hostname: bzminer
      runtimeClassName: nvidia
      containers:
      - name: bzminer
        image: quay.io/declarativesystems/cryptodaemons_bzminer:17.0.0
        imagePullPolicy: Always
        args:
        - "-a"
        - "meowcoin"
        - "-w"
        - "MGq7UPAASNwzTKWPKjrsrJxyDxpwdvdTr5"
        - "-r"
        - "cloud"
        - "-p"
        - "stratum+tcp://stratum.coinminerz.com:3323"
        - "--nc"
        - "1"
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Always
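For the Dockerfile hint above, a minimal hypothetical sketch only - the base image tag and binary name are placeholders, and the CUDA version should match what the application was built against:

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

# copy in a pre-built GPU application (placeholder name)
COPY my-gpu-app /usr/local/bin/my-gpu-app

ENTRYPOINT ["/usr/local/bin/my-gpu-app"]

The deployment itself is applied as usual:

kubectl create namespace crypto
kubectl apply -f bzminer.yaml
kubectl logs -n crypto deploy/bzminer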

Improvements

These examples dedicate one GPU to one workload. It's also possible to time-share, so that a single GPU can be shared between a bunch of apps. This is left as an exercise for the reader ;-) (a starting point is sketched below)
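As that starting point, the NVIDIA device plugin supports time-slicing through its config file; a sketch of what the config can look like (wiring it into the Helm chart is covered in the device plugin docs):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4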
