== Targeting more powerful GPUs ==

By default, Kubernetes schedules GPU pods only on the smallest class of GPU (NVIDIA Titan). This is achieved by assigning a "node taint" to nodes with higher-grade GPUs, which makes those nodes available only to pods that declare themselves "tolerant" of the taint.

So if your task, for example, requires a GPU with *exactly* 32 GB of memory, you have to

# make the pod tolerate the taint "gpumem=32:NoSchedule" (see table below), and
# make the pod require the node label "gpumem" to be exactly 32.

See the [https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ Kubernetes documentation on taints and tolerations] for more details.

Example:

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpumem: "32"
  tolerations:
  - key: "gpumem"
    # Note: to be able to run on a GPU with any amount of memory,
    # replace the operator/value pair by just 'operator: "Exists"'.
    operator: "Equal"
    value: "32"
    effect: "NoSchedule"
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
  # more specs (volumes etc.)
</syntaxhighlight>

If you need a GPU with *at least* 32 GB but would also be happy with more, you can simply tolerate any amount. Then make the pod require the node label "gpumem" to be larger than 31. Note: typically you should *not* do this; reserve a GPU that has just enough memory. However, if e.g. all 32 GB GPUs are already busy, you can move up to a 40 GB GPU.

Example:

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # The standard node selector is insufficient here;
  # this needs the more expressive "nodeAffinity".
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpumem
            operator: Gt
            values: ["31"]
        # note: this also works to specify a minimum compute capability
        - matchExpressions:
          - key: nvidia-compute-capability-sm
            operator: Gt
            values: ["79"]
  tolerations:
  - key: "gpumem"
    operator: "Exists"
    effect: "NoSchedule"
  # ... rest of the specs like before
</syntaxhighlight>
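If you do not care about GPU memory at all and just want the pod to be schedulable on every GPU node, the toleration alone (with 'operator: "Exists"', as hinted in the comment of the first example) is enough, and the node selector can be dropped entirely. A minimal sketch of that variant; the pod name "gpu-pod-any" is only a placeholder, the rest mirrors the first example:

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-any   # placeholder name
spec:
  # Tolerate every "gpumem" taint regardless of its value and set no
  # node selector, so the scheduler may place the pod on any GPU node,
  # including the default (smallest) ones.
  tolerations:
  - key: "gpumem"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
</syntaxhighlight>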
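For reference, this is roughly how the taint/label mechanism looks from the node side. The node name and the concrete values below are made up for illustration and do not describe actual cluster hardware: the "gpumem" label is what the node selector or node affinity above matches, and the taint is what the pod's toleration must accept.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                    # hypothetical node name
  labels:
    gpumem: "40"                       # matched by nodeSelector / nodeAffinity
    nvidia-compute-capability-sm: "80"
spec:
  taints:
  - key: "gpumem"                      # must be tolerated by the pod
    value: "40"
    effect: "NoSchedule"
</syntaxhighlight>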