=== Moving your workloads to the new cluster ===

You can now verify that you are able to start a GPU-enabled pod. Create a pod with the following spec; it allocates one GPU for you somewhere on the cluster and comes with an immediately usable installation of TensorFlow 2.0. Note that defining resource requests and limits is now mandatory.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
      command: ["sleep", "1d"]
      resources:
        requests:
          cpu: 1
          nvidia.com/gpu: 1
          memory: 10Gi
        limits:
          cpu: 1
          nvidia.com/gpu: 1
          memory: 10Gi
      volumeMounts:
        - mountPath: /abyss/home
          name: cephfs-home
          readOnly: false
        - mountPath: /abyss/shared
          name: cephfs-shared
          readOnly: false
        - mountPath: /abyss/datasets
          name: cephfs-datasets
          readOnly: true
  volumes:
    - name: cephfs-home
      hostPath:
        path: "/cephfs/abyss/home/<username>"
        type: Directory
    - name: cephfs-shared
      hostPath:
        path: "/cephfs/abyss/shared"
        type: Directory
    - name: cephfs-datasets
      hostPath:
        path: "/cephfs/abyss/datasets"
        type: Directory
</syntaxhighlight>

'''Please note (very important): the 20.09 versions of the container images on nvcr.io work on all hosts in the cluster. Newer images are available, but they require drivers >= 455, which are not yet installed on all machines. Please stick to 20.09 unless you target a very specific host.''' I will soon provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster.

You can again switch to a shell in the container and verify its GPU capabilities:

<syntaxhighlight lang="console">
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    51W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
</syntaxhighlight>

Combined with the volume mounts above, this already gives you a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust paths to data sets, or to mount the directories in the locations your code expects.

<syntaxhighlight lang="console">
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-code-repo>
# python ./main.py
</syntaxhighlight>

Note that there are timeouts in place: this demo pod runs only for 24 hours, and an interactive session also has a time limit. It is therefore better to build a custom run script that is executed when the container in the pod starts. A job is a wrapper around a pod spec which can, for example, make sure that the pod is restarted until it has at least one successful completion. This is useful for long deep learning workloads, where a pod failure might happen in between (for example due to a node reboot). See the Kubernetes docs on [https://kubernetes.io/docs/concepts/workloads/pods/ pods] and [https://kubernetes.io/docs/concepts/workloads/controllers/job/ jobs] for more details.
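As a starting point, here is a minimal sketch of such a job, assuming your entry point is a <code>main.py</code> in a repository under your home directory (the repository name is a placeholder, and the shared and datasets mounts from the pod spec above are omitted for brevity; add them back as needed). The image and resource section are the same as in the pod above; Kubernetes re-creates the pod on failure until one run completes successfully.

<syntaxhighlight lang="yaml">
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  # Retry a failed pod up to 4 times before marking the job as failed.
  backoffLimit: 4
  template:
    spec:
      # Jobs require OnFailure or Never; OnFailure restarts after crashes.
      restartPolicy: OnFailure
      containers:
        - name: gpu-container
          image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
          # Placeholder entry point: replace with your own run script.
          command: ["python", "/abyss/home/<your-code-repo>/main.py"]
          resources:
            requests:
              cpu: 1
              nvidia.com/gpu: 1
              memory: 10Gi
            limits:
              cpu: 1
              nvidia.com/gpu: 1
              memory: 10Gi
          volumeMounts:
            - mountPath: /abyss/home
              name: cephfs-home
              readOnly: false
      volumes:
        - name: cephfs-home
          hostPath:
            path: "/cephfs/abyss/home/<username>"
            type: Directory
</syntaxhighlight>

Submit the job with <code>kubectl apply -f gpu-job.yaml</code> and follow its output with <code>kubectl logs -f job/gpu-job</code>.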
If you do not have your own code ready, you can quickly check that GPU execution works by running demo code from [https://github.com/dragen1860/TensorFlow-2.x-Tutorials this tutorial] as follows:

<syntaxhighlight lang="console">
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home
# git clone https://github.com/dragen1860/TensorFlow-2.x-Tutorials.git
# cd TensorFlow-2.x-Tutorials/12_VAE
# ls
README.md  images  main.py  variational_autoencoder.png
# pip3 install pillow matplotlib
# python ./main.py
</syntaxhighlight>
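When you are done testing, free the GPU for other users by deleting the demo pod instead of letting it linger until its 24-hour timeout:

<syntaxhighlight lang="console">
> kubectl delete pod gpu-pod
</syntaxhighlight>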