CCU:New GPU Cluster
== Getting started on the new cluster ==

=== Login to the new cluster and update your kubeconfig ===

The frontend for the cluster and login services is located at https://ccu-k8s.inf.uni-konstanz.de/. Choose "login to the cluster" and enter your credentials to obtain the kubeconfig data, then choose "full kubeconfig" on the left for all the details you need. Either back up your old kubeconfig and use this one in its place, or merge both into a new kubeconfig, which lets you switch context easily between the two clusters. In the beginning this can be useful, as you may have forgotten some data on the old cluster and will still need to clean up there once everything works.

A kubeconfig for both clusters has the following structure (note that it must be saved as "~/.kube/config"):

<syntaxhighlight lang="yaml">
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRV ... <many more characters>
    server: https://134.34.224.84:6443
  name: ccu-old
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRV ... <many more characters>
    server: https://ccu-k8s.inf.uni-konstanz.de:7443
  name: ccu-new
contexts:
- context:
    cluster: ccu-old
    namespace: exc-cb
    user: credentials-old
  name: ccu-old
- context:
    cluster: ccu-new
    namespace: <your-namespace>
    user: credentials-new
  name: ccu-new
current-context: ccu-new
kind: Config
preferences: {}
users:
- name: credentials-old
  <all the data below your username returned from the old loginapp goes here>
- name: credentials-new
  <all the data below your username returned from the new loginapp goes here>
</syntaxhighlight>

Both the long CA data string and the user credentials are returned by the respective loginapps of the clusters. Note: the CA data differs between the two clusters, even though the first couple of characters are the same.

If you have created such a kubeconfig with multiple contexts, you can easily switch between the clusters:

<syntaxhighlight>
> kubectl config use-context ccu-old
> <... work with old cluster>
> kubectl config use-context ccu-new
> <... work with new cluster>
</syntaxhighlight>

Defining different contexts is also a good way to switch between namespaces or users (which should not be necessary for the average user).

=== Running the first test container on the new cluster ===

After logging in and adjusting the kubeconfig to the new cluster and your user namespace, you should be able to start your first pod. The following example pod mounts the ceph filesystems into an Ubuntu container image. Remember to fill in the placeholder <your-username> for your home directory below.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: access-pod
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 1
        memory: 1Gi
    volumeMounts:
    - mountPath: /abyss/home
      name: cephfs-home
      readOnly: false
    - mountPath: /abyss/shared
      name: cephfs-shared
      readOnly: false
    - mountPath: /abyss/datasets
      name: cephfs-datasets
      readOnly: true
  volumes:
  - name: cephfs-home
    hostPath:
      path: "/cephfs/abyss/home/<your-username>"
      type: Directory
  - name: cephfs-shared
    hostPath:
      path: "/cephfs/abyss/shared"
      type: Directory
  - name: cephfs-datasets
    hostPath:
      path: "/cephfs/abyss/datasets"
      type: Directory
</syntaxhighlight>

Save this, e.g., as "access-pod.yaml", start the pod, and verify that it has been created correctly and that the filesystems have been mounted, for example with the commands below. You can also check whether you can access the data you have copied over, and copy/move it somewhere safe in your private home directory. If you have a large dataset which is probably useful for several people, please contact me so I can move it to the static read-only tree for datasets.
<syntaxhighlight lang="bash">
> kubectl apply -f access-pod.yaml
> kubectl get pods
> kubectl describe pod access-pod
> kubectl exec -it access-pod -- /bin/bash
$ ls /abyss/shared/<the directory you created for your data>
</syntaxhighlight>

=== Moving your workloads to the new cluster ===

You can now verify that you can start a GPU-enabled pod. Create a pod with the following spec to allocate one GPU for you somewhere on the cluster. The pod comes with an immediately usable installation of TensorFlow 2. Note that defining resource requests and limits is now mandatory.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
    volumeMounts:
    - mountPath: /abyss/home
      name: cephfs-home
      readOnly: false
    - mountPath: /abyss/shared
      name: cephfs-shared
      readOnly: false
    - mountPath: /abyss/datasets
      name: cephfs-datasets
      readOnly: true
  volumes:
  - name: cephfs-home
    hostPath:
      path: "/cephfs/abyss/home/<username>"
      type: Directory
  - name: cephfs-shared
    hostPath:
      path: "/cephfs/abyss/shared"
      type: Directory
  - name: cephfs-datasets
    hostPath:
      path: "/cephfs/abyss/datasets"
      type: Directory
</syntaxhighlight>

'''Please note (very important): the 20.09 versions of the container images on nvcr.io work on all hosts in the cluster. Newer images are available, but they require drivers >= 455, which are not yet available on all machines. So please stick to 20.09 unless you target a very specific host.''' I will provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster.
You can again switch to a shell in the container and verify its GPU capabilities:

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    51W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
</syntaxhighlight>

Combined with the volume mounts above, you already have a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust paths to datasets, or to mount the directories in the locations your code expects.

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-code-repo>
# python ./main.py
</syntaxhighlight>

Note that there are timeouts in place: this demo pod runs only for 24 hours, and an interactive session also has a time limit, so it is better to build a custom run script which is executed when the container in the pod starts. A job is a wrapper around a pod spec which can, for example, ensure that the pod is restarted until it has at least one successful completion. This is useful for long deep-learning workloads, where a pod failure might happen in between (for example due to a node reboot). See the [https://kubernetes.io/docs/concepts/workloads/pods/ Kubernetes docs for pods] or [https://kubernetes.io/docs/concepts/workloads/controllers/job/ jobs] for more details.
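As a rough sketch, such a job could wrap the GPU pod spec from above like this. The name gpu-job, the main.py path, and the backoffLimit value are placeholders/illustrative choices, not prescribed by the cluster; only the home volume is shown, add the other mounts as needed:

<syntaxhighlight lang="yaml">
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  backoffLimit: 4            # retry a failed pod up to 4 times
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: gpu-container
        image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
        # run your script directly instead of sleeping
        command: ["python", "/abyss/home/<your-code-repo>/main.py"]
        resources:
          requests:
            cpu: 1
            nvidia.com/gpu: 1
            memory: 10Gi
          limits:
            cpu: 1
            nvidia.com/gpu: 1
            memory: 10Gi
        volumeMounts:
        - mountPath: /abyss/home
          name: cephfs-home
      volumes:
      - name: cephfs-home
        hostPath:
          path: "/cephfs/abyss/home/<your-username>"
          type: Directory
</syntaxhighlight>

Unlike the demo pod, a job is not deleted on failure: with restartPolicy: OnFailure the container is restarted in place, and backoffLimit caps how often Kubernetes retries before marking the job as failed.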
If you do not have your own code ready, you can quickly test whether GPU execution works by running demo code from [https://github.com/dragen1860/TensorFlow-2.x-Tutorials this tutorial] as follows:

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home
# git clone https://github.com/dragen1860/TensorFlow-2.x-Tutorials.git
# cd TensorFlow-2.x-Tutorials/12_VAE
# ls
README.md  images  main.py  variational_autoencoder.png
# pip3 install pillow matplotlib
# python ./main.py
</syntaxhighlight>

=== Cleaning up ===

Once everything works for you on the new cluster, please clean up your presence on the old one. In particular:
* Delete all running pods.
* Delete all persistent volume claims. This is the most important step, as it shows me which of the nodes' local filesystems are no longer in use, so I can transfer those nodes over to the new cluster.
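On the old cluster, the cleanup could look roughly like this (a sketch; <pod-name> and <pvc-name> stand for whatever "kubectl get" lists in your namespace):

<syntaxhighlight>
> kubectl config use-context ccu-old
> kubectl get pods
> kubectl delete pod <pod-name>
> kubectl get pvc
> kubectl delete pvc <pvc-name>
</syntaxhighlight>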