CCU:New GPU Cluster
== Getting started on the new cluster ==

=== Login to the new cluster and update your kubeconfig ===

The frontend for the cluster and login services is located at https://ccu-k8s.inf.uni-konstanz.de/. Choose "login to the cluster" and enter your credentials to obtain the kubeconfig data, then choose "full kubeconfig" on the left for all the details you need. Either back up your old kubeconfig and use this one in its place, or merge both into a new kubeconfig, which lets you switch context easily between the two clusters. In the beginning this can be useful, as you may have forgotten some data on the old cluster and will still need to clean up there once everything works.

A kubeconfig for both clusters has the following structure (note that it must be saved as "~/.kube/config"):

<syntaxhighlight lang="yaml">
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRV ... <many more characters>
    server: https://134.34.224.84:6443
  name: ccu-old
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRV ... <many more characters>
    server: https://ccu-k8s.inf.uni-konstanz.de:7443
  name: ccu-new
contexts:
- context:
    cluster: ccu-old
    namespace: exc-cb
    user: credentials-old
  name: ccu-old
- context:
    cluster: ccu-new
    namespace: <your-namespace>
    user: credentials-new
  name: ccu-new
current-context: ccu-new
kind: Config
preferences: {}
users:
- name: credentials-old
  <all the data below your username returned from the old loginapp goes here>
- name: credentials-new
  <all the data below your username returned from the new loginapp goes here>
</syntaxhighlight>

Both the long CA data string and the user credentials are returned by the respective loginapps of the clusters. Note: the CA data differs between the two clusters, even though the first couple of characters are the same.

If you have created such a kubeconfig with multiple contexts, you can easily switch between the clusters:

<syntaxhighlight>
> kubectl config use-context ccu-old
> <... work with old cluster>
> kubectl config use-context ccu-new
> <... work with new cluster>
</syntaxhighlight>

Defining different contexts is also a good way to switch between namespaces or users (which should not be necessary for the average user).

=== Running the first test container on the new cluster ===

After logging in and adjusting the kubeconfig to the new cluster and your user namespace, you should be able to start your first pod. The following example pod mounts the ceph filesystems into an Ubuntu container image. Remember to fill in the placeholder <your-username> for your home directory below.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: access-pod
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 1
        memory: 1Gi
    volumeMounts:
    - mountPath: /abyss/home
      name: cephfs-home
      readOnly: false
    - mountPath: /abyss/shared
      name: cephfs-shared
      readOnly: false
    - mountPath: /abyss/datasets
      name: cephfs-datasets
      readOnly: true
  volumes:
  - name: cephfs-home
    hostPath:
      path: "/cephfs/abyss/home/<your-username>"
      type: Directory
  - name: cephfs-shared
    hostPath:
      path: "/cephfs/abyss/shared"
      type: Directory
  - name: cephfs-datasets
    hostPath:
      path: "/cephfs/abyss/datasets"
      type: Directory
</syntaxhighlight>

Save this, e.g., as "access-pod.yaml", start the pod, and verify that it has been created correctly and that the filesystems have been mounted, for example with the commands below. You can also check whether you can access the data you have copied over, and copy/move it somewhere safe in your private home directory. If you have a large dataset which is probably useful for several people, please contact me so I can move it to the static read-only tree for datasets.
<syntaxhighlight lang="bash">
> kubectl apply -f access-pod.yaml
> kubectl get pods
> kubectl describe pod access-pod
> kubectl exec -it access-pod -- /bin/bash
$ ls /abyss/shared/<the directory you created for your data>
</syntaxhighlight>

=== Moving your workloads to the new cluster ===

You can now verify that you can start a GPU-enabled pod. Create a pod with the following spec to allocate one GPU for you somewhere on the cluster. The pod comes with an immediately usable installation of TensorFlow 2. Note that defining resource requests and limits is now mandatory.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
    volumeMounts:
    - mountPath: /abyss/home
      name: cephfs-home
      readOnly: false
    - mountPath: /abyss/shared
      name: cephfs-shared
      readOnly: false
    - mountPath: /abyss/datasets
      name: cephfs-datasets
      readOnly: true
  volumes:
  - name: cephfs-home
    hostPath:
      path: "/cephfs/abyss/home/<username>"
      type: Directory
  - name: cephfs-shared
    hostPath:
      path: "/cephfs/abyss/shared"
      type: Directory
  - name: cephfs-datasets
    hostPath:
      path: "/cephfs/abyss/datasets"
      type: Directory
</syntaxhighlight>

'''Please note (very important): the 20.09 versions of the container images on nvcr.io work on all hosts in the cluster. Newer images are available, but they require drivers >= 455, which are not yet available on all machines. So please stick to 20.09 unless you target a very specific host.''' I will provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster.
You can again switch to a shell in the container and verify its GPU capabilities:

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    51W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
</syntaxhighlight>

Combined with the volume mounts above, you already have a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust paths to datasets, or to mount the directories in the locations your code expects.

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-code-repo>
# python ./main.py
</syntaxhighlight>

Note that there are timeouts in place: this demo pod runs only for 24 hours, and an interactive session also has a time limit, so it is better to build a custom run script which is executed when the container in the pod starts. A job is a wrapper around a pod spec which can, for example, ensure that the pod is restarted until it has at least one successful completion. This is useful for long deep-learning workloads, where a pod failure might happen in between (for example due to a node reboot). See the [https://kubernetes.io/docs/concepts/workloads/pods/ Kubernetes docs for pods] or [https://kubernetes.io/docs/concepts/workloads/controllers/job/ jobs] for more details.
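As a rough sketch, such a job could wrap the GPU pod spec from above like this. The name gpu-job, the main.py path, and the backoffLimit value are placeholders/illustrative choices, not prescribed by the cluster; only the home volume is shown, add the other mounts as needed:

<syntaxhighlight lang="yaml">
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  backoffLimit: 4            # retry a failed pod up to 4 times
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: gpu-container
        image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
        # run your script directly instead of sleeping
        command: ["python", "/abyss/home/<your-code-repo>/main.py"]
        resources:
          requests:
            cpu: 1
            nvidia.com/gpu: 1
            memory: 10Gi
          limits:
            cpu: 1
            nvidia.com/gpu: 1
            memory: 10Gi
        volumeMounts:
        - mountPath: /abyss/home
          name: cephfs-home
      volumes:
      - name: cephfs-home
        hostPath:
          path: "/cephfs/abyss/home/<your-username>"
          type: Directory
</syntaxhighlight>

Unlike the demo pod, a job is not deleted on failure: with restartPolicy: OnFailure the container is restarted in place, and backoffLimit caps how often Kubernetes retries before marking the job as failed.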
If you do not have your own code ready, you can quickly test whether GPU execution works by running demo code from [https://github.com/dragen1860/TensorFlow-2.x-Tutorials this tutorial] as follows:

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home
# git clone https://github.com/dragen1860/TensorFlow-2.x-Tutorials.git
# cd TensorFlow-2.x-Tutorials/12_VAE
# ls
README.md  images  main.py  variational_autoencoder.png
# pip3 install pillow matplotlib
# python ./main.py
</syntaxhighlight>

=== Cleaning up ===

Once everything works for you on the new cluster, please clean up your presence on the old one. In particular:
* Delete all running pods.
* Delete all persistent volume claims. This is the most important step, as it shows me which of the nodes' local filesystems are no longer in use, so I can transfer those nodes over to the new cluster.
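On the old cluster, the cleanup could look roughly like this (a sketch; <pod-name> and <pvc-name> stand for whatever "kubectl get" lists in your namespace):

<syntaxhighlight>
> kubectl config use-context ccu-old
> kubectl get pods
> kubectl delete pod <pod-name>
> kubectl get pvc
> kubectl delete pvc <pvc-name>
</syntaxhighlight>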