CCU:GPU Cluster Quick Start
== Running actual workloads on the cluster ==

You can now verify that you can start a GPU-enabled pod. Try to create a pod with the following spec, which allocates one GPU for you somewhere on the cluster. The image we use is provided by nVidia and has TensorFlow/Keras pre-installed. There are many other useful base images around which you can use instead.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
      command: ["sleep", "1d"]
      resources:
        requests:
          cpu: 1
          nvidia.com/gpu: 1
          memory: 10Gi
        limits:
          cpu: 1
          nvidia.com/gpu: 1
          memory: 10Gi
      volumeMounts:
        - mountPath: /abyss/home
          name: cephfs-home
          readOnly: false
        - mountPath: /abyss/shared
          name: cephfs-shared
          readOnly: false
        - mountPath: /abyss/datasets
          name: cephfs-datasets
          readOnly: true
  volumes:
    - name: cephfs-home
      hostPath:
        path: "/cephfs/abyss/home/<username>"
        type: Directory
    - name: cephfs-shared
      hostPath:
        path: "/cephfs/abyss/shared"
        type: Directory
    - name: cephfs-datasets
      hostPath:
        path: "/cephfs/abyss/datasets"
        type: Directory
</syntaxhighlight>

See [https://www.nvidia.com/en-us/gpu-cloud/containers/ the catalog of containers by nVidia] for more options for base images (e.g. [https://ngc.nvidia.com/catalog/containers/nvidia:pytorch PyTorch]), or Google around for containers of your favourite application. '''Make sure you only run containers from trusted sources!'''

'''Please note (very important): The 20.09 versions of the deep learning frameworks on nvcr.io work on all hosts in the cluster. While newer images are available, they require drivers >= 455, which are not yet available on all machines. For guaranteed compatibility you must stick to 20.09, but you can target a specific host with newer drivers (see the sketch below).''' At the bottom of the GPU cluster status page, there is the nvidia-smi output for each node, where you can check the individual driver and CUDA versions.

You can also switch to a shell in the container and verify the GPU capabilities:

<syntaxhighlight>
> kubectl apply -f gpu-pod.yaml
... wait until the pod is created; check with "kubectl describe pod gpu-pod" or "kubectl get pods"
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    51W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
</syntaxhighlight>

To check compatibility with specific nVidia containers, please refer to the [https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html official compatibility matrix]. Note that all nodes have datacenter drivers installed, which should give a large degree of compatibility. If in doubt, just try it out.
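If you do need one of the newer images, one way to pin the pod to a particular host with a newer driver is a nodeSelector on the standard kubernetes.io/hostname label. The following is a minimal sketch, not cluster policy: the host name gpu-node-07 and the newer image tag are placeholders, so pick a real host from the status page or via "kubectl get nodes".

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # Pin the pod to one specific node. The kubernetes.io/hostname label
  # is set automatically on every node; "gpu-node-07" is a hypothetical
  # name - list the real ones with "kubectl get nodes".
  nodeSelector:
    kubernetes.io/hostname: gpu-node-07
  containers:
    - name: gpu-container
      # A newer framework image; check the compatibility matrix first.
      image: nvcr.io/nvidia/tensorflow:21.07-tf2-py3
      command: ["sleep", "1d"]
      # resources, volumeMounts and volumes as in the pod spec above
</syntaxhighlight>

Note that if the selected node has no free GPU, the pod will simply stay in Pending until one becomes available.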
Combined with the volume mounts above, you already have a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust the paths to your data sets, or to mount the directories in the locations your code expects.

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-code-repo>
# python ./main.py
</syntaxhighlight>

Note that there are timeouts in place: this demo pod runs for only 24 hours, and an interactive session also has a time limit. For real workloads it is therefore better to build a custom run script which is executed when the container in the pod starts. A job is a wrapper around a pod spec which can, for example, make sure that the pod is restarted until it has at least one successful completion. This is useful for long deep learning workloads, where a pod failure might happen in between (for example due to a node reboot); see the job sketch at the end of this page. See the [https://kubernetes.io/docs/concepts/workloads/pods/ Kubernetes docs for pods] and [https://kubernetes.io/docs/concepts/workloads/controllers/job/ jobs] for more details.

If you do not have your own code ready, you can do a quick test of whether GPU execution works by running the demo code from [https://github.com/dragen1860/TensorFlow-2.x-Tutorials this tutorial] as follows:

<syntaxhighlight>
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home
# git clone https://github.com/dragen1860/TensorFlow-2.x-Tutorials.git
# cd TensorFlow-2.x-Tutorials/12_VAE
# ls
README.md  images  main.py  variational_autoencoder.png
# pip3 install pillow matplotlib
# python ./main.py
</syntaxhighlight>

Remember to clean up resources you are no longer using; this includes pods and jobs. For example, when your pod has finished whatever it is supposed to be doing, run

<syntaxhighlight>
> kubectl delete -f gpu-pod.yaml
</syntaxhighlight>

using the same manifest file you used to create the resource with kubectl apply.
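As a starting point for such a job, here is a minimal sketch. It reuses the 20.09 image and the GPU resources from the pod spec above; /abyss/home/<username>/run.sh is a hypothetical run script standing in for your own entry point, and the volumeMounts and volumes sections are elided (they would be copied from the pod spec unchanged).

<syntaxhighlight lang="yaml">
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-train-job
spec:
  backoffLimit: 4                # recreate a failed pod up to 4 times
  template:
    spec:
      restartPolicy: OnFailure   # jobs require OnFailure or Never
      containers:
        - name: gpu-container
          image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
          # Instead of "sleep 1d", run a script; the job counts a
          # completion when the script exits with status 0.
          # "/abyss/home/<username>/run.sh" is a hypothetical example.
          command: ["/bin/bash", "-c", "/abyss/home/<username>/run.sh"]
          resources:
            requests:
              cpu: 1
              nvidia.com/gpu: 1
              memory: 10Gi
            limits:
              cpu: 1
              nvidia.com/gpu: 1
              memory: 10Gi
          # volumeMounts: as in the pod spec above
      # volumes: as in the pod spec above
</syntaxhighlight>

Create it with "kubectl apply -f gpu-job.yaml", watch the progress with "kubectl get jobs" and "kubectl logs job/gpu-train-job", and clean it up afterwards with "kubectl delete -f gpu-job.yaml".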