=== Moving your workloads to the new cluster ===

You can now verify that you are able to start a GPU-enabled pod. Create a pod with the following spec; it allocates one GPU for you somewhere on the cluster and comes with an immediately usable installation of TensorFlow 2.0. Note that defining resource requests and limits is now mandatory.

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
      command: ["sleep", "1d"]
      resources:
        requests:
          cpu: 1
          nvidia.com/gpu: 1
          memory: 10Gi
        limits:
          cpu: 1
          nvidia.com/gpu: 1
          memory: 10Gi
      volumeMounts:
        - mountPath: /abyss/home
          name: cephfs-home
          readOnly: false
        - mountPath: /abyss/shared
          name: cephfs-shared
          readOnly: false
        - mountPath: /abyss/datasets
          name: cephfs-datasets
          readOnly: true
  volumes:
    - name: cephfs-home
      hostPath:
        path: "/cephfs/abyss/home/<username>"
        type: Directory
    - name: cephfs-shared
      hostPath:
        path: "/cephfs/abyss/shared"
        type: Directory
    - name: cephfs-datasets
      hostPath:
        path: "/cephfs/abyss/datasets"
        type: Directory
</syntaxhighlight>

'''Please note (very important): the 20.09 versions of the container images on nvcr.io work on all hosts in the cluster. Newer images are available, but they require drivers >= 455, which are not yet installed on all machines. Please stick to 20.09 unless you target a very specific host.''' I will soon provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster.

You can again switch to a shell in the container and verify its GPU capabilities:

<syntaxhighlight lang="console">
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    51W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
</syntaxhighlight>

Combined with the volume mounts above, this already gives you a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust paths to data sets, or to mount the directories in the locations your code expects.

<syntaxhighlight lang="console">
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-code-repo>
# python ./main.py
</syntaxhighlight>

Note that there are timeouts in place: this demo pod runs only for 24 hours, and an interactive session also has a time limit. It is therefore better to build a custom run script that is executed when the container in the pod starts. A job is a wrapper around a pod spec which can, for example, make sure that the pod is restarted until it has at least one successful completion. This is useful for long deep learning workloads, where a pod failure might happen in between (for example due to a node reboot). See the Kubernetes docs on [https://kubernetes.io/docs/concepts/workloads/pods/ pods] and [https://kubernetes.io/docs/concepts/workloads/controllers/job/ jobs] for more details.
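As a starting point, here is a minimal sketch of such a job, assuming your entry point is a <code>main.py</code> in a repository under your home directory (the repository name is a placeholder, and the shared and datasets mounts from the pod spec above are omitted for brevity; add them back as needed). The image and resource section are the same as in the pod above; Kubernetes re-creates the pod on failure until one run completes successfully.

<syntaxhighlight lang="yaml">
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  # Retry a failed pod up to 4 times before marking the job as failed.
  backoffLimit: 4
  template:
    spec:
      # Jobs require OnFailure or Never; OnFailure restarts after crashes.
      restartPolicy: OnFailure
      containers:
        - name: gpu-container
          image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
          # Placeholder entry point: replace with your own run script.
          command: ["python", "/abyss/home/<your-code-repo>/main.py"]
          resources:
            requests:
              cpu: 1
              nvidia.com/gpu: 1
              memory: 10Gi
            limits:
              cpu: 1
              nvidia.com/gpu: 1
              memory: 10Gi
          volumeMounts:
            - mountPath: /abyss/home
              name: cephfs-home
              readOnly: false
      volumes:
        - name: cephfs-home
          hostPath:
            path: "/cephfs/abyss/home/<username>"
            type: Directory
</syntaxhighlight>

Submit the job with <code>kubectl apply -f gpu-job.yaml</code> and follow its output with <code>kubectl logs -f job/gpu-job</code>.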
If you do not have your own code ready, you can quickly check that GPU execution works by running demo code from [https://github.com/dragen1860/TensorFlow-2.x-Tutorials this tutorial] as follows:

<syntaxhighlight lang="console">
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home
# git clone https://github.com/dragen1860/TensorFlow-2.x-Tutorials.git
# cd TensorFlow-2.x-Tutorials/12_VAE
# ls
README.md  images  main.py  variational_autoencoder.png
# pip3 install pillow matplotlib
# python ./main.py
</syntaxhighlight>
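When you are done testing, free the GPU for other users by deleting the demo pod instead of letting it linger until its 24-hour timeout:

<syntaxhighlight lang="console">
> kubectl delete pod gpu-pod
</syntaxhighlight>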