GPU Resources
Revision as of 18:44, 15 May 2023
When submitting jobs, you can ask for GPUs in one of two ways. One is:
#SBATCH --gres=gpu:1
That will ask for 1 GPU generically on a node with a free GPU. This request is more specific:
#SBATCH --gres=gpu:A5500:3
That requests 3 A5500 GPUs only.
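As a rough sketch, a complete batch script built around the more specific request might look like the following. The job name and the `count_visible_gpus` helper are made up for illustration, and the script assumes the cluster's Slurm configuration exports `CUDA_VISIBLE_DEVICES` to jobs that request GPUs, which is common but not guaranteed.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-check      # hypothetical job name
#SBATCH --gres=gpu:A5500:3        # three A5500 GPUs

# Count the GPUs listed in CUDA_VISIBLE_DEVICES (assumed to be set by Slurm).
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return
    fi
    # The variable is a comma-separated list such as "0,1,2":
    # put one entry per line and count the non-empty lines.
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
}

echo "GPUs visible to this job: $(count_visible_gpus)"

# nvidia-smi is installed on the cluster nodes; skip it quietly elsewhere.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
```

Submitted with `sbatch`, a script like this should report three visible GPUs if the request was granted as written.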
We have several GPU types on the cluster which may fit your specific needs:
nVidia RTX A5500 : 24GB RAM
nVidia A100 : 80GB RAM
nVidia GeForce RTX 2080 Ti : 11GB RAM
nVidia GeForce RTX 1080 Ti : 11GB RAM
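The exact type strings that `--gres` accepts are set by the cluster's Slurm configuration, and only `A5500` is confirmed on this page. One way to look up the others is to ask `sinfo` for each node's generic resources. The sketch below does that and pulls out the type names; the `gres_types` helper is hypothetical, and it assumes GRES entries are shaped like `gpu:A5500:4`.

```shell
#!/bin/bash
# Extract GPU type names from GRES strings read on stdin.
# Assumes entries shaped like "gpu:TYPE:COUNT"; untyped entries such as
# "gpu:4" carry no type name and are skipped.
gres_types() {
    tr ',' '\n' | sed -n 's/^gpu:\([^:]*\):[0-9].*$/\1/p' | sort -u
}

# %N prints node names and %G prints their generic resources (GRES).
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o '%N %G'              # one line per node
    sinfo -h -o '%G' | gres_types       # unique GPU type names
fi
```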
Using GPUs
The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi, but they do not have the full CUDA Toolkit or the nvcc CUDA compiler. If you want to compile applications that use CUDA, you will need to install the CUDA development environment yourself, under your own user account.
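Before trying to build CUDA code, it can help to confirm what a node actually provides. The following sketch, meant to be run inside a job on a GPU node, reports whether `nvidia-smi` and `nvcc` are on the `PATH`. The `have_tool` helper is a made-up name.

```shell
#!/bin/bash
# Succeed if the named command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

for tool in nvidia-smi nvcc; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
    fi
done
```

On the nodes as described above, this would be expected to find nvidia-smi but not nvcc, unless you have already installed a CUDA development environment for your user.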
To actually use a GPU, you need to run a program that uses the CUDA API. Some projects, such as TensorFlow, may ship pre-built binaries that can use CUDA without needing a compiler.
You can also run containers on the cluster using Singularity, and give them access to GPUs using the --nv option. For example:
singularity pull docker://tensorflow/tensorflow:latest-gpu
srun -c 8 --mem 10G --gres=gpu:1 --exclude phoenix-01 /usr/bin/singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'
This will produce output showing that the TensorFlow container is indeed able to talk to one GPU:
INFO: Using cached SIF image
2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory: -> device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8527638019084870106
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 23324655616
locality {
bus_id: 1
links {
}
}
incarnation: 1860154623440434360
physical_device_desc: "device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6"
xla_global_id: 416903419
]
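The same check can also be submitted as a batch job instead of being run interactively with `srun`. This is only a sketch under a few assumptions: the image was pulled beforehand into the submission directory, where `singularity pull` names it `tensorflow_latest-gpu.sif` by default; the `IMAGE` variable is illustrative; and the resource requests simply mirror the `srun` example, including its node exclusion.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=10G
#SBATCH --gres=gpu:1
#SBATCH --exclude=phoenix-01      # mirrors the srun example above

# Image file created by the earlier "singularity pull" (assumed location).
IMAGE="${IMAGE:-tensorflow_latest-gpu.sif}"

if command -v singularity >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
    # --nv exposes the host's nVidia driver libraries and devices to the container.
    singularity exec --nv "$IMAGE" \
        python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
else
    echo "singularity or $IMAGE not available; nothing to run" >&2
fi
```

Running from a local `.sif` file avoids resolving the `docker://` reference again each time the job starts.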
You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Unfortunately, the Docker daemon is not installed on the cluster nodes, so Docker is not available. Singularity can run most Docker containers instead, and it does not require users to control a highly privileged daemon.
Slurm itself also supports a --container option for jobs, which allows a whole job to be run inside a container. If you are able to convert your container to OCI Bundle format, you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk, and the tools to download a Docker image from Docker Hub in OCI bundle format (skopeo and umoci) are not installed on the cluster.