Tag: Slurm
-
Using Prism
Prism is equipped with several powerful servers built specifically to accelerate AI/ML/DL workloads with GPUs. This cutting-edge platform is easy to access, and the preinstalled software and libraries provide foundational tools that help scientists get the most out of their workflows. (Table columns: Environment, System, Sockets, CPU Cores per socket, Total CPU Cores, CPU Memory, NVMe Storage (TB), GPUs and GPU…)
-
Slurm on ADAPT
To distribute user jobs fairly across shared resources, some of our VM clusters on ADAPT are equipped with Slurm. With Slurm, users can run both interactive and non-interactive jobs on specified resources without worrying about interference from other users’ workloads. When resources aren’t readily available, Slurm will also queue your jobs…
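As a concrete illustration of the two job styles (a minimal sketch; the job name, resource counts, and time limit are placeholders, not ADAPT-specific values), an interactive session is requested with `salloc`, while a non-interactive job is a batch script handed to `sbatch`:

```shell
#!/bin/bash
# Minimal non-interactive job script -- save as myjob.sh and submit with:
#   sbatch myjob.sh
# For an interactive session instead, request resources directly:
#   salloc --ntasks=1 --time=01:00:00
#SBATCH --job-name=example   # name shown in the queue (placeholder)
#SBATCH --ntasks=1           # request a single task
#SBATCH --time=01:00:00      # wall-clock limit of one hour
hostname                     # replace with the real workload
```

Slurm queues the batch job until resources are free, so the script runs unattended once scheduled.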
-
Miscellaneous Topics
Modules: Learn how to use the module command to set up your Discover cluster environment with available compilers, interpreters, and other software packages. Cron on Discover: Automate running your tasks at specific time intervals on the Discover cluster using cron.
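To make the two topics concrete (the module name, path, and schedule below are illustrative placeholders, not Discover defaults), a typical module workflow and a crontab entry look like:

```shell
# Environment modules: list, load, and inspect packages.
#   module avail            # show available compilers and libraries
#   module load comp/gcc    # load one of them (name is illustrative)
#   module list             # show currently loaded modules

# Cron: edit your table with `crontab -e`, then add an entry such as
# the one below, which runs a script at 02:30 every day.
# min hour day month weekday  command
30 2 * * * /home/username/scripts/nightly_task.sh
```

The five leading fields of a crontab line are minute, hour, day of month, month, and day of week; `*` means "every".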
-
Monitoring Jobs on Discover Using Slurm
Query jobs using squeue: To see the status of your job, “squeue” queries the current job queue and lists its contents. Useful options include: -a, which lists all jobs; -t R, which lists all running jobs; -t PD, which lists all pending (non-running) jobs; -p datamove, which lists all jobs in the datamove partition; -j…
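Put together, the options above translate into invocations like the following (a sketch that requires a live Slurm cluster to run; the job ID is a placeholder):

```shell
squeue -u $USER          # only your own jobs
squeue -a                # all jobs in the queue
squeue -t R              # running jobs only
squeue -t PD             # pending (not yet running) jobs only
squeue -p datamove       # jobs in the datamove partition
squeue -j 123456         # a specific job ID (123456 is illustrative)
```

Options can be combined, e.g. `squeue -u $USER -t PD` to see only your own pending jobs.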
-
Discover CSS Access through Slurm
CSS read-only access on Discover is provided on a subset of Discover’s Slurm-managed compute nodes. These are limited to Scalable Unit 16, which includes two node types: 676 CPU-only nodes with the Intel “Cascade Lake” CPU architecture, and twelve nodes with AMD “Rome” CPUs combined with NVIDIA A100 GPUs, and Scalable Units 17 and 18,…
-
System Status
Discover Job Status: Due to changes in Discover’s reporting processes, system hardware, and resource allocation, the information on the jobmon page is no longer accurate, so we have removed it while we investigate a more scalable and flexible solution. In the interim, you may use the following command to get a rough idea of when…
-
Multiple Jobs per Node
Background: The number of CPUs (aka “cores”) per node among Discover’s processor architectures has continued to increase over time, with current Milan processors having 128 cores. Skylake and Cascade Lake nodes offer 40 and 46 cores per node, respectively. Many NCCS users have legitimate use cases for significantly lower core counts per job (particularly for…
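A low core-count job of this kind can be expressed with per-task CPU and memory requests, so the scheduler is free to pack several such jobs onto one node when node sharing is enabled (a sketch; the core count, memory size, time limit, and program name are all placeholders):

```shell
#!/bin/bash
# Small job intended to share a node: it asks for 4 of a node's cores
# (e.g. 4 of a Milan node's 128) rather than the whole node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4    # only a small slice of the node's cores
#SBATCH --mem=8G             # request memory explicitly when sharing a node
#SBATCH --time=00:30:00
srun ./my_program            # my_program is a placeholder workload
```

Requesting memory explicitly matters on shared nodes, since other users’ jobs consume the remainder of the node’s RAM.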
-
File System on Discover Cluster
The Discover cluster provides several different types of file systems: home, nobackup, and temporary/scratch. See the showquota documentation for information on how to monitor your storage usage.
File System | Type | Variable on Discover cluster | Default Quota | Backup Cycles
Home Directory | IBM GPFS | $HOME | 1 GB | Daily
Scratch | IBM GPFS | $NOBACKUP | 5 TB/300k inodes | No Backups
Scratch | local | $LOCAL_TMPDIR | node…
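The variables in the table are plain environment variables, so a script can refer to each file system without hard-coding paths (a sketch; `$NOBACKUP` and `$LOCAL_TMPDIR` are set on Discover, and `showquota` is the NCCS usage tool mentioned above):

```shell
# Check your storage usage with the NCCS tool described above:
#   showquota
# Each file system is exposed through an environment variable:
echo "home (backed up, small quota): $HOME"
echo "scratch (large, no backups):   $NOBACKUP"
echo "node-local scratch (in jobs):  $LOCAL_TMPDIR"
```

Large, regenerable data belongs under `$NOBACKUP`; `$LOCAL_TMPDIR` exists only on the compute node for the lifetime of a job.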
-
Discover GPU Partition
GPU Availability Within the Discover Cluster: Scalable Unit 16 (SCU16) makes GPU resources available within the NCCS Discover cluster’s gpu_a100 partition, which comprises 10 AMD nodes. Note: These nodes will be fully shared, with individual nodes running jobs belonging to multiple users. Each user is limited to a maximum of one node…
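A job targeting this partition names it explicitly and requests GPUs as a generic resource (a sketch; the partition name gpu_a100 comes from the text above, while the GPU count, task count, and time limit are placeholders):

```shell
#!/bin/bash
# Request one A100 on a shared node in the gpu_a100 partition.
#SBATCH --partition=gpu_a100
#SBATCH --gres=gpu:1         # one GPU; other users may hold the rest
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
nvidia-smi                   # confirm which GPU was allocated to the job
```

Because the nodes are shared, Slurm exposes only the allocated GPU(s) to the job, which is what `nvidia-smi` would reflect.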
-
Discover Quality of Service Details
Slurm’s Quality of Service (QoS) feature controls resource limits for every job in the Discover job queue. The QoSs available in the table below apply only to jobs submitted to the Slurm default partition. (For maximum adaptability of your job scripts, it is important that you not specify any partition if you wish to use the…
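Following that advice, a script selects a QoS but deliberately omits any `--partition` directive so the job lands in the default partition (a sketch; the QoS name "long", the time limit, and the program name are illustrative, not entries from the table):

```shell
#!/bin/bash
# Select a QoS; note there is intentionally no --partition line,
# so the job goes to the Slurm default partition.
#SBATCH --qos=long           # QoS name is a placeholder
#SBATCH --ntasks=1
#SBATCH --time=12:00:00      # must fit within the chosen QoS's limits
srun ./my_program            # placeholder workload
```

If the requested time or node count exceeds the QoS limits, Slurm rejects the job at submission rather than queuing it.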

