Using Slurm

General Slurm Commands

sbatch
Submit a batch script to Slurm.
squeue
View information about jobs in the Slurm scheduling queue.
sinfo
View information about Slurm nodes and partitions.
scancel
Cancel jobs or job steps that were scheduled using Slurm.
scontrol
View Slurm configuration and state (for example: scontrol show node phi001).

Rosetta Stone

rosetta.pdf
Shows the mapping between common PBS, Slurm, and LoadLeveler commands.
Slurm Documentation


HPC@Mines Specific Commands

Below are some Slurm-related commands created specifically for HPC@Mines.

slurmnodes
View information about available nodes
slurmjobs
View information about queues and running jobs
expands
View a full list of nodes used for a job

Each of these commands accepts a -help or -h option.

Example:
[joeuser@mio001 utility]$ printenv SLURM_NODELIST
compute[004-005]
[joeuser@mio001 utility]$  ./expands  $SLURM_NODELIST
compute004
compute004
compute004
compute004
compute005
compute005
compute005
compute005
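
If you need to work with a node list inside your own program, the compact bracket notation can also be expanded directly. The Python sketch below is an illustration only, not the implementation of expands: it lists each node once (the four lines per node in the output above are not reproduced), it assumes at most one bracket group per host expression, and the function names are hypothetical. From the command line, Slurm's own scontrol show hostnames $SLURM_NODELIST performs the same expansion.

```python
from __future__ import annotations

import re


def split_top_level(hostlist: str) -> list[str]:
    """Split on commas that are not inside [...] brackets."""
    items, depth, current = [], 0, ""
    for ch in hostlist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            items.append(current)
            current = ""
        else:
            current += ch
    if current:
        items.append(current)
    return items


def expand_hostlist(hostlist: str) -> list[str]:
    """Expand a compact Slurm hostlist such as 'compute[004-005]'.

    Assumes at most one bracket group per host expression; zero padding
    is taken from the width of each range's lower bound.
    """
    hosts = []
    for item in split_top_level(hostlist):
        match = re.fullmatch(r"([^\[]*)\[([^\]]+)\](.*)", item)
        if match is None:
            hosts.append(item)  # plain hostname, e.g. 'gpu004'
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                for n in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{n:0{len(lo)}d}{suffix}")
            else:
                hosts.append(f"{prefix}{part}{suffix}")
    return hosts
```

For example, expand_hostlist("compute[004-005]") returns ["compute004", "compute005"].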

Mio-Specific Slurm Commands

On Mio, Slurm schedules jobs on partitions. If you do not need your job to run on particular nodes, there is no need to specify a partition. If you want to run on your group’s nodes, or on the phi or GPU nodes, you must specify a partition by adding the -p option, followed by the partition name, to the submission command. For example, to run in the phi partition, and thus on the phi nodes, use:

sbatch -p phi <script>

where <script> is replaced by the filename of your run script, without the angle brackets.
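
The partition can also be requested inside the run script with an #SBATCH directive; if -p is also given on the sbatch command line, the command-line value takes precedence. The sketch below is a minimal Python run script under these assumptions: phi is the partition from the example above, the one-hour walltime is a placeholder, and job_summary is a hypothetical helper. Slurm reads #SBATCH lines only before the first executable statement, so they sit directly under the shebang; SLURM_JOB_PARTITION is set by Slurm in the job's environment.

```python
#!/usr/bin/env python3
# Minimal Python run script sketch. Slurm reads the #SBATCH lines below
# because they come before the first executable statement.
#SBATCH -p phi
#SBATCH --time=01:00:00
import os
import socket


def job_summary(env=None):
    """Report which partition and node this job landed on."""
    env = os.environ if env is None else env
    partition = env.get("SLURM_JOB_PARTITION", "(not running under Slurm)")
    return f"partition={partition} host={socket.gethostname()}"


if __name__ == "__main__":
    print(job_summary())
```

Submitting this file with sbatch <script>, with no -p on the command line, should print the partition and node the job ran on.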

As of 2022.02.21 13:30:22 MDT, the following partitions are defined:

[battelle@mio002 ~]$ sinfo -a -o "%15P %5D %N"

PARTITION    NODES  NODELIST
compute7     202    compute[000-027,030-033,036-041,043-045,049-052,054-059,061-076,078-081,083-103,105-108,110-111,114-172,174-184,186-219]
gpu          1      gpu004
ppc          2      ppc[001-002]
ppc-build    1      ppc001
hkazemi      1      compute080
anewman      1      compute055
asum         8      compute[051-052,094-099]
svyas        10     compute[040-041,043-045,068-072]
jbrune       6      compute[032-033,036-037,100-101]
lcarr        15     compute[024,062-067,073-076,128-129,172,196]
mganesh      13     compute[056-059,061,160-167]
mooney       2      compute[049-050]
nsulliva     6      compute[122-123,132-135]
pankavic     2      compute[026,124]
cpackard     1      compute125
jdzimmer     1      compute027
mlusk        6      compute[038-039,092-093,126-127]
ntilton      4      compute[130-131,202-203]
bgthomas     4      compute[168-171]
cciobanu     3      compute[054,090-091]
ireimani     1      compute102
bkappes      2      compute[174-175]
cdurfee      2      compute[176-177]
kleiderman   3      compute[025,178-179]
gualdron     12     compute[180-184,186-191,197]
hpc          10     compute[078-079,084-089,192-193]
cwp          24     compute[136-159]
rcp          12     compute[012-023]
geop         28     compute[000-011,083,103,105-108,110-111,114-121]
meberhar     2      compute[194-195]
prtaylor     1      gpu004
gbrennec     4      compute[198-201]
tucker       16     compute[204-219]

HPC@Mines Runtime Policies

Walltime Policy:

The standard maximum walltime is six days (144 hours), which corresponds to the batch directive:

#SBATCH --time=144:00:00
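
Slurm accepts several spellings of the same time limit, so 144:00:00 and 6-00:00:00 both request six days. The following Python sketch converts a --time value to seconds and checks it against the 144-hour limit. The helper names are hypothetical; the accepted formats are the ones listed for --time in the sbatch documentation, and anything else raises ValueError.

```python
MAX_WALLTIME_SECONDS = 144 * 3600  # six-day HPC@Mines limit


def slurm_time_to_seconds(spec: str) -> int:
    """Convert a Slurm --time value to seconds.

    Accepted forms: "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes", "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in spec:
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"unrecognized time format: {spec!r}")
        # Pad missing minutes/seconds with zeros.
        hours, minutes, seconds = (fields + [0, 0])[:3]
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:    # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:  # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:  # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"unrecognized time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def within_walltime_limit(spec: str) -> bool:
    """True if the requested time does not exceed 144 hours."""
    return slurm_time_to_seconds(spec) <= MAX_WALLTIME_SECONDS
```

Note that a bare number is interpreted as minutes, so --time=144 requests 144 minutes, not 144 hours.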

HPC@Mines strictly enforces this policy. If the computational problem you are solving seems to require more than 144 hours of walltime, HPC@Mines strongly encourages you to look for alternatives to simply extending the walltime. Two primary approaches are described below:

  1. Increase the amount of parallelism:
    Increasing the number of cores or nodes used by your job can often reduce the total walltime needed, provided your code scales well in parallel.
  2. Incorporate checkpointing:
    Checkpointing is the process of periodically saving the state of an execution so that it can be resumed later. This is extremely helpful in limiting the damage to your calculation if an unexpected crash or error occurs: if you save output periodically, or at a recurring point, and can restart the calculation from that saved output, you avoid the catastrophic loss of an entire days-long compute effort. Using checkpointing to intentionally stop and restart a calculation at a reasonably estimated point is a recommended way to stay within the six-day maximum walltime.
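
The checkpointing pattern described above can be sketched as follows, under explicit assumptions: the workload is a placeholder running sum, the whole state fits in a small JSON file with the hypothetical name checkpoint.json, and the job is resubmitted by hand after it stops. On each start the script loads the last saved state, so a job that hits the walltime limit or crashes resumes from its most recent checkpoint instead of from the beginning.

```python
#!/usr/bin/env python3
#SBATCH --time=144:00:00
# Checkpoint/restart sketch: the "work" below is a hypothetical stand-in.
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name


def load_state(path):
    """Resume from the last checkpoint, or start fresh if there is none."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write to a temporary file, then replace the old checkpoint in one
    rename, so an interrupted save leaves the previous checkpoint intact."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT_FILE, every=1000):
    """Advance the computation to n_steps, checkpointing every `every` steps."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # one unit of placeholder work
        state["step"] += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state


if __name__ == "__main__":
    print(run(1_000_000, every=100_000))
```

Submitting the same script again after a timeout resumes from the last saved step rather than from zero. How often to save is a trade-off between the cost of each save and how much work a failure can lose.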

For more focused help with the situations above, or with other computational aspects of your research, the HPC@Mines team offers personal, one-on-one assistance. Please submit a help request here to start the process. We also suggest consulting members of your group or peers who use similar codes or applications; drawing on their experience, they may be able to answer your questions more quickly.