How to use Slurm
What is Slurm?
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system used by various entities across UCSB. The College of Engineering (CoE) uses Slurm on its GPU cluster to provide additional computing options for those who need to run heavy calculations.
How do I get access to Slurm?
Students
You must have a CoE account and be enrolled in a course that has been enrolled in Slurm.
Professors or TAs
To enroll your course with Slurm, please send an email to help@engineering.ucsb.edu. Include the course name, the course number, and the software needs for the course.
Where can Slurm jobs be run?
Presently, Slurm jobs can be submitted from any of the CoE open access labs that run Linux.
Overview of how to submit jobs
Please note that these are generic instructions and the exact procedure will vary.
Slurm passes all of the environment variables from your login shell to the job, so if you need a compiler, Matlab, etc., run 'module load' for it before you submit your job.
Jobs are submitted with 'sbatch'. For example, suppose you have a file named 'test.slurm' that looks like one of the following (first a serial job, then a parallel job):
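As a rough sketch of the point above, the environment can be prepared in the login shell and the job submitted afterwards. The module name `matlab` and the variable `MY_INPUT_DIR` are hypothetical placeholders (run `module avail` to see the modules that actually exist on your system), and `test.slurm` stands in for your own batch file.

```shell
#!/bin/bash
# Sketch only: the module name "matlab" and the variable MY_INPUT_DIR are
# hypothetical placeholders, not names confirmed for the CoE systems.

# Load software in the login shell first, if the module command is available.
if command -v module >/dev/null 2>&1; then
    module load matlab || echo "module 'matlab' not found; check 'module avail'"
fi

# Exported variables are passed to the job with the rest of the environment.
export MY_INPUT_DIR="$HOME/data"

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch test.slurm
else
    echo "sbatch not found; run this on a machine with Slurm"
fi
```

Inside the job, `$MY_INPUT_DIR` and anything set up by the loaded module should then be visible without further setup, assuming the site keeps Slurm's default behaviour of exporting the full environment.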
#!/bin/bash -l
#Serial (1 core on one node) job...
#SBATCH --nodes=1 --ntasks-per-node=1
cd $SLURM_SUBMIT_DIR
time ./a.out >& logfile
And a simple parallel (MPI) example:
#!/bin/bash -l
# ask for 8 cores on two nodes (4 tasks per node)
#SBATCH --nodes=2 --ntasks-per-node=4
cd $SLURM_SUBMIT_DIR
/bin/hostname
mpirun -np $SLURM_NTASKS ./a.out
You run this job with 'sbatch test.slurm'
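If you want to reuse the job ID in later commands, one possible approach is to capture it at submission time. The sketch below relies on the `--parsable` option of `sbatch`; the helper name `submit_job` is made up for illustration.

```shell
#!/bin/bash
# Sketch: submit a batch script and keep the job ID for later commands.
# "test.slurm" is the example file name used in the text; submit_job is a
# hypothetical helper, not part of Slurm.

submit_job() {
    # --parsable makes sbatch print just the job ID (optionally followed by
    # ";cluster") instead of "Submitted batch job <id>".
    local out
    out=$(sbatch --parsable "$1") || return 1
    echo "${out%%;*}"   # strip any ";cluster" suffix
}

if command -v sbatch >/dev/null 2>&1; then
    jobid=$(submit_job test.slurm) && echo "Submitted job $jobid"
else
    echo "sbatch not found; run this on a machine with Slurm"
fi
```

The captured value can then be passed to squeue, scontrol, or scancel in place of JOBID.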
You can check on the status with squeue, e.g.
squeue -u $USER (shows only your jobs; plain 'squeue' shows every job on the system)
You can look at details with 'scontrol show job JOBID'
To kill a job, use 'scancel -i JOBID' (the -i flag asks for confirmation before cancelling).
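The status check can also be scripted. The following is a minimal sketch, assuming you only want to know when a job has left the queue; the function name `wait_for_job` and the 30-second default interval are arbitrary choices.

```shell
#!/bin/bash
# Sketch: poll squeue until a job no longer appears in the queue.
# wait_for_job is a hypothetical helper; the default interval is arbitrary.

wait_for_job() {
    local jobid=$1 interval=${2:-30}
    # -h drops the header line and -j limits output to one job, so empty
    # output (or an error for an unknown ID) means the job has left the queue.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job $jobid is no longer in the queue"
}

# Usage: wait_for_job JOBID [SECONDS]
```

A job leaving the queue does not say whether it succeeded, so check the job's output file afterwards.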
If you want an interactive node to test some things to make sure your job will run, you can do this with
srun -N 1 -p short --ntasks-per-node=2 --pty bash
You can use the srun command to run a command within the cluster, such as 'srun python3 pytorch.py', or 'srun ./pytorch.py' if the Python script is executable and names the Python interpreter in its first line.
Note: the file name in this example is arbitrary, as are the command or commands after the #SBATCH declarations. You most likely will not be using PyTorch for class. Many lab and research systems have multiple versions of Python installed, so you will have to be specific about which one you invoke.
The more important command is sbatch. You can create an sbatch file, set up your parameters in it, submit the job, and retrieve the output at a later point in time. Below is the example sbatch file used to test the system.
To view the file:
cat sbatch.batch
example contents of file:
Then run
Once you've submitted your job, sbatch will tell you the job ID. You can follow the job's progress with squeue, and by default the stdout will be logged to slurm-JOBID.out, where JOBID is the job ID.
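For illustration, an sbatch file and its submission might look roughly like the sketch below. This is not the file that was used to test the system: the job name `example` and the script `my_script.py` are placeholders, and the `--output` line simply spells out Slurm's default log name, in which `%j` expands to the job ID.

```shell
#!/bin/bash
# Sketch: write a hypothetical sbatch file and submit it.
# This is NOT the original test file; the job name and my_script.py are
# placeholders chosen for illustration.

cat > sbatch.batch <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=example
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=slurm-%j.out

cd "$SLURM_SUBMIT_DIR"
python3 my_script.py
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch sbatch.batch
else
    echo "sbatch not found; run this on a machine with Slurm"
fi
```

Once the job has run, the log can be read with something like 'cat slurm-JOBID.out'.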
I need a particular module or library
If you need a specific module or library, it may be possible to create a virtual environment or similar by following the knowledge article linked below. The commands will need to be put into an sbatch file and executed on the cluster. This step is required because the Slurm cluster runs Ubuntu while the Linux labs run either Fedora or CentOS, so an environment built on a lab machine may not work on the cluster.
If this doesn’t resolve the need, talk with your TA; either of you can then open a help request by emailing help@engineering.ucsb.edu. Please include the reason for the software need and any supporting material, such as screenshots or console output showing the error you are receiving.
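As one possible shape for such an sbatch file, the sketch below builds a Python virtual environment on the cluster itself. It is not taken from the knowledge article: the file name `venv.batch`, the directory `$HOME/venvs/myenv`, and the package `numpy` are placeholders, and it assumes that python3 with the venv module is available on the cluster and that your home directory is visible there.

```shell
#!/bin/bash
# Sketch: generate an sbatch file that builds a Python virtual environment on
# the cluster and installs a package into it. The file name venv.batch, the
# directory $HOME/venvs/myenv, and the package numpy are all placeholders.

cat > venv.batch <<'EOF'
#!/bin/bash -l
#SBATCH --nodes=1 --ntasks-per-node=1

# Create the environment on the cluster so it matches the cluster's OS.
python3 -m venv "$HOME/venvs/myenv"
source "$HOME/venvs/myenv/bin/activate"
pip install numpy
python3 -c 'import numpy; print(numpy.__version__)'
EOF

echo "Wrote venv.batch; submit it with: sbatch venv.batch"
```

Later jobs can reuse the environment by running the same 'source' line before calling python3.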