A lot of people have recently been asking me for help with the SLURM system at UCSD, so I decided to write a blog post to help people quickly learn how to use it and get their jobs running. I have a simple script, submitted with the sbatch command, that launches jobs as a job array. This lets you run many similar jobs in parallel on the cluster.
So say you have a bunch of scripts you want to run in parallel on a SLURM system. We can put all the script commands into a file (calls.txt), one command per line:
calls.txt:
script1.sh
script2.sh
script3.sh
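If the scripts follow a naming pattern, calls.txt does not have to be typed by hand. Here is a minimal sketch, assuming the scripts sit in one directory and match script*.sh. The function name make_calls and the "bash" prefix on each line are my own choices for illustration, not part of the setup above, and the demo works in a scratch directory so it does not touch real files.

```shell
#!/bin/bash
# Hypothetical helper: write one "bash <script>" command per line for every
# file matching script*.sh in the given directory.
make_calls() {
    local dir="$1" out="$2" f
    : > "$out"                        # start with an empty output file
    for f in "$dir"/script*.sh; do
        [ -e "$f" ] || continue       # skip if the pattern matched nothing
        printf 'bash %q\n' "$f" >> "$out"   # %q quotes unusual characters in paths
    done
}

# Demo in a scratch directory so nothing real is overwritten.
demo_dir=$(mktemp -d)
touch "$demo_dir/script1.sh" "$demo_dir/script2.sh" "$demo_dir/script3.sh"
make_calls "$demo_dir" "$demo_dir/calls.txt"
cat "$demo_dir/calls.txt"
```

Prefixing each line with "bash" means the scripts do not need to be executable or on your PATH; drop it if yours already are.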
Then we can create an sbatch script that defines a job array, where each task in the array runs one of these commands, so the commands execute in parallel. Here is the job array script for sbatch (runTest.sh):
runTest.sh:
#!/bin/bash
#SBATCH --account=xxxx
#SBATCH --partition=shared
#SBATCH --nodes=1                   # Ensure that all cores are on one machine
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH -t 0-00:10                  # Runtime in D-HH:MM
#SBATCH -o outputfile%a             # File to which STDOUT will be written (%a = array task index)
#SBATCH -e outputerr%a              # File to which STDERR will be written (%a = array task index)
#SBATCH --job-name=TestJob
#SBATCH --mail-type=ALL             # Type of email notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=xxxx@ucsd.edu   # Email to which notifications will be sent
#SBATCH --array=1-3%3               # Array tasks 1 to 3, at most 3 running at once

# Pick the line of calls.txt whose number matches this task's array index
linevar=$(sed "${SLURM_ARRAY_TASK_ID}q;d" calls.txt)
# Run that line as a shell command
eval "$linevar"
As one can see, the sbatch script has several parameters. Some of the useful ones that one should specify:
- account: The university PI account on SLURM (different for each lab)
- partition: shared vs. compute. “shared” is generally what people use for straightforward jobs; your job runs alongside other jobs on the same node. “compute” is a more reserved use of the cluster (whole nodes for your job) for work that needs more resources or time, but it also costs more.
- nodes: specifies how many nodes to use (each node has 24 cores)
- ntasks-per-node: how many tasks to launch on each node (in this case, I’m treating each job as 1 task).
- cpus-per-task: How many cores to use per task in the job array.
- -t: the runtime limit, i.e. how long any 1 job may run. Runtime is specified in D-HH:MM.
- -o specifies the output files and -e specifies the error files. The %a in the file name is replaced by the array index, so each job writes to its own files. You can just keep these as is for now (and change them once you try running it if need be).
- mail-type and mail-user: SLURM has the option of sending you email notifications. See the comments in the script.
- array=1-3%3 specifies that the job array contains three jobs (i.e. calls.txt has three lines) and %3 specifies that they should be run 3 at a time (i.e. all at once). If you have 50 jobs that you want to run 10 at a time, the setting is array=1-50%10.
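The array range has to match the number of lines in calls.txt, so it can be handy to compute it rather than edit the script each time. Here is a minimal sketch, assuming a non-empty command file. The helper name array_spec is hypothetical. Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the computed value can be passed as --array at submission time. The demo only prints the submission command so that it also runs off-cluster.

```shell
#!/bin/bash
# Hypothetical helper: build an --array value covering every line of a
# command file, capped at max_parallel tasks running at once.
array_spec() {
    local file="$1" max_parallel="$2" n
    n=$(grep -c '' "$file")           # number of lines = number of array tasks
    if [ "$n" -lt "$max_parallel" ]; then
        max_parallel=$n               # no point allowing more slots than tasks
    fi
    printf '1-%s%%%s\n' "$n" "$max_parallel"
}

# Demo with a scratch three-line command file.
demo_calls=$(mktemp)
printf '%s\n' script1.sh script2.sh script3.sh > "$demo_calls"
spec=$(array_spec "$demo_calls" 10)

# Print the submission command instead of running it.
echo "sbatch --array=$spec runTest.sh"
```

For the three-line file this prints sbatch --array=1-3%3 runTest.sh, the same range that is hard-coded in the script.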
All the job array script does is step through calls.txt and send each line to the cluster as its own job, in batches whose size is set by the “array” option.
Then on Comet/TSCC, all one has to do is:
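To see what each array job does without touching the cluster, the last two lines of the script can be tried locally. This is a sketch under assumptions: SLURM normally sets SLURM_ARRAY_TASK_ID for each job, and here it is set by hand; the command file holds made-up echo commands in a scratch file; and run_task is a hypothetical wrapper added only for the demo.

```shell
#!/bin/bash
# Local dry run of the line-selection logic; no SLURM needed.
demo_calls=$(mktemp)
printf '%s\n' 'echo running job A' 'echo running job B' 'echo running job C' > "$demo_calls"

# Same two steps as the end of runTest.sh, wrapped in a function for the demo.
run_task() {
    local SLURM_ARRAY_TASK_ID="$1" linevar
    # "Nq;d" deletes every line except line N, then quits: only line N is printed.
    linevar=$(sed "${SLURM_ARRAY_TASK_ID}q;d" "$demo_calls")
    eval "$linevar"                   # run that line as a shell command
}

# On the cluster these would be separate array jobs; here they run one by one.
for id in 1 2 3; do
    run_task "$id"
done
```

Each call picks exactly one line of the file, which is why the array range has to line up with the line numbers.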
sbatch runTest.sh
You can run the following command to check the status of your job array in the Comet or TSCC queue:
squeue -u yourusername
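Once the jobs have left the queue, a quick way to spot trouble is to look for error files that are not empty. Here is a minimal sketch, assuming the file names follow the -e outputerr%a pattern from runTest.sh. The helper name tasks_with_stderr is hypothetical, and the demo fakes three error files in a scratch directory. A non-empty error file does not always mean the job failed, since some programs print warnings or progress there, so treat the result as a list of jobs worth a closer look.

```shell
#!/bin/bash
# Hypothetical helper: print the array indices whose stderr file is non-empty.
tasks_with_stderr() {
    local dir="$1" first="$2" last="$3" i
    for ((i = first; i <= last; i++)); do
        if [ -s "$dir/outputerr$i" ]; then   # -s: file exists and has size > 0
            echo "$i"
        fi
    done
}

# Demo: fake the error files of a three-job array in a scratch directory.
demo_dir=$(mktemp -d)
: > "$demo_dir/outputerr1"
echo "script2.sh: something went wrong" > "$demo_dir/outputerr2"
: > "$demo_dir/outputerr3"

tasks_with_stderr "$demo_dir" 1 3
```

In the demo only index 2 is reported, so outputerr2 is the file to read first.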
Hopefully this makes use of SLURM a lot simpler for everyone. Cheers!