
Introduction to SLURM
scitas.epfl.ch
October 9, 2014
Bellatrix
- Frontend at bellatrix.epfl.ch
- 16 x 2.2 GHz cores per node
- 424 nodes with 32GB
- Infiniband QDR network
- The batch system is SLURM
Castor
- Frontend at castor.epfl.ch
- 16 x 2.6 GHz cores per node
- 50 nodes with 64GB
- 2 nodes with 256GB
- For sequential jobs (Matlab etc.)
- The batch system is SLURM
- RedHat 6.5
Deneb (October 2014)
- Frontend at deneb.epfl.ch
- 16 x 2.6 GHz cores per node
- 376 nodes with 64GB
- 8 nodes with 256GB
- 2 nodes with 512GB and 32 cores
- 16 nodes with 4 Nvidia K40 GPUs
- Infiniband QDR network
Storage
/home
- filesystem has per-user quotas
- will be backed up - for important things (source code, results and theses)

/scratch
- high performance "temporary" space
- is not backed up
- is organised by laboratory
Connection
Start the X server (automatic on a Mac)
Open a terminal
ssh -Y username@castor.epfl.ch
Try the following commands:
- id
- pwd
- quota
- ls /scratch/<group>/<username>
The batch system
Goal: to take a list of jobs and execute them when
appropriate resources become available
SCITAS uses SLURM on its clusters:
http://slurm.schedmd.com
The configuration depends on the purpose of the cluster
(serial vs parallel)
sbatch
The fundamental command is sbatch
sbatch submits jobs to the batch system
Suggested workflow:
- create a short job-script
- submit it to the batch system
sbatch - exercise
Copy the first two examples to your home directory
cp /scratch/examples/ex1.run .
cp /scratch/examples/ex2.run .
Open the file ex1.run with your editor of choice
ex1.run
#!/bin/bash
#SBATCH --workdir /scratch/<group>/<username>
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 1024

sleep 10
echo "hello from $(hostname)"
sleep 10
ex1.run
#SBATCH is a directive to the batch system

--nodes 1
    the number of nodes to use - on Castor this is limited to 1

--ntasks 1
    the number of tasks (in an MPI sense) to run per job

--cpus-per-task 1
    the number of cores per aforementioned task

--mem 4096
    the memory required per node in MB

--time 12:00:00    # 12 hours
--time 2-6         # two days and six hours
    the time required
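Putting the directives together, a complete header for a single-core, 4 GB, 12 hour job would look like the sketch below (the values are illustrative; adapt them to your own job):

#!/bin/bash
#SBATCH --workdir /scratch/<group>/<username>
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 4096
#SBATCH --time 12:00:00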
Running ex1.run
The job is assigned a default runtime of 15 minutes
$ sbatch ex1.run
Submitted batch job 439
$ cat /scratch/<group>/<username>/slurm-439.out
hello from c03
What went on?
sacct -j <JOB_ID>
sacct -l -j <JOB_ID>

Or more usefully:

Sjob <JOB_ID>
Cancelling jobs
To cancel a specific job:
scancel <JOB_ID>
To cancel all your jobs:
scancel -u <username>
ex2.run
#!/bin/bash
#SBATCH --workdir /scratch/<group>/<username>
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --mem 122880
#SBATCH --time 00:30:00

/scratch/examples/linpack/runme_1_45k
What’s going on?
squeue
squeue -u <username>
Squeue
Sjob <JOB_ID>
scontrol -d show job <JOB_ID>
sinfo
squeue and Squeue
squeue
- Job states: Pending, Resources, Priority, Running

squeue | grep <JOB_ID>
squeue -j <JOB_ID>
Squeue <JOB_ID>
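As an illustration, the default squeue output looks roughly like this (a sketch with made-up job IDs; the exact columns depend on the site configuration):

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    442    serial  ex2.run    jmenu PD       0:00      1 (Resources)
    443    serial  ex2.run    jmenu PD       0:00      1 (Priority)
    439    serial  ex1.run    jmenu  R       0:12      1 c03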
Sjob
$ Sjob <JOB_ID>
       JobID    JobName    Cluster    Account  Partition  Timelimit      User     Group
------------ ---------- ---------- ---------- ---------- ---------- --------- ---------
31006           ex1.run     castor  scitas-ge     serial   00:15:00     jmenu scitas-ge
31006.batch       batch     castor  scitas-ge

             Submit            Eligible               Start                 End
------------------- ------------------- ------------------- -------------------
2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08
2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08

   Elapsed ExitCode      State
---------- -------- ----------
  00:00:20      0:0  COMPLETED
  00:00:20      0:0  COMPLETED

     NCPUS   NTasks        NodeList    UserCPU  SystemCPU     AveCPU  MaxVMSize
---------- -------- --------------- ---------- ---------- ---------- ----------
         1                      c04   00:00:00  00:00.001
         1        1             c04   00:00:00  00:00.001   00:00:00    207016K
scontrol
$ scontrol -d show job <JOB_ID>
$ scontrol -d show job 400
JobId=400 Name=s1.job
UserId=user(123456) GroupId=group(654321)
Priority=111 Account=scitas-ge QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:03:39 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2014-03-06T09:45:27 EligibleTime=2014-03-06T09:45:27
StartTime=2014-03-06T09:45:27 EndTime=2014-03-06T10:00:27
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=serial AllocNode:Sid=castor:106310
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c03
BatchHost=c03
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=c03 CPU_IDs=0 Mem=1024
MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/<user>/jobs/s1.job
WorkDir=/scratch/<group>/<user>
Modules
Modules make your life easier
- module avail
- module show <take your pick>
- module load <take your pick>
- module list
- module purge
- module list
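A minimal example session (the module names are just the ones used elsewhere in this course; check module avail for what is installed on your cluster):

module avail                  # list the modules available on the cluster
module show intelmpi/4.1.3    # display what loading the module would change
module load intelmpi/4.1.3    # add the compilers and MPI to your environment
module list                   # show the currently loaded modules
module purge                  # unload everything
module list                   # now empty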
ex3.run - Mathematica
Copy the following files to your chosen directory:
cp /scratch/examples/ex3.run .
cp /scratch/examples/mathematica.in .
Submit ‘ex3.run’ to the batch system and see what happens...
ex3.run
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --nodes 1
#SBATCH --mem 4096
#SBATCH --time 00:05:00

echo STARTING AT `date`

module purge
module load mathematica/9.0.1

math < mathematica.in

echo FINISHED at `date`
Compiling ex4.* source files
Copy the following files to your chosen directory:
/scratch/examples/ex4_README.txt
/scratch/examples/ex4.c
/scratch/examples/ex4.cxx
/scratch/examples/ex4.f90
/scratch/examples/ex4.run
Then compile them with:
module load intelmpi/4.1.3
mpiicc -o ex4_c ex4.c
mpiicpc -o ex4_cxx ex4.cxx
mpiifort -o ex4_f90 ex4.f90
The 3 methods to get interactive access 1/3
To schedule an allocation, use salloc with exactly the same resource options as sbatch.
You then arrive at a new prompt that is still on the submission node, but with srun you can access the allocated resources:
eroche@castor:hello > salloc -N 1 -n 2
salloc: Granted job allocation 1234
bash-4.1$ hostname
castor
bash-4.1$ srun hostname
c03
c03
The 3 methods to get interactive access 2/3
To get a prompt on the machine one needs to use the “--pty”
option with “srun” and then “bash -i” (or “tcsh -i”) to get
the shell:
eroche@castor > salloc -N 1 -n 1
salloc: Granted job allocation 1235
eroche@castor > srun --pty bash -i
bash-4.1$ hostname
c03
The 3 methods to get interactive access 3/3
This is the least elegant method, but it is the one that allows you to run
X11 applications:
eroche@bellatrix > salloc -n 1 -c 16 -N 1
salloc: Granted job allocation 1236
bash-4.1$ srun hostname
c04
bash-4.1$ ssh -Y c04
eroche@c04 >
Dynamic libs used in an application
“ldd” displays the libraries an executable file depends on:
jmenu@castor:~/COURS > ldd ex4_f90
        linux-vdso.so.1 => (0x00007fff4b905000)
        libmpigf.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpigf.so.4 (0x00007f556cf88000)
        libmpi.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpi.so.4 (0x00007f556c91c000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003807e00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003808a00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003808200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003807600000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003807a00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000380c600000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003807200000)
ex4.run
#!/bin/bash
...
module purge
module load intelmpi/4.1.3
module list
echo

LAUNCH_DIR=/scratch/scitas-ge/jmenu
EXECUTABLE="./ex4_f90"

echo "--> LAUNCH_DIR = ${LAUNCH_DIR}"
echo "--> EXECUTABLE = ${EXECUTABLE}"
echo
echo "--> ${EXECUTABLE} depends on the following dynamic libraries:"
ldd ${EXECUTABLE}
echo

cd ${LAUNCH_DIR}
srun ${EXECUTABLE}
...
The debug QoS
In order to have priority access for debugging
sbatch --qos debug ex1.run
Limits on Castor:
- 30 minutes walltime
- 1 job per user
- 16 cores between all users

To display the available QoS's:

sacctmgr show qos
cgroups (Castor)
General:
- cgroups ("control groups") is a Linux kernel feature to limit, account for, and isolate the resource usage (CPU, memory, disk I/O, etc.) of process groups

SLURM:
- Linux cgroups apply constraints to the CPUs and memory that can be used by a job
- They are automatically generated from the resource requests given to SLURM
- They are automatically destroyed at the end of the job, releasing all resources used

Even if there is physical memory available, a task will be killed if it tries to exceed the limits of its cgroup!
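As a sketch of what this means in practice, the following job requests 1024 MB and then deliberately tries to allocate about 2 GB (this assumes python is available on the compute node; any memory-hungry program would do). The allocation fails or the process is killed, even if the node itself has plenty of free memory:

#!/bin/bash
#SBATCH --workdir /scratch/<group>/<username>
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 1024
#SBATCH --time 00:05:00

# try to allocate ~2 GB inside a 1 GB cgroup: the kernel refuses or kills the task
python -c 'x = bytearray(2 * 1024**3)'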
System process view
Two tasks running on the same node with "ps auxf":

root  177873  slurmstepd: [1072]
user  177877   \_ /bin/bash /var/spool/slurmd/job01072/slurm_script
user  177908       \_ sleep 10

root  177890  slurmstepd: [1073]
user  177894   \_ /bin/bash /var/spool/slurmd/job01073/slurm_script
user  177970       \_ sleep 10

Check memory, thread and core usage with "htop"
Fair share 1/3
The scheduler is configured to give all groups a share of the
computing power
Within each group the members have an equal share by default:
jmenu@castor:~ > sacctmgr show association where account=lacal format=Account,Cluster,User,GrpNodes,QOS,DefaultQOS,Share tree

   Account    Cluster       User GrpNodes            QOS   Def QOS     Share
---------- ---------- ---------- -------- -------------- --------- ---------
     lacal     castor                             normal                   1
     lacal     castor   aabecker           debug,normal     normal         1
     lacal     castor   kleinjun           debug,normal     normal         1
     lacal     castor   knikitin           debug,normal     normal         1

Priority is based on recent usage:
- this is forgotten about with time (half life)
- fair share comes into play when the resources are heavily used
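Roughly speaking (see the multifactor priority documentation for the exact scheme), recorded usage decays exponentially with the configured half-life, something like

    effective_usage(t) ≈ raw_usage * 0.5^(t / half_life)

so after two half-lives only a quarter of the old usage still counts against a group's priority.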
Fair share 2/3
Job priority is a weighted sum of various factors:
jmenu@castor:~ > sprio -w
        JOBID   PRIORITY        AGE  FAIRSHARE        QOS
        Weights              1000      10000     100000

jmenu@bellatrix:~ > sprio -w
        JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
        Weights              1000      10000        100     100000

To compare jobs' priorities:

jmenu@castor:~ > sprio -j80833,77613
        JOBID   PRIORITY        AGE  FAIRSHARE        QOS
        77613        145        146          0          0
        80833       9204         93       9111          0
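Reading the output above: a job's priority is (up to rounding) the sum of its weighted factor contributions. For job 80833, 93 (age) + 9111 (fair share) + 0 (QOS) gives the reported priority of 9204.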
Fair share 3/3
FairShare values range from 0.0 to 1.0:

  Value    Meaning
  ≈ 0.0    you used much more resources than you were granted
    0.5    you got what you paid for
  ≈ 1.0    you used nearly no resources

jmenu@bellatrix:~ > sshare -a -A lacal
Accounts requested:
        : lacal

   Account       User  Raw Shares  Norm Shares    Raw Usage  Effectv Usage  FairShare
---------- ---------- ----------- ------------ ------------ -------------- ----------
     lacal                    666     0.097869   1357691548       0.256328   0.162771
     lacal   boissaye           1     0.016312            0       0.042721   0.162771
     lacal     janson           1     0.016312            0       0.042721   0.162771
     lacal    jetchev           1     0.016312            0       0.042721   0.162771
     lacal   kleinjun           1     0.016312   1357691548       0.256328   0.000019
     lacal   pbottine           1     0.016312            0       0.042721   0.162771
     lacal    saltini           1     0.016312            0       0.042721   0.162771

More information at:
http://schedmd.com/slurmdocs/priority_multifactor.html
Helping yourself
man pages are your friend!
- man sbatch
- man sacct
- man gcc

module load intel/14.0.1
- man ifort
Getting help
If you still have problems then send a message to:
1234@epfl.ch
Please start the subject with HPC
for automatic routing to the HPC team
Please give as much information as possible including:
- the jobid
- the directory location and name of the submission script
- where the "slurm-*.out" file is to be found
- how the "sbatch" command was used to submit it
- the output from the "env" and "module list" commands
Appendix
Change your shell at:
https://dinfo.epfl.ch/cgi-bin/accountprefs
SCITAS web site:
http://scitas.epfl.ch